How it works

Generated code

The @jit macro turns the outer function into a @generated one, so that we can recompile it with reoptimized source code at will.

The optimized code looks like this:

    x = get_some_x(i)
    if x isa FrequentType1
        calc_with_x(x) # Fast type-stable route
    elseif x isa FrequentType2
        calc_with_x(x) # Fast type-stable route
    # ... more type-stable branches for other frequent types ...
    else
        calc_with_x(x) # Fallback to the dynamically dispatched call
    end

Iterated staging

Catwalk.jl uses a technique I call "iterated staging": an outer loop that repeatedly recompiles parts of the loop body.

Recompilation happens by encoding the current "stage" - the list of concrete types to speed up and the profiler configuration - into a type and passing an instance of that type, called the "JIT context", to the inner function in the loop body.

Only the type of the JIT context drives the compilation process, as it is the only data available to the @generated inner function.
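The mechanism can be sketched in a few lines (a hypothetical, minimal example - not Catwalk's actual API): the stage is encoded in a type parameter, and changing it forces Julia to recompile the @generated inner function for the next batch.

```julia
# Minimal sketch of iterated staging (illustrative names, simplified from
# Catwalk's design). The "stage" lives in a type parameter; a new stage type
# per batch forces recompilation of the @generated inner kernel.
struct Stage{N} end

@generated function kernel(::Stage{N}, x) where N
    # This body runs at compile time, once per distinct Stage{N} type,
    # and returns the code to be compiled for that stage.
    return :(x + $N)
end

function outer_loop()
    results = Int[]
    for batch in 1:3
        ctx = Stage{batch}()      # new type each batch => recompilation
        for i in 1:2
            push!(results, kernel(ctx, i))
        end
    end
    return results
end
```

Calling `outer_loop()` exercises three compiled variants of `kernel`, one per stage.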

JIT context basics

This is what the context looks like:

struct CallCtx{TProfiler, TFixtypes}
    profiler::TProfiler
end

Where TFixtypes encodes everything needed to generate the dispatch code, and TProfiler describes the profiler configuration used in the current batch.

TFixtypes is built with recursive type parameters that encode the stabilized types as a linked list. For example, to speed up FrequentType1 and FrequentType2, the optimizer generates a concrete type by recursively parametrizing the TypeListItem generic type:

struct TypeListItem{TThis, TNext} end
struct EmptyTypeList end

Catwalk.encode(FrequentType1, FrequentType2)

# Catwalk.TypeListItem{FrequentType1, Catwalk.TypeListItem{FrequentType2, Catwalk.EmptyTypeList}}

Passing this "type list" as part of the JIT context allows the @generated function to generate the type-stable routes.
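To see how the two pieces fit together, here is a hedged sketch of both sides: a recursive `encode` that builds the type list, and a @generated dispatcher that walks it to emit the if/elseif chain. The names `encode`, `buildbranches`, `dispatch`, and `calc` are illustrative - Catwalk's actual implementation differs in detail.

```julia
struct TypeListItem{TThis, TNext} end
struct EmptyTypeList end

# Build the linked-list type recursively (hypothetical re-implementation).
encode() = EmptyTypeList
encode(t::Type, rest::Type...) = TypeListItem{t, encode(rest...)}

# Walk the type list at code-generation time to emit the branch chain.
function buildbranches(::Type{TypeListItem{T, TNext}}, f, xname) where {T, TNext}
    return :(if $xname isa $T
                 $f($xname)                       # type-stable route
             else
                 $(buildbranches(TNext, f, xname))
             end)
end
buildbranches(::Type{EmptyTypeList}, f, xname) = :($f($xname))  # dynamic fallback

@generated function dispatch(::Type{TL}, x) where TL
    return buildbranches(TL, :calc, :x)
end

# Example target function with two "frequent" methods and a fallback.
calc(x::Int) = x + 1
calc(x::Float64) = x * 2
calc(x) = 0
```

With this, `dispatch(encode(Int, Float64), 3)` takes the first type-stable branch, while any other argument type falls through to the dynamically dispatched `calc(x)`.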

Profilers

The other part of the context is the profiler. Two profilers are implemented at the time of writing:

struct NoProfiler <: Profiler end

struct FullProfiler <: Profiler
    typefreqs::DataTypeFrequencies
end

The FullProfiler collects statistics from every call. Logging a call is faster than a dynamic dispatch, but running the profiler in every batch would still eat most of the gains, so it is used sparsely: with 1% probability per batch by default (it is always active during the first two batches).
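The per-batch selection could look roughly like this (a hedged sketch: `select_profiler` and `samplerate` are assumed names, not Catwalk's actual internals):

```julia
# Hypothetical sketch of per-batch profiler selection. The struct names
# mirror the ones shown above; the selection logic is an assumption.
abstract type Profiler end
struct NoProfiler <: Profiler end
struct FullProfiler <: Profiler
    typefreqs::Dict{DataType, Int}
end

function select_profiler(batchidx; samplerate = 0.01)
    # Always profile the first two batches to bootstrap the statistics,
    # then only with `samplerate` probability.
    if batchidx <= 2 || rand() < samplerate
        return FullProfiler(Dict{DataType, Int}())
    end
    return NoProfiler()
end
```

Because the chosen profiler becomes part of the JIT context's type, the profiling code itself is compiled in or out of the inner function - a NoProfiler batch pays nothing for profiling.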

Explorer and the full JIT context

The last missing part is the explorer, which automatically connects the JIT compiler with the @jit-ed functions that run under its supervision.

This connection is not trivial, because the @jit macro is applied to a single function somewhere "inside" the batch, potentially in a different package than the outer loop. It is possible to configure the optimizer manually, but the Explorer can automatically find the @jit-ed call sites that are called in the batch.

As a single JIT compiler can handle multiple call sites, the jitctx in reality is not a single CallCtx as described earlier, but a NamedTuple of them, plus an explorer:

struct OptimizerCtx{TCallCtxs, TExplorer}
    callctxs::TCallCtxs # NamedTuple of `CallCtx`s
    explorer::TExplorer
end

The explorer holds its id in its type, because exploration happens during compilation, when only its type is available.

struct BasicExplorer{TOptimizerId} <: Explorer end

Here Catwalk - just like many other meta-heavy Julia packages - violates the rule that a @generated function must not access mutable global state: the explorer logs the call site to a global dict, keyed with its id, from where the JIT compiler can read it out during the next batch.
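The trick can be sketched as follows (hypothetical names throughout - `CALLSITES`, `register!`, and the dict layout are illustrative, not Catwalk's internals):

```julia
# Hypothetical sketch of call-site exploration from a @generated function.
const CALLSITES = Dict{Int, Set{Symbol}}()   # optimizer id => seen call sites

struct BasicExplorer{TOptimizerId} end

@generated function register!(::BasicExplorer{Id}, ::Val{callsite}) where {Id, callsite}
    # This body runs at compile time. Mutating global state here is the
    # rule violation discussed above: it is the side channel through which
    # the optimizer learns about the call site.
    push!(get!(CALLSITES, Id, Set{Symbol}()), callsite)
    return :(nothing)    # the generated runtime code does nothing
end

register!(BasicExplorer{1}(), Val(:calc_with_x))
```

After the first call compiles, `CALLSITES[1]` contains `:calc_with_x`, even though the function's runtime body is a no-op.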

It seems impossible to send back information from the compilation process without breaking this rule, and pushing the exploration to the tight loop is not feasible.

I think that this violation is acceptable (note that RuntimeGeneratedFunctions.jl does the same), but it is possible to turn off the Explorer, as described in the tuning guide.