Architecture

Technical overview of FormulaCompiler.jl’s unified, zero‑allocation compilation and execution model.

Design Philosophy

Move expensive work to compile time; keep runtime simple and type‑stable.

  • Compile‑time specialization: All positions and operations are baked into types
  • Type stability: No dynamic dispatch in hot paths
  • Memory reuse: Preallocate once; reuse across evaluations
  • Position mapping: Address everything by compile‑time positions, not names

System Overview

Diagram

Unified Compilation Pipeline

The compilation process transforms statistical formulas into optimized evaluators:

Diagram

Compilation produces a single position‑mapped evaluator (UnifiedCompiled) in four steps:

  1. Decompose terms → operations
  • Parse the schema‑applied formula and convert it into primitive ops:
    • LoadOp, ConstantOp, UnaryOp, BinaryOp, ContrastOp, CopyOp
  2. Allocate positions
  • Assign scratch positions for intermediates and indices for final outputs
  • Cache the term → position mapping so computed intermediates are reused
  3. Specialize operation types
  • Embed positions and keys as type parameters (e.g., LoadOp{:x, 3})
  • Convert the op vector to a tuple for type‑stable execution
  4. Package into UnifiedCompiled
  • Store the op tuple and a preallocated scratch buffer sized to the maximum position
  • Provide a callable that writes directly into a user‑supplied output vector
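
The four steps can be illustrated with a toy compiler. This is only a sketch under stated assumptions, not the package's implementation: the `Demo*` types, `demo_compile`, and the term encoding (a number for the intercept, a `Symbol` for a column, a 2-tuple of `Symbol`s for an interaction) are all hypothetical names invented for this example.

```julia
# Hypothetical op types: column names and positions live in type parameters.
struct DemoConst{Val, Out} end
struct DemoLoad{Col, Out} end
struct DemoMul{In1, In2, Out} end
struct DemoCopy{In, OutIdx} end

# Toy version of steps 1-3: decompose terms, allocate positions, build a tuple.
function demo_compile(terms)
    ops = Any[]                    # op vector, converted to a tuple at the end
    cache = Dict{Any, Int}()       # term => scratch position (reuse intermediates)
    counter = Ref(0)               # highest scratch position handed out so far

    function position!(term)
        haskey(cache, term) && return cache[term]   # already computed: reuse
        if term isa Tuple                            # interaction of two columns
            a = position!(term[1])
            b = position!(term[2])
            pos = (counter[] += 1)
            push!(ops, DemoMul{a, b, pos}())
        elseif term isa Symbol                       # plain column
            pos = (counter[] += 1)
            push!(ops, DemoLoad{term, pos}())
        else                                         # numeric constant (intercept)
            pos = (counter[] += 1)
            push!(ops, DemoConst{Float64(term), pos}())
        end
        cache[term] = pos
        return pos
    end

    for (i, term) in enumerate(terms)
        push!(ops, DemoCopy{position!(term), i}())   # final output index i
    end
    return Tuple(ops), counter[]                     # typed tuple + scratch size
end

ops, scratch_size = demo_compile((1, :x, :z, (:x, :z)))
```

In this sketch the interaction term reuses the positions already assigned to `:x` and `:z` instead of loading the columns again, which is the role the term → position cache plays in step 2.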

Result: compiled(row_vec, data, row) runs in tens of nanoseconds with 0 allocations after warmup (typical; see Benchmark Protocol).

Operation Set

Primitive operations form an acyclic execution plan:

  • LoadOp{Column, OutPos}: data[column][row] → scratch[OutPos]
  • ConstantOp{Value, OutPos}: literal → scratch[OutPos]
  • UnaryOp{Func, InPos, OutPos}: f(scratch[InPos]) → scratch[OutPos]
  • BinaryOp{Func, In1, In2, OutPos}: f(scratch[In1], scratch[In2]) → scratch[OutPos]
  • ContrastOp{Column, OutPositions}: categorical expansion → scratch[each(OutPositions)]
  • CopyOp{InPos, OutIdx}: scratch[InPos] → output[OutIdx]

Operations are ordered so that every input position is written before it is read.
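
To make the data flow concrete, here is a minimal sketch of how such ops could be executed. The `Demo*` types and `demo_execute!` are hypothetical stand-ins that mirror the signatures listed above; they are not the package's definitions, and `ContrastOp` is omitted for brevity.

```julia
# Hypothetical op types mirroring the list above (not the package's own).
struct DemoLoadOp{Column, OutPos} end
struct DemoConstantOp{Value, OutPos} end
struct DemoUnaryOp{Func, InPos, OutPos} end
struct DemoBinaryOp{Func, In1, In2, OutPos} end
struct DemoCopyOp{InPos, OutIdx} end

# One method per op; every index is read from the type parameters.
demo_execute!(::DemoLoadOp{C, O}, scratch, output, data, row) where {C, O} =
    (scratch[O] = getfield(data, C)[row])
demo_execute!(::DemoConstantOp{V, O}, scratch, output, data, row) where {V, O} =
    (scratch[O] = V)
demo_execute!(::DemoUnaryOp{F, I, O}, scratch, output, data, row) where {F, I, O} =
    (scratch[O] = F(scratch[I]))
demo_execute!(::DemoBinaryOp{F, A, B, O}, scratch, output, data, row) where {F, A, B, O} =
    (scratch[O] = F(scratch[A], scratch[B]))
demo_execute!(::DemoCopyOp{I, O}, scratch, output, data, row) where {I, O} =
    (output[O] = scratch[I])

# Plan for the outputs [1, log(x), x * z]; inputs always precede their uses.
plan = (
    DemoConstantOp{1.0, 1}(),
    DemoLoadOp{:x, 2}(),
    DemoLoadOp{:z, 3}(),
    DemoUnaryOp{log, 2, 4}(),
    DemoBinaryOp{(*), 2, 3, 5}(),
    DemoCopyOp{1, 1}(),
    DemoCopyOp{4, 2}(),
    DemoCopyOp{5, 3}(),
)

data = (x = [1.0, 2.0], z = [3.0, 4.0])   # column table as a NamedTuple
scratch = Vector{Float64}(undef, 5)
output = Vector{Float64}(undef, 3)
for op in plan
    demo_execute!(op, scratch, output, data, 2)   # evaluate row 2
end
```

The plain `for` loop is only there to show the semantics; it is not a type-stable way to walk a heterogeneous tuple.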

Zero‑Allocation Execution

Runtime evaluation is pure array indexing with concrete types:

  • Scratch: Vector{Float64}(undef, ScratchSize) allocated once inside UnifiedCompiled
  • Output: Provided by the caller; its length must equal length(compiled)
  • Execution: Iterate the typed op tuple and update scratch/output in place

Path to zero allocations:

  • Preallocate scratch once per compiled formula
  • No temporary arrays or dynamic dispatch during execution
  • Column access uses direct field lookup from a NamedTuple (column table)
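
One common Julia idiom for walking a heterogeneous op tuple without dynamic dispatch is recursion over `first`/`Base.tail`. The sketch below assumes that idiom; the source does not state which mechanism the package uses, and `DemoCompiled`, `demo_step!`, and `demo_run!` are hypothetical names.

```julia
struct DemoLoad{Col, Out} end
struct DemoMul{In1, In2, Out} end
struct DemoCopy{In, OutIdx} end

@inline demo_step!(::DemoLoad{C, O}, s, out, data, row) where {C, O} =
    (s[O] = getfield(data, C)[row]; nothing)      # direct field lookup by name
@inline demo_step!(::DemoMul{A, B, O}, s, out, data, row) where {A, B, O} =
    (s[O] = s[A] * s[B]; nothing)
@inline demo_step!(::DemoCopy{I, O}, s, out, data, row) where {I, O} =
    (out[O] = s[I]; nothing)

# Recursive traversal: each call sees a concrete tuple type.
@inline demo_run!(::Tuple{}, s, out, data, row) = nothing
@inline function demo_run!(ops::Tuple, s, out, data, row)
    demo_step!(first(ops), s, out, data, row)
    return demo_run!(Base.tail(ops), s, out, data, row)
end

# Toy evaluator: typed op tuple plus a scratch buffer allocated once.
struct DemoCompiled{Ops<:Tuple}
    ops::Ops
    scratch::Vector{Float64}
    n_out::Int
end
Base.length(c::DemoCompiled) = c.n_out

function (c::DemoCompiled)(out::AbstractVector{Float64}, data::NamedTuple, row::Int)
    length(out) == length(c) || throw(DimensionMismatch("output has the wrong length"))
    demo_run!(c.ops, c.scratch, out, data, row)
    return out
end

ops = (DemoLoad{:x, 1}(), DemoLoad{:z, 2}(), DemoMul{1, 2, 3}(),
       DemoCopy{1, 1}(), DemoCopy{2, 2}(), DemoCopy{3, 3}())
compiled = DemoCompiled(ops, Vector{Float64}(undef, 3), 3)

data = (x = [1.0, 2.0], z = [3.0, 4.0])
row_vec = Vector{Float64}(undef, length(compiled))
compiled(row_vec, data, 2)   # fills row_vec with x, z, and x * z for row 2
```

Because the scratch buffer lives inside the evaluator and the output is supplied by the caller, repeated calls in this sketch only index into existing arrays.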

For complex formulas (>10 operations) and derivative computation, the system uses targeted metaprogramming to maintain zero-allocation performance. See Metaprogramming for implementation details.

CounterfactualVector System

Unified Row-Wise Architecture: Population analysis = individual analysis + averaging

  • CounterfactualVector hierarchy: Type-stable single-row perturbations for all data types

    • NumericCounterfactualVector{T}: Numeric variables with automatic type conversion
    • BoolCounterfactualVector: Boolean variables
    • CategoricalCounterfactualVector{T,R}: Categorical variables with contrast support
    • CategoricalMixtureCounterfactualVector{T}: Categorical mixtures for profile effects
    • TypedCounterfactualVector{T,V}: Generic fallback for other types
  • Memory Efficiency: O(1) memory usage vs O(n) for data copying approaches

  • Performance: Simple loops achieve 10-100x speedup over data copying

  • Type Stability: Concrete types throughout, no Any types on hot paths
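
The O(1) memory claim follows from wrapping a column rather than copying it. Here is a minimal sketch of that idea, assuming a wrapper that overrides a single row; `DemoCounterfactualVector` and its fields are hypothetical and much simpler than the package's CounterfactualVector types.

```julia
# Hypothetical wrapper: behaves like the base column except at one row.
mutable struct DemoCounterfactualVector{T} <: AbstractVector{T}
    base::Vector{T}      # original column, shared rather than copied
    row::Int             # row whose value is overridden
    replacement::T       # counterfactual value seen at that row
end

Base.size(v::DemoCounterfactualVector) = size(v.base)
Base.IndexStyle(::Type{<:DemoCounterfactualVector}) = IndexLinear()
Base.getindex(v::DemoCounterfactualVector, i::Int) =
    i == v.row ? v.replacement : v.base[i]

x = [1.0, 2.0, 3.0]
data = (x = x, z = [10.0, 20.0, 30.0])

cf = DemoCounterfactualVector(x, 2, 99.0)
data_cf = merge(data, (x = cf,))   # same table, with x swapped for the wrapper

# Retarget the same wrapper to another row without allocating a new column.
cf.row = 3
cf.replacement = 0.0
```

Any evaluator that reads `data_cf.x[row]` sees the counterfactual value, while the original vector `x` is left unchanged.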

Population Analysis Pattern:

using Statistics  # provides mean

# Simple loop pattern for efficient population analysis.
# n_rows and compute_individual_effect stand for your row count and row-wise effect function.
population_effects = Vector{Float64}(undef, n_rows)
for row in 1:n_rows
    # Use existing row-wise functions with CounterfactualVector perturbations
    population_effects[row] = compute_individual_effect(row)
end
population_ame = mean(population_effects)
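
As one concrete instance of the loop pattern, the sketch below averages a per-row counterfactual contrast for a toy quadratic predictor. The coefficients, `demo_predict`, `demo_individual_effect`, and `demo_population_effect` are all made up for illustration and do not correspond to package functions.

```julia
using Statistics  # provides mean

# Hypothetical linear predictor: η = β0 + β1 * x + β2 * x^2
demo_predict(β, x) = β[1] + β[2] * x + β[3] * x^2

# Individual effect: change in prediction when x is raised by one unit for this row.
demo_individual_effect(β, x, row) = demo_predict(β, x[row] + 1) - demo_predict(β, x[row])

# Population effect = individual effects + averaging.
function demo_population_effect(β, x)
    effects = Vector{Float64}(undef, length(x))
    for row in eachindex(x)
        effects[row] = demo_individual_effect(β, x, row)
    end
    return mean(effects)
end

β = (0.5, 2.0, 0.25)
x = [1.0, 2.0, 3.0]
population_effect = demo_population_effect(β, x)
```

No copy of the data is made per row; each iteration only evaluates the predictor at the observed and the perturbed value.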

Integration

  • GLM.jl: Works with all linear and generalized linear models
  • MixedModels.jl: Automatically extracts fixed‑effects formula via fixed_effects_form
  • StandardizedPredictors.jl: ZScore standardization supported at compile time

Extensibility

Add an operation or transformation by composing existing ops during decomposition. To support a new model type, add a dispatch method that extracts the model's StatsModels formula and delegates to the unified compiler.
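
As an illustration of composing existing ops, a squaring transformation needs no new op type if it is decomposed into a multiplication of a position with itself. This is a sketch with hypothetical names (`DemoBinaryOp`, `demo_run!`, `demo_square`); the package's actual decomposition hooks are not shown in this document.

```julia
# Hypothetical binary op with function and positions as type parameters.
struct DemoBinaryOp{Func, In1, In2, OutPos} end

demo_run!(::DemoBinaryOp{F, A, B, O}, scratch) where {F, A, B, O} =
    (scratch[O] = F(scratch[A], scratch[B]); scratch)

# Hypothetical decomposition rule: x^2 becomes scratch[in] * scratch[in].
demo_square(in_pos::Int, out_pos::Int) = DemoBinaryOp{(*), in_pos, in_pos, out_pos}()

scratch = [3.0, 0.0]                  # position 1 holds x, position 2 is free
demo_run!(demo_square(1, 2), scratch)
```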

Performance Monitoring

Check allocations and timings with BenchmarkTools:

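using BenchmarkTools                      # provides @benchmark
compiled(row_vec, data, 1)                # warm up so compilation is not measured
@allocated compiled(row_vec, data, 1)     # expect 0 after warmup
@benchmark $compiled($row_vec, $data, 1)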
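
Allocation counts taken at global scope can include overhead from untyped global variables, so a common approach is to measure inside a function after a warm-up call. The sketch below uses only Base; `demo_kernel!` is a hypothetical stand-in for a compiled evaluator and `demo_allocations` is a helper invented for this example.

```julia
# Hypothetical stand-in for a compiled evaluator: writes one row in place.
function demo_kernel!(out::Vector{Float64}, x::Vector{Float64}, row::Int)
    out[1] = 1.0
    out[2] = x[row]
    out[3] = x[row]^2
    return out
end

# Measure inside a function: warm up first, then count bytes for one call.
function demo_allocations(f::F, out, x, row) where {F}
    f(out, x, row)                      # warm-up call triggers compilation
    return @allocated f(out, x, row)    # bytes allocated by the second call
end

out = Vector{Float64}(undef, 3)
x = [1.0, 2.0, 3.0]
bytes = demo_allocations(demo_kernel!, out, x, 2)
```

For an in-place kernel like this one, `bytes` would be expected to be 0, but it is worth confirming on your own Julia version and machine.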

Figure generation

  • During docs builds, diagrams under docs/src/assets/*.mmd are automatically regenerated to SVG if the Mermaid CLI (mmdc) is available (see docs/make.jl).

Unified Row-Wise Architecture

Design Philosophy: Population analysis = individual analysis + averaging

Key Features:

  • Population system eliminated: Population-level results come from row-wise evaluation plus averaging, with no separate population code path
  • CounterfactualVector system: All data types supported, including categorical mixtures
  • Type-stable throughout: Concrete types for zero-allocation performance
  • API simplified: Exports cover row-wise operations only

Performance Characteristics:

  • Zero allocations: 0 bytes after warmup for core evaluation, derivatives, and marginal effects
  • Fast per-row: Typically under 100 ns per row; timings vary by system
  • Memory efficient: O(1) memory for counterfactual analysis vs O(n) for data-copying approaches

Future Directions

  • Parallel row evaluation for batches
  • Expanded function library and transformations
  • Streaming and distributed execution patterns
  • Enhanced categorical mixture support