Architecture
Technical overview of FormulaCompiler.jl’s unified, zero‑allocation compilation and execution model.
Design Philosophy
Move expensive work to compile time; keep runtime simple and type‑stable.
- Compile‑time specialization: All positions and operations are baked into types
- Type stability: No dynamic dispatch in hot paths
- Memory reuse: Preallocate once; reuse across evaluations
- Position mapping: Address everything by compile‑time positions, not names
System Overview
Unified Compilation Pipeline
The compilation process transforms statistical formulas into optimized evaluators:
Compilation produces a single position‑mapped evaluator (UnifiedCompiled) in four steps:
- Decompose terms → operations
- Parse the schema‑applied formula and convert into primitive ops:
LoadOp,ConstantOp,UnaryOp,BinaryOp,ContrastOp,CopyOp
- Allocate positions
- Assign scratch positions for intermediates and indices for final outputs
- Cache term → position mapping to reuse computed intermediates
- Specialize operation types
- Embed positions and keys as type parameters (e.g.,
LoadOp{:x, 3}) - Convert op vector to a tuple for type‑stable execution
- Package into
UnifiedCompiled
- Store op tuple and a preallocated scratch buffer sized to maximum position
- Provide a callable that writes directly into a user‑supplied output vector
Result: compiled(row_vec, data, row) runs in tens of nanoseconds with 0 allocations after warmup (typical; see Benchmark Protocol).
Operation Set
Primitive operations form an acyclic execution plan:
LoadOp{Column, OutPos}: data[column][row] → scratch[OutPos]ConstantOp{Value, OutPos}: literal → scratch[OutPos]UnaryOp{Func, InPos, OutPos}: f(scratch[InPos]) → scratch[OutPos]BinaryOp{Func, In1, In2, OutPos}: f(scratch[In1], scratch[In2]) → scratch[OutPos]ContrastOp{Column, OutPositions}: categorical expansion → scratch[each(OutPositions)]CopyOp{InPos, OutIdx}: scratch[InPos] → output[OutIdx]
All operation ordering respects dependencies to ensure each input is ready when used.
Zero‑Allocation Execution
Runtime evaluation is pure array indexing with concrete types:
- Scratch:
Vector{Float64}(undef, ScratchSize)allocated once insideUnifiedCompiled - Output: Provided by the caller; must have length
length(compiled) - Execution: Iterate the typed op tuple and update scratch/output in place
Path to zero allocations:
- Preallocate scratch once per compiled formula
- No temporary arrays or dynamic dispatch during execution
- Column access uses direct field lookup from a NamedTuple (column table)
For complex formulas (>10 operations) and derivative computation, the system uses targeted metaprogramming to maintain zero-allocation performance. See Metaprogramming for implementation details.
CounterfactualVector System
Unified Row-Wise Architecture: Population analysis = individual analysis + averaging
CounterfactualVectorhierarchy: Type-stable single-row perturbations for all data typesNumericCounterfactualVector{T}: Numeric variables with automatic type conversionBoolCounterfactualVector: Boolean variablesCategoricalCounterfactualVector{T,R}: Categorical variables with contrast supportCategoricalMixtureCounterfactualVector{T}: Categorical mixtures for profile effectsTypedCounterfactualVector{T,V}: Generic fallback for other types
Memory Efficiency: O(1) memory usage vs O(n) for data copying approaches
Performance: Simple loops achieve 10-100x speedup over data copying
Type Stability: Concrete types throughout, no
Anytypes on hot paths
Population Analysis Pattern:
# Simple loop pattern for efficient population analysis
population_effects = Vector{Float64}(undef, n_rows)
for row in 1:n_rows
# Use existing row-wise functions with CounterfactualVector perturbations
population_effects[row] = compute_individual_effect(row)
end
population_ame = mean(population_effects)Integration
- GLM.jl: Works with all linear and generalized linear models
- MixedModels.jl: Automatically extracts fixed‑effects formula via
fixed_effects_form - StandardizedPredictors.jl: ZScore standardization supported at compile time
Extensibility
Add an operation or transformation by composing existing ops during decomposition, or extend model support via dispatch that extracts a StatsModels @formula and delegates to the unified compiler.
Performance Monitoring
Check allocations and timings with BenchmarkTools:
@allocated compiled(row_vec, data, 1) # Expect 0
@benchmark $compiled($row_vec, $data, 1)Figure generation
- During docs builds, diagrams under
docs/src/assets/*.mmdare automatically regenerated to SVG if the Mermaid CLI (mmdc) is available (seedocs/make.jl).
Unified Row-Wise Architecture
Design Philosophy: Population analysis = individual analysis + averaging
Key Features:
- Population system eliminated: Clean, simplified architecture
- CounterfactualVector system: All data types supported including mixtures
- Type-stable throughout: Concrete types for zero-allocation performance
- API simplified: Clean exports focused on row-wise operations only
Performance Characteristics:
- Zero allocations: 0 bytes for core evaluation, derivatives, and marginal effects
- Fast per-row: Timings vary by system, but typically <100ns per row
- Memory efficient: O(1) memory for counterfactual analysis vs O(n) for data copying
Future Directions
- Parallel row evaluation for batches
- Expanded function library and transformations
- Streaming and distributed execution patterns
- Enhanced categorical mixture support