FormulaCompiler.jl

FormulaCompiler implements a type-stable counterfactual vector system providing variable substitution with O(1) memory overhead without data duplication. This is particularly useful for policy analysis and treatment effect evaluation.

Key Features

  • Memory efficiency: Per-row evaluation with zero allocations
  • Computational performance: Improvements over traditional modelmatrix() approaches for single-row evaluations
  • Comprehensive compatibility: Supports all valid StatsModels.jl formulas, including complex interactions and mathematical functions
  • Categorical mixtures: Compile-time support for weighted categorical specifications for marginal effects
  • Scenario analysis: Memory-efficient variable override system for counterfactual analysis
  • Unified architecture: Single compilation pipeline accommodates diverse formula structures
  • Ecosystem integration: Compatible with GLM.jl, MixedModels.jl, and StandardizedPredictors.jl
  • Dual-backend derivatives: Memory-efficient finite differences and ForwardDiff automatic differentiation options (ForwarDiff is the strongly preferred default option)

Installation

using Pkg
Pkg.add(url = "https://github.com/emfeltham/FormulaCompiler.jl")

Quick Start

Workflow

Figure: Basic FormulaCompiler.jl workflow

using FormulaCompiler, GLM, DataFrames, Tables

# Fit your model normally
df = DataFrame(
    y = randn(1000),
    x = randn(1000),
    z = abs.(randn(1000)) .+ 0.1,
    group = categorical(rand(["A", "B", "C"], 1000))
)

model = lm(@formula(y ~ x * group + log(z)), df)

# Compile once for efficient repeated evaluation  
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Memory-efficient evaluation suitable for repeated calls
compiled(row_vec, data, 1)  # Zero allocations after warmup

Performance Comparison

Performance results across all tested formula types:

using BenchmarkTools

# Traditional approach (creates full model matrix)
@benchmark modelmatrix(model)[1, :]
# Traditional approach with allocation overhead

# FormulaCompiler (zero-allocation single row)
data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

@benchmark compiled(row_vec, data, 1)
# FormulaCompiler approach with zero allocations

Zero Allocations (verified in test suite):

  • Core row evaluation (compiled(row,data,i))
  • Scenario evaluation (CounterfactualVector)
  • FD Jacobian
  • AD Jacobian

Allocation Characteristics

FormulaCompiler.jl provides different allocation guarantees depending on the operation:

Core Model Evaluation

  • Zero allocations: modelrow!() and direct compiled() calls are 0 bytes after warmup
  • Performance: Fast per-row evaluation across all formula complexities
  • Validated: Test cases confirm zero-allocation performance

Derivative Operations

FormulaCompiler.jl provides computational primitives for derivatives with dual backend support:

BackendTypeAllocationsPerformanceAccuracyRecommendation
Automatic DifferentiationADEvaluator0 bytesFastMachine precisionStrongly preferred default
Finite DifferencesFDEvaluator0 bytesFast~1e-8-
# Build evaluator with automatic differentiation (strongly recommended)
de = derivativeevaluator(:ad, compiled, data, vars)  # Automatic differentiation

# Compute Jacobian with zero allocations
J = Matrix{Float64}(undef, length(compiled), length(vars))
derivative_modelrow!(J, de, row)  # 0 bytes, machine precision

# For marginal effects, use Margins.jl
using Margins
g = Vector{Float64}(undef, length(vars))
marginal_effects_eta!(g, de, beta, row)  # Marginal effects on η

Use Cases

  • Monte Carlo simulations with large data and many model evaluations
  • Bootstrap resampling: Repeated matrix construction
  • Marginal effects: cf. Margins.jl which is built on FormulaCompiler.jl

Next Steps