Advanced Features
FormulaCompiler.jl provides sophisticated features for advanced statistical computing scenarios.
Memory-Efficient Override System
The override system allows you to create "what-if" scenarios without duplicating data in memory.
OverrideVector
The foundation of the scenario system is OverrideVector, which provides a memory-efficient constant vector:
using FormulaCompiler
# Traditional approach: allocates 8MB for 1M rows
traditional = fill(42.0, 1_000_000)
# FormulaCompiler: allocates ~32 bytes
efficient = OverrideVector(42.0, 1_000_000)
# Both provide identical interface
traditional[500_000] == efficient[500_000] # true
length(efficient) == 1_000_000 # true
# But massive memory savings
sizeof(traditional) ÷ sizeof(efficient) # ~250,000x smaller!
Data Scenarios
Create scenarios with variable overrides:
using DataFrames, Tables, Statistics # Statistics provides mean()
df = DataFrame(
    x = randn(1000),
    y = randn(1000),
    treatment = rand(Bool, 1000),
    group = rand(["A", "B", "C"], 1000)
)
data = Tables.columntable(df)
# Create baseline scenario
baseline = create_scenario("baseline", data)
# Create treatment scenarios
treatment_on = create_scenario("treatment_on", data;
    treatment = true,
    dose = 100.0 # Add new variable
)
treatment_off = create_scenario("treatment_off", data;
    treatment = false,
    dose = 0.0
)
# Policy scenarios
policy_scenario = create_scenario("policy", data;
    x = mean(df.x), # Set to population mean
    group = "A", # Override categorical
    regulatory_flag = true # Add policy variable
)
Scenario Evaluation
Use scenarios with compiled formulas:
using GLM
model = lm(@formula(y ~ x * treatment + group), df)
compiled = compile_formula(model, data) # Compile with original data
row_vec = Vector{Float64}(undef, length(compiled))
# Evaluate different scenarios for the same individual
compiled(row_vec, baseline.data, 1) # Original data
compiled(row_vec, treatment_on.data, 1) # With treatment
compiled(row_vec, treatment_off.data, 1) # Without treatment
compiled(row_vec, policy_scenario.data, 1) # Policy scenario
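A scenario row combines with the fitted coefficients in the usual way. A minimal sketch of turning the treatment scenario into a linear predictor (dot comes from LinearAlgebra):
using LinearAlgebra
β = coef(model)
compiled(row_vec, treatment_on.data, 1)
η_treated = dot(β, row_vec) # linear predictor for row 1 under treatment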
Scenario Grids
Generate all combinations of scenario parameters:
# Create comprehensive policy analysis
policy_grid = create_scenario_grid("policy_analysis", data, Dict(
    :treatment => [false, true],
    :dose => [50.0, 100.0, 150.0],
    :region => ["North", "South", "East", "West"]
))
# This creates 2×3×4 = 24 scenarios
length(policy_grid) # 24
# Evaluate all scenarios for a specific individual
results = Matrix{Float64}(undef, length(policy_grid), length(compiled))
for (i, scenario) in enumerate(policy_grid)
    compiled(view(results, i, :), scenario.data, 1)
end
# Each row represents one scenario combination
Dynamic Scenario Modification
Modify scenarios after creation:
scenario = create_scenario("dynamic", data; x = 1.0)
# Add new overrides
set_override!(scenario, :y, 100.0)
set_override!(scenario, :new_var, 42.0)
# Bulk updates
update_scenario!(scenario;
    x = 2.0,
    z = 0.5,
    treatment = true
)
# Remove overrides
remove_override!(scenario, :y)
# Check current overrides
get_overrides(scenario) # Dict of current overrides
Advanced Compilation Features
Introspection and Performance
Profile compilation performance:
using BenchmarkTools
# Benchmark compilation time
@benchmark compile_formula($model, $data)
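Compilation is a one-time cost amortized over evaluations; benchmarking a single compiled call shows the steady-state performance (expect zero allocations):
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
@benchmark $compiled($row_vec, $data, 1)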
Derivatives and Contrasts
Near-Zero-Allocation Automatic Differentiation: Compute per-row derivatives of the model row with respect to selected variables using an optimized ForwardDiff-based system.
Performance Characteristics
- Core evaluation: Exactly 0 allocations (modelrow!, compiled functions)
- Finite differences (FD): Exactly 0 allocations (optimized implementation)
- ForwardDiff derivatives: ≤512 bytes per call (ForwardDiff internals, unavoidable)
- Marginal effects: ≤512 bytes per call for AD backend (optimized with preallocated buffers)
- Allocation efficiency: >99.75% compared to naive AD approaches
- Validation: Cross-validated against finite differences for robustness
We are actively working on a more efficient automatic differentiation implementation that will reduce or eliminate allocations in the AD backend. The current ForwardDiff-based implementation provides excellent accuracy with minimal allocations (≤512 bytes), while the finite differences backend already achieves zero allocations for users who prioritize memory efficiency over speed.
ForwardDiff-Based Derivatives
using ForwardDiff, GLM
compiled = compile_formula(model, data)
vars = [:x, :z] # choose continuous vars
de = build_derivative_evaluator(compiled, data; vars=vars)
# Pre-allocate Jacobian matrix (reused across calls)
J = Matrix{Float64}(undef, length(compiled), length(vars))
derivative_modelrow!(J, de, 1) # ≤512 bytes (AD backend)
# Marginal effects η = Xβ (uses preallocated buffers)
β = coef(model)
g_eta = marginal_effects_eta(de, β, 1) # ≤512 bytes (AD backend)
# GLM mean μ = g⁻¹(η) with link functions
g_mu = marginal_effects_mu(de, β, 1; link=LogitLink()) # ≤512 bytes (AD backend)
Finite-difference fallback (simple and robust):
J_fd = derivative_modelrow_fd(compiled, data, 1; vars=vars)
Discrete contrasts for categorical variables:
Δ = contrast_modelrow(compiled, data, 1; var=:group, from="A", to="B")
Tips:
- Variable selection: continuous_variables(compiled, data) returns a convenient list of continuous symbols present in the compiled ops. Pass a subset to build_derivative_evaluator via vars=....
- Chunking: build_derivative_evaluator(...; chunk=:auto) uses ForwardDiff.Chunk{N} where N = length(vars). You can pass an explicit ForwardDiff.Chunk{K}() if you want to tune performance for larger N (see the sketch after this list).
- Links: marginal_effects_mu supports common GLM links including IdentityLink(), LogLink(), LogitLink(), ProbitLink(), CloglogLink(), CauchitLink(), InverseLink(), and SqrtLink() (and InverseSquareLink() when available). Extendable as needed.
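A minimal sketch of explicit chunk tuning, assuming the compiled object and data from above; Chunk{2} simply matches length(vars) here and is illustrative, not a recommendation:
using ForwardDiff
vars = [:x, :z]
de_tuned = build_derivative_evaluator(compiled, data; vars=vars, chunk=ForwardDiff.Chunk{2}())
J = Matrix{Float64}(undef, length(compiled), length(vars))
derivative_modelrow!(J, de_tuned, 1)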
Mixed Models (Fixed Effects)
Derivatives target the fixed-effects design (random effects are intentionally excluded):
using MixedModels, CategoricalArrays
df = DataFrame(y = randn(500), x = randn(500), z = abs.(randn(500)) .+ 0.1,
               group = categorical(rand(1:20, 500)))
mm = fit(MixedModel, @formula(y ~ 1 + x + z + (1|group)), df; progress=false)
data = Tables.columntable(df)
compiled = compile_formula(mm, data) # fixed-effects only
vars = [:x, :z]
de = build_derivative_evaluator(compiled, data; vars=vars)
J = Matrix{Float64}(undef, length(compiled), length(vars))
derivative_modelrow!(J, de, 1)
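Marginal effects follow the same pattern; a sketch using MixedModels' fixef to obtain the fixed-effects coefficient vector:
β_fixed = fixef(mm)
g_eta = marginal_effects_eta(de, β_fixed, 1) # effects on the linear predictor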
Architecture and Optimization
The derivative system achieves near-zero allocations through the following design choices (a buffer-reuse sketch follows the list):
- Preallocated buffers: Jacobian matrices, gradient vectors, and temporary arrays stored in the DerivativeEvaluator
- Typed closures: Compile-time specialization eliminates runtime dispatch
- Prebuilt data structures: Override vectors and merged data reused across calls
- Optimized memory layout: All allocations front-loaded during evaluator construction
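A short loop illustrating the reuse, assuming the de and J built earlier; every iteration writes into the same preallocated buffers:
for row in 1:100
    derivative_modelrow!(J, de, row) # J is overwritten in place each iteration
    # consume J here before the next iteration
end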
Performance Benchmarking
using BenchmarkTools
# Build evaluator once (one-time cost)
de = build_derivative_evaluator(compiled, data; vars=[:x, :z])
J = Matrix{Float64}(undef, length(compiled), length(de.vars))
# Benchmark derivatives (≤512 bytes, AD backend)
@benchmark derivative_modelrow!($J, $de, 25)
# Benchmark marginal effects (≤512 bytes with preallocated buffers, AD backend)
β = coef(model)
g = Vector{Float64}(undef, length(de.vars))
@benchmark marginal_effects_eta!($g, $de, $β, 25)
Complex Formula Support
Nested Functions
FormulaCompiler.jl handles complex nested functions:
@formula(y ~ log(sqrt(abs(x))) + exp(sin(z)) * group)
Boolean Logic
Sophisticated boolean expressions:
@formula(y ~ (x > 0) * (z < mean(z)) + (group == "A") * log(w))
Custom Functions
Define custom functions for use in formulas:
# Define custom function
custom_transform(x) = x > 0 ? log(1 + x) : -log(1 - x)
# Use in formula (requires function to be defined in scope)
@formula(y ~ custom_transform(x) + group)
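A minimal end-to-end sketch, assuming the df from earlier (with y, x, and group columns) and that custom_transform is defined before fitting; model_ct and compiled_ct are just local names for this example:
model_ct = lm(@formula(y ~ custom_transform(x) + group), df)
compiled_ct = compile_formula(model_ct, Tables.columntable(df)) # compiled like any other function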
High-Performance Patterns
Avoiding Allocation in Loops
# Pre-compile and pre-allocate
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
results = Matrix{Float64}(undef, n_simulations, length(compiled))
# Monte Carlo simulation with zero allocations
for sim in 1:n_simulations
    for row in 1:nrow(df)
        compiled(row_vec, data, row)
        # Apply some transformation to row_vec
        results[sim, :] .= some_transformation(row_vec)
        # Continue processing...
    end
end
Batch Processing Large Datasets
function process_large_dataset(model, data, batch_size=1000)
    compiled = compile_formula(model, data)
    n_rows = Tables.rowcount(data)
    n_cols = length(compiled)
    results = Vector{Matrix{Float64}}()
    for start_idx in 1:batch_size:n_rows
        end_idx = min(start_idx + batch_size - 1, n_rows)
        batch_size_actual = end_idx - start_idx + 1
        batch_result = Matrix{Float64}(undef, batch_size_actual, n_cols)
        modelrow!(batch_result, compiled, data, start_idx:end_idx)
        push!(results, batch_result)
    end
    return results
end
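Usage is straightforward; the batch size below is illustrative:
batches = process_large_dataset(model, data, 5_000) # Vector of per-batch matrices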
Parallel Processing
Combine with Julia's parallel processing capabilities:
using Distributed
@everywhere using FormulaCompiler, DataFrames, Tables
function parallel_evaluation(model, data, row_indices)
    compiled = compile_formula(model, data)
    results = @distributed (vcat) for row_idx in row_indices
        row_vec = Vector{Float64}(undef, length(compiled))
        compiled(row_vec, data, row_idx)
        row_vec' # Return as row matrix
    end
    return results
end
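A usage sketch, assuming worker processes were added (e.g., addprocs(4)) before the @everywhere line above:
X_rows = parallel_evaluation(model, data, 1:1_000) # one row per evaluated index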
Integration with Optimization
Gradient-Based Optimization
FormulaCompiler.jl works well with automatic differentiation:
using ForwardDiff
function objective_function(params, compiled_formula, data, target)
    # Update data with new parameters; OverrideVector keeps columns vector-valued without copying
    n = length(data.x)
    modified_data = (; data..., x = OverrideVector(params[1], n), z = OverrideVector(params[2], n))
    # Match the parameter element type so ForwardDiff dual numbers can propagate
    row_vec = Vector{eltype(params)}(undef, length(compiled_formula))
    compiled_formula(row_vec, modified_data, 1)
    # Compute loss against the target model-matrix row
    return sum((row_vec .- target).^2)
end
# Use with ForwardDiff for gradients
compiled = compile_formula(model, data)
target = [1.0, 2.0, 3.0, 4.0] # Target model matrix row
gradient = ForwardDiff.gradient(
    params -> objective_function(params, compiled, data, target),
    [0.0, 1.0] # Initial parameters
)
Bayesian Analysis Integration
Efficient model evaluation in MCMC samplers:
using MCMCChains, Distributions, LinearAlgebra # Distributions for Normal/logpdf, LinearAlgebra for dot
function log_likelihood(params, compiled_formula, data, y_observed)
    # Extract parameters
    β = params[1:length(compiled_formula)]
    σ = exp(params[end]) # Log-scale for positivity
    n_obs = length(y_observed)
    row_vec = Vector{Float64}(undef, length(compiled_formula))
    ll = 0.0
    for i in 1:n_obs
        compiled_formula(row_vec, data, i)
        μ = dot(β, row_vec)
        ll += logpdf(Normal(μ, σ), y_observed[i])
    end
    return ll
end
# Use in MCMC sampler (pseudocode)
# sampler = MCMCSampler(log_likelihood, compiled, data, y)
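For a quick sanity check, the likelihood can be called directly; a sketch assuming the compiled formula and column-table data from earlier, with observed outcomes in df.y (the trailing 0.0 sets log σ = 0, i.e. σ = 1):
params0 = vcat(coef(model), 0.0)
ll = log_likelihood(params0, compiled, data, df.y)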
Memory and Performance Monitoring
Allocation Tracking
Monitor allocation performance:
using BenchmarkTools
function check_allocations(compiled, data, n_tests=1000)
    row_vec = Vector{Float64}(undef, length(compiled))
    n_rows = Tables.rowcount(data) # column tables do not support nrow()
    # Warm up
    compiled(row_vec, data, 1)
    # Benchmark
    result = @benchmark begin
        for i in 1:$n_tests
            $compiled($row_vec, $data, i % $n_rows + 1)
        end
    end
    return result
end
# Should show 0 allocations
benchmark_result = check_allocations(compiled, data)
Memory Usage Analysis
using Profile
function profile_memory_usage(model, data, n_evaluations=10000)
    compiled = compile_formula(model, data)
    row_vec = Vector{Float64}(undef, length(compiled))
    n_rows = Tables.rowcount(data) # column tables do not support nrow()
    # Profile memory (requires starting Julia with --track-allocation=user)
    Profile.clear_malloc_data()
    for i in 1:n_evaluations
        compiled(row_vec, data, i % n_rows + 1)
    end
    # Analyze the emitted .mem files
    # (Use Coverage.jl, ProfileView.jl, or similar for analysis and visualization)
end
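A usage note: allocation tracking must be enabled when Julia starts, otherwise clear_malloc_data() has nothing to record:
# Start Julia with: julia --track-allocation=user
profile_memory_usage(model, data)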