Basic Usage
This page covers the core functionality of FormulaCompiler.jl with practical examples.
Core Interfaces
FormulaCompiler.jl provides three main interfaces for different use cases:
1. Zero-Allocation Interface (Fastest)
This is the primary interface for performance-critical applications:
using FormulaCompiler, GLM, DataFrames, Tables
# Setup
df = DataFrame(x = randn(1000), y = randn(1000), group = rand(["A", "B"], 1000))
model = lm(@formula(y ~ x * group), df)
data = Tables.columntable(df)
# Compile once
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
# Use many times (zero allocations)
compiled(row_vec, data, 1) # Zero allocations
compiled(row_vec, data, 100) # Zero allocations
2. Convenient Interface (Allocating)
For quick prototyping or when allocation isn't critical:
# Single row (returns new vector)
row_1 = modelrow(model, data, 1)
# Multiple rows (returns matrix)
rows_subset = modelrow(model, data, [1, 10, 50, 100])
# Range of rows
rows_range = modelrow(model, data, 1:10)
3. Object-Based Interface
Create a reusable evaluator object:
evaluator = ModelRowEvaluator(model, df)
# Allocating version
result = evaluator(1)
# Zero-allocation version
row_vec = Vector{Float64}(undef, length(evaluator))
evaluator(row_vec, 1)
Understanding the Compiled Object
Compiled formulas contain important information:
compiled = compile_formula(model, data)
# Number of columns in the model matrix
length(compiled) # e.g., 4 for intercept + x + group_B + x:group_B
# You can call it like a function
row_vec = Vector{Float64}(undef, length(compiled))
compiled(row_vec, data, row_index) # row_index is any valid row number
Batch Operations
Multiple Row Evaluation
# Pre-allocate matrix for multiple rows
n_rows = 100
matrix = Matrix{Float64}(undef, n_rows, length(compiled))
# Evaluate multiple rows efficiently
for i in 1:n_rows
compiled(view(matrix, i, :), data, i)
end
# Or specific rows
specific_rows = [1, 5, 10, 50, 100]
matrix_subset = Matrix{Float64}(undef, length(specific_rows), length(compiled))
for (idx, row) in enumerate(specific_rows)
compiled(view(matrix_subset, idx, :), data, row)
end
Working with Views
For memory efficiency, you can work with matrix views:
big_matrix = Matrix{Float64}(undef, 1000, length(compiled))
# Fill specific rows using views
for i in 1:10
row_view = view(big_matrix, i, :)
compiled(row_view, data, i)
end
Data Format Considerations
Column Tables (Recommended)
For best performance, convert DataFrames to column tables:
# Convert once, reuse many times
data = Tables.columntable(df)
compiled = compile_formula(model, data)
Using DataFrames (via Tables.columntable)
Convert DataFrames to column tables once, then reuse the result:
data = Tables.columntable(df) # Convert once
compiled = compile_formula(model, data)
compiled(row_vec, data, 1) # Zero allocations after warmup
Supported Formula Features
FormulaCompiler.jl handles all standard StatsModels.jl formula syntax:
Basic Terms
# Continuous variables
@formula(y ~ x + z)
# Transformations
@formula(y ~ log(x) + sqrt(z) + x^2)
# Boolean conditions
@formula(y ~ (x > 0) + (z < mean(z)))
Boolean Variables
Boolean variables (Vector{Bool}) are treated as continuous variables, matching StatsModels behavior exactly:
# Boolean data - treated as continuous
df = DataFrame(
outcome = randn(100),
x = randn(100),
treated = rand(Bool, 100) # true/false values
)
model = lm(@formula(outcome ~ x + treated), df)
data = Tables.columntable(df)
compiled = compile_formula(model, data)
# Numerical encoding: false → 0.0, true → 1.0
# This matches StatsModels exactly
For counterfactual analysis, use data modification with merge():
# Get dataset size
n_rows = length(data.treated)
# Boolean scenarios - individual counterfactuals
row_vec = Vector{Float64}(undef, length(compiled))
data_treated = merge(data, (treated = fill(true, n_rows),))   # All treated
compiled(row_vec, data_treated, 1)   # Evaluate treated scenario
data_control = merge(data, (treated = fill(false, n_rows),))  # All control
compiled(row_vec, data_control, 1)   # Evaluate control scenario
# Numeric scenarios - population analysis (70% treated probability)
data_partial = merge(data, (treated = fill(0.7, n_rows),))
compiled(row_vec, data_partial, 1)   # Evaluate partial treatment scenario
Key Points:
- Vector{Bool} columns work automatically; no conversion is needed
- Produces results identical to StatsModels
- Supports both boolean (true/false) and numeric (e.g., 0.7) overrides
- Zero-allocation performance is maintained
The sketch below combines the treated and control scenarios into a simple per-row comparison.
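As a hedged illustration (not part of the FormulaCompiler API), the following sketch reuses compiled, row_vec, data_treated, and data_control from above and compares the linear predictor for row 1 under the two scenarios; dot comes from LinearAlgebra and coef from GLM.
using LinearAlgebra  # for dot
# Treated scenario for row 1
compiled(row_vec, data_treated, 1)
eta_treated = dot(coef(model), row_vec)
# Control scenario for the same row
compiled(row_vec, data_control, 1)
eta_control = dot(coef(model), row_vec)
# Difference in the linear predictor (treated vs. control) for row 1
effect_row1 = eta_treated - eta_control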
Categorical Variables
Required: FormulaCompiler only supports categorical variables created with CategoricalArrays.jl. Raw string variables are not supported.
using CategoricalArrays
# Required: Convert string columns to categorical
df.group = categorical(df.group)
@formula(y ~ x + group) # Automatic contrast coding
# Not supported: Raw string variables
df.category = ["A", "B", "C"] # String vector
@formula(y ~ x + category) # Will cause compilation errors
# Correct approach
df.category = categorical(["A", "B", "C"])
@formula(y ~ x + category) # Works correctly
Interactions
# Two-way interactions
@formula(y ~ x * group) # Expands to: x + group + x:group
# Three-way interactions
@formula(y ~ x * z * group)
# Function interactions
@formula(y ~ log(x) * group)
Complex Formulas
@formula(y ~ x * group + log(z) * treatment + sqrt(abs(w)) + (x > mean(x)))
Error Handling
FormulaCompiler.jl provides clear error messages:
# Invalid row index
try
    compiled(row_vec, data, 1001) # Only 1000 rows
catch e
    e isa BoundsError && println("Row index out of bounds")
end
# Mismatched output vector size
try
    wrong_size = Vector{Float64}(undef, 3) # Should be length(compiled)
    compiled(wrong_size, data, 1)
catch e
    e isa DimensionMismatch && println("Output vector has wrong size")
end
Memory Management
Pre-allocation Best Practices
# Good: Pre-allocate and reuse
row_vec = Vector{Float64}(undef, length(compiled))
for i in 1:1000
compiled(row_vec, data, i)
# Process row_vec...
end
# Bad: Allocate each time
for i in 1:1000
row_vec = modelrow(model, data, i) # Allocates!
# Process row_vec...
end
Large Dataset Considerations
# For very large datasets, process in chunks
chunk_size = 1000
n_chunks = cld(nrow(df), chunk_size)  # ceiling division so a partial final chunk is included
results = Matrix{Float64}(undef, nrow(df), length(compiled))
for chunk in 1:n_chunks
start_idx = (chunk - 1) * chunk_size + 1
end_idx = min(chunk * chunk_size, nrow(df))
# Evaluate each row in the chunk
for i in start_idx:end_idx
compiled(view(results, i, :), data, i)
end
end
Validation and Debugging
Compilation Validation
Verify that compilation produces expected results:
using FormulaCompiler, GLM, DataFrames, Tables
# Setup test case
df = DataFrame(
y = randn(100),
x = randn(100),
group = rand(["A", "B", "C"], 100)
)
model = lm(@formula(y ~ x * group), df)
data = Tables.columntable(df)
# Compile and validate
compiled = compile_formula(model, data)
# Check dimensions
@assert length(compiled) == size(modelmatrix(model), 2) "Column count mismatch"
# Validate against GLM's modelmatrix for first few rows
mm = modelmatrix(model)
row_vec = Vector{Float64}(undef, length(compiled))
for i in 1:min(5, nrow(df))
compiled(row_vec, data, i)
expected = mm[i, :]
if !isapprox(row_vec, expected; rtol=1e-12)
@warn "Mismatch in row $i" row_vec expected
else
println("Row $i matches GLM modelmatrix")
end
end
Performance Validation
Verify zero-allocation performance:
using BenchmarkTools
# Test zero allocations
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
# Benchmark evaluation
result = @benchmark $compiled($row_vec, $data, 1)
# Validate performance characteristics (absolute times vary by hardware and Julia version)
@assert result.memory == 0 "Expected zero bytes allocated, got $(result.memory) bytes"
@assert result.allocs == 0 "Expected zero allocations, got $(result.allocs) allocations"
println("Zero-allocation validation passed")
println("Memory: $(result.memory) bytes")
println("Allocations: $(result.allocs)")Data Integrity Validation
Ensure data format compatibility:
function validate_data_compatibility(model, data)
try
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
compiled(row_vec, data, 1)
println("Data format compatible")
return true
catch e
@error "Data format incompatible" exception=e
return false
end
end
# Test your data
is_compatible = validate_data_compatibility(model, data)
Common Workflow Patterns
Pattern 1: Monte Carlo Simulation
using LinearAlgebra, Statistics  # dot, mean, std
function monte_carlo_analysis(model, base_data, n_simulations=10000)
compiled = compile_formula(model, base_data)
row_vec = Vector{Float64}(undef, length(compiled))
results = Vector{Float64}(undef, n_simulations)
for i in 1:n_simulations
# Random row selection
random_row = rand(1:length(first(base_data)))
compiled(row_vec, base_data, random_row)
# Compute prediction (example: linear predictor)
prediction = dot(coef(model), row_vec)
results[i] = prediction
end
return results
end
# Usage
mc_results = monte_carlo_analysis(model, data, 10000)
println("Mean prediction: $(mean(mc_results))")
println("Std prediction: $(std(mc_results))")Pattern 2: Cross-Validation Support
function evaluate_fold(model, train_data, test_data, test_indices)
# Compile using training data structure
compiled = compile_formula(model, train_data)
row_vec = Vector{Float64}(undef, length(compiled))
# Evaluate on test data
predictions = Vector{Float64}(undef, length(test_indices))
for (i, test_row) in enumerate(test_indices)
compiled(row_vec, test_data, test_row)
predictions[i] = dot(coef(model), row_vec)
end
return predictions
end
# Example usage in cross-validation
train_data = Tables.columntable(df_train)
test_data = Tables.columntable(df_test)
fold_predictions = evaluate_fold(model, train_data, test_data, 1:nrow(df_test))
Pattern 3: Streaming Data Processing
function process_streaming_data(model, data_stream)
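# Note: this pattern assumes data_stream can be iterated more than once
# (e.g., a vector of column-table batches), since first() is used for
# compilation and the loop below iterates the stream again.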
# Compile once with example data
example_data = first(data_stream)
compiled = compile_formula(model, example_data)
row_vec = Vector{Float64}(undef, length(compiled))
processed_results = Float64[]
for data_batch in data_stream
n_rows = length(first(data_batch))
for row in 1:n_rows
compiled(row_vec, data_batch, row)
# Process each row with zero allocations
result = dot(coef(model), row_vec)
push!(processed_results, result)
end
end
return processed_results
end
Pattern 4: Performance-Critical Loops
function high_frequency_evaluation(model, data, row_indices)
# Pre-compile and pre-allocate everything
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
results = Vector{Float64}(undef, length(row_indices))
# Inner loop with zero allocations
@inbounds for (i, row_idx) in enumerate(row_indices)
compiled(row_vec, data, row_idx)
# Custom computation with pre-allocated vectors
results[i] = sum(row_vec) # Example: sum of predictors
end
return results
end
Integration with Statistical Ecosystem
GLM.jl Integration
using GLM, Distributions
# Linear regression
linear_model = lm(@formula(y ~ x + group), df)
compiled_linear = compile_formula(linear_model, data)
# Logistic regression
df_binary = DataFrame(
success = rand(Bool, 1000), # Boolean response: true/false → 1.0/0.0
x = randn(1000),
group = rand(["A", "B"], 1000)
)
logit_model = glm(@formula(success ~ x + group), df_binary, Binomial(), LogitLink())
compiled_logit = compile_formula(logit_model, Tables.columntable(df_binary))
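# Hedged sketch (not a FormulaCompiler API): convert a compiled row into a
# predicted probability by applying the inverse link to the linear predictor.
# Assumes `dot` from LinearAlgebra and `linkinv` from GLM.
using LinearAlgebra
data_binary = Tables.columntable(df_binary)
row_vec_logit = Vector{Float64}(undef, length(compiled_logit))
compiled_logit(row_vec_logit, data_binary, 1)
eta = dot(coef(logit_model), row_vec_logit)   # linear predictor for row 1
p_hat = GLM.linkinv(LogitLink(), eta)         # predicted probability for row 1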
# Poisson regression
df_count = DataFrame(
count = rand(Poisson(2), 1000),
x = randn(1000),
exposure = rand(0.5:0.1:2.0, 1000)
)
poisson_model = glm(@formula(count ~ x + log(exposure)), df_count, Poisson(), LogLink())
compiled_poisson = compile_formula(poisson_model, Tables.columntable(df_count))
MixedModels.jl Integration
using MixedModels
# Mixed effects model (extracts fixed effects only)
df_mixed = DataFrame(
y = randn(1000),
x = randn(1000),
treatment = rand(Bool, 1000), # Boolean predictor: treated/untreated
subject = rand(1:100, 1000),
cluster = rand(1:50, 1000)
)
mixed_model = fit(MixedModel, @formula(y ~ x + treatment + (1|subject) + (1|cluster)), df_mixed)
compiled_mixed = compile_formula(mixed_model, Tables.columntable(df_mixed))
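# Hedged check (sketch): the compiled width should match the number of
# fixed-effects coefficients; `fixef` is provided by MixedModels.jl.
@assert length(compiled_mixed) == length(fixef(mixed_model))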
# Note: Only fixed effects (x + treatment) are compiled
# Random effects are not included in the compiled evaluator
Custom Contrasts
using StatsModels
# Define custom contrast coding
contrasts_dict = Dict(
:group => EffectsCoding(), # Effects coding for group
:treatment => DummyCoding(base=false) # Dummy coding with false as the base (reference) level
)
model_contrasts = lm(@formula(y ~ x + group + treatment), df, contrasts=contrasts_dict)
compiled_contrasts = compile_formula(model_contrasts, data)
Debugging and Troubleshooting
Common Validation Checks
function comprehensive_validation(model, data)
println("=== FormulaCompiler Validation ===")
# 1. Compilation check
try
compiled = compile_formula(model, data)
println("Compilation successful")
println(" Formula length: $(length(compiled))")
catch e
println("Compilation failed: $e")
return false
end
# 2. Zero allocation check
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
alloc_result = @allocated compiled(row_vec, data, 1)
if alloc_result == 0
println("Zero allocations achieved")
else
println("WARNING: Non-zero allocations: $alloc_result bytes")
end
# 3. Correctness check (first 3 rows)
if applicable(modelmatrix, model)
mm = modelmatrix(model)
max_check = min(3, size(mm, 1))
for i in 1:max_check
compiled(row_vec, data, i)
expected = mm[i, :]
if isapprox(row_vec, expected; rtol=1e-12)
println("Row $i matches reference implementation")
else
println("Row $i mismatch detected")
println(" Expected: $(expected[1:min(3, length(expected))])...")
println(" Got: $(row_vec[1:min(3, length(row_vec))])...")
return false
end
end
end
println("All validation checks passed")
return true
end
# Run comprehensive validation
validation_result = comprehensive_validation(model, data)
Performance Diagnostics
function diagnose_performance(model, data)
println("=== Performance Diagnostics ===")
# Compilation timing
compilation_time = @elapsed compile_formula(model, data)
println("Compilation time: $(round(compilation_time * 1000, digits=1))ms")
# Setup for evaluation timing
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))
# Warmup (important for accurate timing)
for _ in 1:100
compiled(row_vec, data, 1)
end
# Memory allocation check
alloc_check = @allocated compiled(row_vec, data, 1)
println("Memory allocations: $alloc_check bytes")
# Performance benchmark
bench_result = @benchmark $compiled($row_vec, $data, 1)
println("Evaluation performance:")
println(" Memory: $(bench_result.memory) bytes")
println(" Allocations: $(bench_result.allocs)")
# Cache effectiveness test
println("\nCache effectiveness:")
cache_time_1 = @elapsed modelrow!(row_vec, model, data, 1; cache=true)
cache_time_2 = @elapsed modelrow!(row_vec, model, data, 2; cache=true)
println(" First call (with compilation): $(round(cache_time_1 * 1000, digits=2))ms")
println(" Second call (cached): $(round(cache_time_2 * 1000000, digits=1))μs")
return bench_result
end
# Run diagnostics
performance_result = diagnose_performance(model, data)
For advanced performance optimization techniques, see the Performance Guide.