Getting Started

This guide will walk you through the basics of using FormulaCompiler.jl for efficient model matrix evaluation.

Installation

FormulaCompiler.jl is currently available from GitHub:

using Pkg
Pkg.add(url="https://github.com/emfeltham/FormulaCompiler.jl")

Once installed, you can load the package:

using FormulaCompiler

Workflow Overview

Here's how FormulaCompiler.jl works from start to finish:

[Diagram: FormulaCompiler.jl workflow overview]

Basic Workflow

The typical workflow with FormulaCompiler.jl involves three steps:

Fit your model using standard Julia statistical packages
Compile the formula for optimized evaluation
Evaluate rows with zero allocations

Let's walk through a complete example:

Step 1: Fit Your Model

using FormulaCompiler, GLM, DataFrames, Tables, CategoricalArrays

# Create some sample data
df = DataFrame(
    y = randn(1000),
    x = randn(1000),
    z = abs.(randn(1000)) .+ 0.1,
    group = categorical(rand(["A", "B", "C"], 1000)),
    treatment = rand(Bool, 1000)
)

# Fit a model using GLM.jl (or any compatible package)
model = lm(@formula(y ~ x * group + log(z) + treatment), df)

Important: FormulaCompiler requires all categorical variables to use CategoricalArrays.jl. String variables in models are not supported. Always convert string columns to categorical format using categorical(column) before model fitting.

Step 2: Compile the Formula

Convert your data to column-table format for best performance:

data = Tables.columntable(df)

Compile the formula:

compiled = compile_formula(model, data)

The compiled formula contains all the information needed for zero-allocation evaluation.

Step 3: Evaluate Rows

Pre-allocate an output vector:

row_vec = Vector{Float64}(undef, length(compiled))

Now evaluate any row with zero allocations:

compiled(row_vec, data, 1)    # Evaluate row 1
compiled(row_vec, data, 100)  # Evaluate row 100
compiled(row_vec, data, 500)  # Evaluate row 500

Each call writes the model row into row_vec in place and allocates no memory.
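As a sanity check, the compiled output can be compared against the corresponding row of the standard model matrix. The sketch below uses only the calls shown in this guide plus `modelmatrix`; the smaller dataset, the spot-checked rows, and the name `max_diff` are illustrative assumptions.

```julia
using FormulaCompiler, GLM, DataFrames, Tables, CategoricalArrays

# Small illustrative dataset (assumed; any fitted model works the same way)
df = DataFrame(
    y = randn(200),
    x = randn(200),
    z = abs.(randn(200)) .+ 0.1,
    group = categorical(rand(["A", "B", "C"], 200)),
)
model = lm(@formula(y ~ x * group + log(z)), df)

data = Tables.columntable(df)
compiled = compile_formula(model, data)
row_vec = Vector{Float64}(undef, length(compiled))

# Reference rows from the standard (allocating) model matrix
X = modelmatrix(model)

# Largest absolute difference over a few spot-checked rows
max_diff = maximum((1, 50, 200)) do i
    compiled(row_vec, data, i)
    maximum(abs.(row_vec .- X[i, :]))
end
println("max abs difference: ", max_diff)
```

A difference at the level of floating-point rounding suggests the compiled evaluator and the fitted model agree on column order and contrasts.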

Alternative Interfaces

FormulaCompiler.jl provides several interfaces for different use cases:

Convenient Interface (Allocating)

For quick prototyping or when allocation performance isn't critical:

# Single row evaluation
row_values = modelrow(model, data, 1)

# Multiple rows
row_indices = [1, 10, 50, 100]
matrix = modelrow(model, data, row_indices)

Object-Based Interface

Create a reusable evaluator object:

evaluator = ModelRowEvaluator(model, df)

# Evaluate by row index
result = evaluator(1)           # Allocating: returns a new vector
evaluator(row_vec, 1)           # In-place: writes into row_vec without allocating

Batch Evaluation

Evaluate multiple rows by looping over a pre-allocated matrix:

# Pre-allocate matrix
matrix = Matrix{Float64}(undef, 10, length(compiled))

# Evaluate rows 1-10 in batch
for i in 1:10
    compiled(view(matrix, i, :), data, i)
end
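In Julia, loops over non-constant globals at top level tend to allocate, so it can help to move the batch loop into a function. The following is a minimal sketch: `fill_rows!` is a hypothetical helper, not part of FormulaCompiler.jl, and `toy_eval` is a stand-in with the same call shape as `compiled(row_vec, data, i)` so that the example runs without a fitted model.

```julia
# Hypothetical helper (not part of FormulaCompiler.jl): fill one matrix row
# per requested row index using any evaluator callable as f(out, data, i).
function fill_rows!(matrix::AbstractMatrix, evaluator, data, rows)
    size(matrix, 1) == length(rows) ||
        throw(DimensionMismatch("matrix needs one row per requested index"))
    for (k, i) in enumerate(rows)
        evaluator(view(matrix, k, :), data, i)   # write directly into row k
    end
    return matrix
end

# Stand-in evaluator (assumption): writes an intercept and the value of x
toy_eval(out, data, i) = (out[1] = 1.0; out[2] = data.x[i]; out)

data = (x = [10.0, 20.0, 30.0],)
M = fill_rows!(Matrix{Float64}(undef, 2, 2), toy_eval, data, [3, 1])
```

With a real compiled formula, the same helper would presumably be called as `fill_rows!(matrix, compiled, data, 1:10)`.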

How Compilation Works

FormulaCompiler.jl uses a unified compilation pipeline based on position mapping:

Decompose the formula into primitive operations (load, constant, unary, binary, contrast, copy)
Allocate scratch and output positions for all intermediate and final values
Embed those positions as compile-time type parameters
Return a UnifiedCompiled object that evaluates rows with zero allocations
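To make the idea of position mapping concrete, here is a toy sketch in plain Julia. It is not FormulaCompiler's actual implementation: the type names `LoadOp`, `ConstantOp`, `UnaryOp` and the functions `execute!` and `run_ops!` are invented for illustration. It only shows how column names and output positions can be carried as type parameters so that each operation is specialized at compile time.

```julia
# Toy illustration of position mapping; NOT FormulaCompiler's internals.

struct LoadOp{Col, Out} end          # out[Out] = data[Col][row]

struct ConstantOp{Out}               # out[Out] = value
    value::Float64
end

struct UnaryOp{In, Out, F}           # out[Out] = f(out[In])
    f::F
end
UnaryOp{In, Out}(f::F) where {In, Out, F} = UnaryOp{In, Out, F}(f)

function execute!(::LoadOp{Col, Out}, out, data, row) where {Col, Out}
    out[Out] = Float64(getproperty(data, Col)[row])
    return nothing
end
function execute!(op::ConstantOp{Out}, out, data, row) where {Out}
    out[Out] = op.value
    return nothing
end
function execute!(op::UnaryOp{In, Out}, out, data, row) where {In, Out}
    out[Out] = op.f(out[In])
    return nothing
end

# Walk the tuple of operations recursively so every call is type-stable
run_ops!(::Tuple{}, out, data, row) = nothing
function run_ops!(ops::Tuple, out, data, row)
    execute!(first(ops), out, data, row)
    return run_ops!(Base.tail(ops), out, data, row)
end

# "Compiled" form of the formula 1 + x + log(z): three output positions
ops = (ConstantOp{1}(1.0), LoadOp{:x, 2}(), LoadOp{:z, 3}(), UnaryOp{3, 3}(log))

data = (x = [0.5, 2.0], z = [1.0, 4.0])
out = Vector{Float64}(undef, 3)
run_ops!(ops, out, data, 2)          # out now holds [1.0, 2.0, log(4.0)]
```

Because the positions are part of each operation's type, the generated code indexes fixed slots of `out` and needs no runtime lookup tables.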

Performance Verification

You can verify zero-allocation performance using BenchmarkTools.jl:

using BenchmarkTools

# Benchmark the zero-allocation interface
@benchmark $compiled($row_vec, $data, 1)

You should see zero allocations. Absolute times vary by hardware and Julia version, so focus on allocation behavior and relative trends. See the Benchmark Protocol for reproduction details. The output should include lines like:

BenchmarkTools.Trial: Many samples with many evaluations.
 Memory estimate: 0 bytes, allocs estimate: 0.

Compare this to the traditional approach, which builds the full model matrix just to extract one row:

@benchmark modelmatrix($model)[1, :]

Troubleshooting

Common Issues and Solutions

Compilation Errors

Problem: MethodError during compile_formula

# Error: MethodError: no method matching compile_formula(::SomeUnsupportedModel, ::NamedTuple)

Solution: Ensure you're using a supported model type (GLM, MixedModels) or check package compatibility.

Problem: BoundsError or dimension mismatches

# Error: BoundsError: attempt to access 500-element Vector at index [1001]

Solution: Verify that your data contains the expected number of rows and that row_idx is within bounds.
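A small guard can catch out-of-range indices before evaluation. The sketch below assumes the data is a column table (a NamedTuple of equal-length vectors); `check_row` is a hypothetical helper, not part of FormulaCompiler.jl.

```julia
# Hypothetical helper (not part of FormulaCompiler.jl)
function check_row(data::NamedTuple, row_idx::Integer)
    n = length(first(data))          # column tables store one vector per column
    1 <= row_idx <= n ||
        throw(ArgumentError("row_idx = $row_idx is outside 1:$n"))
    return row_idx
end

data = (x = randn(500), z = rand(500))
check_row(data, 500)      # fine, returns 500
# check_row(data, 1001)   # would throw ArgumentError
```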

Performance Issues

Problem: Non-zero allocations detected

# @benchmark shows non-zero memory allocations

Solutions:

Use Tables.columntable(df) instead of DataFrame directly
Ensure output vector is pre-allocated with correct size
Check for type instabilities in your data (mixed types in columns)
Verify all categorical variables use CategoricalArrays.jl
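One way to look for the mixed-type columns mentioned above is to check whether each column's element type is concrete. This is a rough diagnostic sketch in plain Julia; `nonconcrete_columns` is a hypothetical name, and the sample columns are made up for illustration.

```julia
# Hypothetical diagnostic (not part of FormulaCompiler.jl): list columns whose
# element type is abstract or a Union, e.g. Any or Union{Missing, Float64}.
function nonconcrete_columns(data::NamedTuple)
    return [name for (name, col) in pairs(data) if !isconcretetype(eltype(col))]
end

data = (
    x = [1.0, 2.0, 3.0],           # Vector{Float64}: concrete
    mixed = Any[1, 2.5, "a"],      # Vector{Any}: abstract element type
    maybe = [1.0, missing, 3.0],   # Vector{Union{Missing, Float64}}
)

nonconcrete_columns(data)          # returns [:mixed, :maybe]
```

Columns flagged here are candidates for conversion (for example with `Float64.(col)`) before compiling.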

Problem: Slower than expected performance

# Evaluation takes longer than anticipated

Solutions:

Let compilation warm up with a few calls before benchmarking
Use the caching interface (modelrow! with cache=true) for repeated evaluations
Check for complex formulas that may benefit from simplification
Ensure data is in optimal format (Tables.columntable)

Data Format Issues

Problem: String variables in models

# Error: FormulaCompiler does not support raw string variables
df.category = rand(["A", "B", "C"], nrow(df))  # String vector, one entry per row
model = lm(@formula(y ~ x + category), df)  # Will cause issues

Solution: Convert all categorical data to CategoricalArrays.jl format before model fitting:

# Required: Convert strings to categorical
df.category = categorical(df.category)
model = lm(@formula(y ~ x + category), df)  # Now works correctly

Problem: Categorical contrasts or unexpected behavior

# Error with categorical contrasts or unexpected contrast behavior

Solutions:

Ensure all categorical variables use CategoricalArrays.jl: categorical(column)
Verify factor levels are consistent between training and evaluation data
Check contrast specifications in model fitting: contrasts = Dict(:var => EffectsCoding())
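As an illustration of the last two checks, the sketch below inspects the level order of a categorical column and passes an explicit contrast at fit time. The dataset and the column name `group` are assumptions; `levels`, `EffectsCoding`, and the `contrasts` keyword come from CategoricalArrays.jl, StatsModels.jl, and GLM.jl respectively.

```julia
using GLM, StatsModels, DataFrames, CategoricalArrays

df = DataFrame(
    y = randn(300),
    x = randn(300),
    group = categorical(rand(["A", "B", "C"], 300)),
)

# Level order determines the reference level and the contrast columns
group_levels = levels(df.group)

# Request effects coding for `group` explicitly when fitting
model = lm(@formula(y ~ x + group), df;
           contrasts = Dict(:group => EffectsCoding()))

# Intercept, x, and one column per non-reference level of group
println(coefnames(model))
```

Evaluation data should then use a categorical column with the same levels in the same order as `group_levels`.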

Problem: Missing values causing errors

# Error: missing values not supported

Solution: Remove or impute missing values before model fitting and compilation.
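For the removal route, `dropmissing` from DataFrames.jl is usually sufficient. The toy data below is made up for illustration.

```julia
using DataFrames

df = DataFrame(
    y = [1.2, missing, 0.7, 2.1],
    x = [0.5, 1.5, missing, 3.0],
)

# Keep only rows that are complete in the model variables; by default
# dropmissing also narrows Union{Missing, Float64} columns to Float64
df_clean = dropmissing(df, [:y, :x])

eltype(df_clean.x)   # Float64
nrow(df_clean)       # 2
```

Fit the model and build the column table from `df_clean` so both steps see the same rows.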

Memory Issues

Problem: Large memory usage despite zero-allocation claims

# High memory usage in application

Solutions:

Use direct data modification with merge() instead of creating full data copies
Reuse compiled formulas rather than recompiling
Clear model cache if accumulating many different compilations: clear_model_cache!()
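The `merge()` suggestion relies on a property of Julia NamedTuples: merging replaces the named column and reuses every other column by reference. A minimal sketch with a hand-written column table (the column names and values are illustrative):

```julia
# Column table: a NamedTuple with one vector per variable
data = (x = [1.0, 2.0, 3.0], z = [0.5, 1.5, 2.5])

# Replace only `x`; merge builds a new NamedTuple but reuses the other columns
scenario = merge(data, (x = fill(2.0, length(data.x)),))

scenario.z === data.z    # true: `z` is shared, not copied
data.x                   # original column is untouched: [1.0, 2.0, 3.0]
```

Only the replaced column costs new memory, which keeps counterfactual scenarios cheap even for wide tables.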

Performance Validation

Verify your setup achieves expected performance:

using BenchmarkTools

# Check zero allocations
result = @benchmark $compiled($row_vec, $data, 1)
@assert result.allocs == 0 "Expected zero allocations, got $(result.allocs) ($(result.memory) bytes)"

# Check cache effectiveness  
@time modelrow!(row_vec, model, data, 1; cache=true)  # First call
@time modelrow!(row_vec, model, data, 2; cache=true)  # Should be much faster