API Reference

API reference for FormulaCompiler.jl functions and types.

Core Compilation Functions

FormulaCompiler.compile_formula - Function
compile_formula(model, data) -> UnifiedCompiled

Compile a fitted statistical model into a zero-allocation, type-specialized evaluator.

Transforms statistical formulas into optimized computational engines using position mapping that achieves ~50ns per row evaluation with zero allocations. The resulting evaluator provides constant-time row access regardless of dataset size.

Arguments

  • model: Fitted statistical model (GLM.LinearModel, GLM.GeneralizedLinearModel, MixedModels.LinearMixedModel, etc.)
  • data: Data in Tables.jl format (preferably Tables.columntable(df) for optimal performance)

Returns

  • UnifiedCompiled{T,Ops,S,O}: Callable evaluator with embedded position mappings
    • Call as compiled(output_vector, data, row_index) for zero-allocation evaluation
    • length(compiled) returns number of model matrix columns

Performance Characteristics

  • Compilation: One-time cost, amortized over all subsequent row evaluations
  • Evaluation: Zero bytes allocated after warmup
  • Memory: O(output_size) scratch space, reused across all evaluations
  • Scaling: Evaluation time independent of dataset size

Supported Models

  • Linear models: GLM.lm(@formula(y ~ x + group), df)
  • Generalized linear models: GLM.glm(@formula(success ~ x), df, Binomial(), LogitLink())
  • Mixed models: MixedModels.fit(MixedModel, @formula(y ~ x + (1|group)), df) (fixed effects only)
  • Custom contrasts: Models with DummyCoding(), EffectsCoding(), HelmertCoding(), etc.
  • Standardized predictors: Models with ZScore() standardization

Formula Features

  • Basic terms: x, log(z), x^2, (x > 0), integer and float variables
  • Categorical variables: Must use CategoricalArrays.jl format - raw strings not supported
  • Interactions: x * group, x * y * z, log(x) * group
  • Functions: log, exp, sqrt, sin, cos, abs, ^ (integer and fractional powers)
  • Boolean conditions: (x > 0), (z >= mean(z)), (group == "A")
  • Complex formulas: x * log(abs(z)) * group + sqrt(y) + (w > threshold)

Data Requirements

  • Categorical variables: Must use categorical(column) before model fitting
  • Missing values: Not supported - remove with dropmissing() or impute before compilation
  • Table format: Use Tables.columntable(df) for optimal performance

Example

using FormulaCompiler, GLM, DataFrames, Tables, CategoricalArrays

# Fit model
df = DataFrame(
    y = randn(1000), 
    x = randn(1000), 
    group = categorical(rand(["A", "B"], 1000))  # Required: use categorical()
)
model = lm(@formula(y ~ x * group + log(abs(x) + 1)), df)

# Compile once
data = Tables.columntable(df)  # Convert for optimal performance
compiled = compile_formula(model, data)

# Use many times (zero allocations)
output = Vector{Float64}(undef, length(compiled))
compiled(output, data, 1)     # Zero allocations
compiled(output, data, 500)   # Zero allocations

# Substantial speedup compared to modelmatrix(model)[row, :]

Mixed Models Example

using MixedModels
mixed = fit(MixedModel, @formula(y ~ x + treatment + (1|subject)), df)
compiled = compile_formula(mixed, data)  # Compiles fixed effects: y ~ x + treatment

See also: modelrow!, ModelRowEvaluator

compile_formula(formula::StatsModels.FormulaTerm, data) -> UnifiedCompiled

Compile a formula directly without a fitted model for zero-allocation evaluation.

This overload enables compilation from raw formulas, bypassing model fitting when only the computational structure is needed. Useful for custom model implementations or direct formula evaluation workflows.

Arguments

  • formula::StatsModels.FormulaTerm: Formula specification (e.g., from @formula(y ~ x + group))
  • data: Data in Tables.jl format (preferably Tables.columntable(df))

Returns

  • UnifiedCompiled{T,Ops,S,O}: Zero-allocation evaluator, same interface as model-based compilation

Performance

  • Compilation: One-time cost; remains fast for complex formulas
  • Evaluation: Zero bytes allocated
  • Memory: Identical to model-based compilation

Example

using StatsModels, FormulaCompiler, Tables

# Direct formula compilation
formula = @formula(y ~ x * group + log(z))
data = Tables.columntable(df)
compiled = compile_formula(formula, data)

# Zero-allocation evaluation
output = Vector{Float64}(undef, length(compiled))
compiled(output, data, 1)  # Zero allocations

Use Cases

  • Custom model implementations requiring direct formula evaluation
  • Performance-critical applications avoiding model fitting overhead
  • Exploratory analysis with formula variations
  • Integration with external statistical frameworks

See also: compile_formula(model, data) for model-based compilation

FormulaCompiler.get_or_compile_formula - Function
get_or_compile_formula(model, data)

Return the cached compiled formula for this model and data, or compile and cache a new one, using semantic type-aware cache keys.

Cache Key Strategy

Creates cache key based on:

  1. Model object (coefficients, structure)
  2. Column names (formula structure)
  3. Semantic type categories (compilation behavior)

Type Category Benefits

  • Better cache hits: Vector{Int} and Vector{Float64} share cache entry
  • Correct mixture handling: CategoricalArray vs CategoricalMixture distinguished
  • Future-proof: New types can be added to category system
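
As a rough illustration of the semantic-category idea, the sketch below maps each column to a category symbol and builds a key from the model's identity, the column names, and those categories. The names `semantic_category` and `cache_key` are hypothetical and are not part of FormulaCompiler's API; the package's actual key construction may differ.

```julia
# Hypothetical sketch: columns are classified by semantic category, so
# Vector{Int} and Vector{Float64} both map to :numeric and yield equal keys.
semantic_category(col::AbstractVector{<:Real}) = :numeric
semantic_category(col::AbstractVector) = :other

function cache_key(model, data::NamedTuple)
    names = keys(data)                            # column names
    cats = map(semantic_category, values(data))   # one category per column
    return (objectid(model), names, cats)
end

model = "stand-in for a fitted model"
key_float = cache_key(model, (x = [1.0, 2.0], y = [0.5, 1.5]))
key_int   = cache_key(model, (x = [1, 2],     y = [0.5, 1.5]))
key_str   = cache_key(model, (x = ["a", "b"], y = [0.5, 1.5]))
```

Under these assumptions `key_float` and `key_int` compare equal, while `key_str` differs because its `x` column falls into another category.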

Examples

# These share a cache entry (both :numeric):
data1 = (x = Float64[1.0, 2.0], y = ...)
data2 = (x = Int[1, 2], y = ...)  # Cache HIT ✓

# These get separate entries (different compilation):
data3 = (edu = categorical(["HS"]), ...)      # :categorical
data4 = (edu = mix("HS" => 0.5, "C" => 0.5), ...)  # :mixture - Cache MISS ✓

Model Row Evaluation

FormulaCompiler.modelrow - Function
modelrow(model, data, row_idx) -> Vector{Float64}

Evaluate a single model matrix row, returning a new vector (allocating version).

Convenient interface for when pre-allocation is not practical. Uses internal formula compilation and caching for performance optimization, though the non-allocating modelrow! interface is preferred for performance-critical code.

Arguments

  • model: Fitted statistical model (GLM, MixedModel, etc.)
  • data: Data in Tables.jl format
  • row_idx::Int: Row index to evaluate (1-based)

Returns

  • Vector{Float64}: New vector containing model matrix row values

Performance

  • First call: Includes one-time compilation cost
  • Subsequent calls: Fast evaluation plus allocation cost for vector creation
  • Memory: Allocates new vector each call
  • Caching: Automatically caches compiled formula for reuse

Example

using FormulaCompiler, GLM, Tables

model = lm(@formula(y ~ x * group + log(z)), df)
data = Tables.columntable(df)

# Convenient single-row evaluation
row_1 = modelrow(model, data, 1)      # First call (includes compilation)
row_2 = modelrow(model, data, 2)      # Subsequent calls (uses cached compilation)
row_100 = modelrow(model, data, 100)  # Fast (uses cached compilation)

When to Use

  • Prototyping: Quick analysis and exploration
  • Small datasets: When allocation overhead is negligible
  • Convenience: When code simplicity outweighs performance requirements

Performance Alternative

For zero-allocation performance in loops, use modelrow!:

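compiled = compile_formula(model, data)  # Compile once, outside the loop
output = Vector{Float64}(undef, length(compiled))
for i in 1:length(first(data))  # Iterate over every row of the column table
    modelrow!(output, compiled, data, i)  # Zero allocations each iteration
end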

See also: modelrow!, ModelRowEvaluator, compile_formula

modelrow(model, data, row_indices) -> Matrix{Float64}

Evaluate multiple rows and return a new matrix (allocating version). Uses compiled formulas for optimal performance.

Example

matrix = modelrow(model, data, [1, 5, 10])  # Returns Matrix{Float64}

modelrow(compiled_formula, data, row_idx) -> Vector{Float64}

Evaluate a single row with a pre-compiled formula.

Example

compiled = compile_formula(model, data)
row_values = modelrow(compiled, data, 1)  # Returns Vector{Float64}

modelrow(compiled_formula, data, row_indices) -> Matrix{Float64}

Evaluate multiple rows with a pre-compiled formula.

Example

compiled = compile_formula(model, data)
matrix = modelrow(compiled, data, [1, 5, 10])  # Returns Matrix{Float64}

FormulaCompiler.modelrow! - Function
modelrow!(output, compiled, data, row_idx) -> output

Evaluate a single model matrix row in-place with zero allocations.

The primary interface for high-performance row evaluation. This function provides zero-allocation evaluation, making it suitable for tight computational loops and performance-critical applications.

Arguments

  • output::AbstractVector{Float64}: Pre-allocated output vector (modified in-place)
    • Must have length ≥ length(compiled)
    • Contents will be overwritten with model matrix row values
  • compiled: Compiled formula from compile_formula(model, data)
  • data: Data in Tables.jl format (preferably Tables.columntable(df) for best performance)
  • row_idx::Int: Row index to evaluate (1-based indexing)

Returns

  • output: The same vector passed in, now containing the evaluated model matrix row

Performance

  • Memory: Zero bytes allocated after warmup
  • Scaling: Time per row is independent of dataset size; it still depends on the number of terms in the formula
  • Validation: Tested across 2000+ diverse formula configurations

Example

using FormulaCompiler, GLM, Tables, DataFrames  # DataFrames provides nrow

# Setup (one-time cost)
model = lm(@formula(y ~ x * group + log(z)), df)
data = Tables.columntable(df)
compiled = compile_formula(model, data)
output = Vector{Float64}(undef, length(compiled))

# High-performance evaluation (repeated many times)
modelrow!(output, compiled, data, 1)    # Zero allocations
modelrow!(output, compiled, data, 100)  # Zero allocations

# Monte Carlo simulation example
for i in 1:1_000_000
    row_idx = rand(1:nrow(df))
    modelrow!(output, compiled, data, row_idx)  # Zero allocations each call
    # Process output...
end

Error Handling

  • BoundsError: If row_idx exceeds data size
  • DimensionMismatch: If output vector is too small
  • Validates arguments in debug builds

See also: modelrow for allocating version, compile_formula, ModelRowEvaluator

modelrow!(row_vec, model, data, row_idx; cache=true)

Evaluate a single row of the model matrix in-place with automatic compilation.

Arguments

  • row_vec::AbstractVector{Float64}: Pre-allocated output vector (modified in-place)
  • model: Statistical model (GLM, MixedModel, etc.)
  • data: Data in Tables.jl format
  • row_idx::Int: Row index to evaluate
  • cache::Bool: Whether to cache compiled formula (default: true)

Returns

  • row_vec: The same vector passed in, now containing the evaluated row

Example

model = lm(@formula(y ~ x + group), df)
data = Tables.columntable(df)
row_vec = Vector{Float64}(undef, size(modelmatrix(model), 2))
modelrow!(row_vec, model, data, 1)
Note

First call compiles the formula. Subsequent calls reuse cached version when cache=true.

FormulaCompiler.ModelRowEvaluator - Type
ModelRowEvaluator{T,Ops,S,O}

Object-oriented interface for reusable, pre-compiled model evaluation.

Combines compiled formula, data, and output buffer into a single object that can be called repeatedly for both allocating and non-allocating row evaluation. Useful when the same model and data will be evaluated many times.

Type Parameters

  • T: Element type (typically Float64)
  • Ops: Compiled operations tuple type
  • S: Scratch buffer size
  • O: Output vector size

Fields

  • compiled::UnifiedCompiled: Pre-compiled formula
  • data::NamedTuple: Data in column-table format
  • row_vec::Vector{Float64}: Internal buffer for non-allocating calls

Constructors

ModelRowEvaluator(model, df::DataFrame)      # Converts DataFrame to column table
ModelRowEvaluator(model, data::NamedTuple)   # Uses data directly

Interface

# Allocating interface - returns new vector
result = evaluator(row_idx)

# Non-allocating interface - uses provided vector  
evaluator(output_vector, row_idx)

Performance

  • Construction: One-time compilation cost
  • Allocating calls: Fast evaluation plus allocation cost
  • Non-allocating calls: Zero bytes allocated
  • Memory: Minimal overhead beyond compiled formula and data reference

Example

using FormulaCompiler, GLM

# Create evaluator (one-time setup)
model = lm(@formula(y ~ x * group + log(z)), df)
evaluator = ModelRowEvaluator(model, df)

# Allocating interface (convenient)
row_1 = evaluator(1)      # Returns Vector{Float64}
row_2 = evaluator(100)    # Returns Vector{Float64}

# Non-allocating interface (fast)
output = Vector{Float64}(undef, length(evaluator))
evaluator(output, 1)      # Zero allocations
evaluator(output, 100)    # Zero allocations

# Batch processing
results = Matrix{Float64}(undef, 1000, length(evaluator))
for i in 1:1000
    evaluator(view(results, i, :), i)  # Zero allocations
end

When to Use

  • Repeated evaluation: Same model and data used many times
  • Object-oriented style: Prefer objects over function calls
  • Mixed interfaces: Need both allocating and non-allocating evaluation
  • Clean encapsulation: Bundle model, data, and buffer management

See also: modelrow!, modelrow, compile_formula


Derivatives

FormulaCompiler provides computational primitives for computing derivatives of model matrix rows with respect to continuous variables. These functions enable zero-allocation Jacobian computation using either automatic differentiation (ForwardDiff) or finite differences.

For marginal effects, standard errors, and complete statistical workflows, see Margins.jl.

Evaluator Construction

Recommended: Use the unified dispatcher for user-facing code:

# Automatic differentiation (preferred)
de = derivativeevaluator(:ad, compiled, data, [:x, :z])

# Finite differences
de = derivativeevaluator(:fd, compiled, data, [:x, :z])

Advanced: Direct constructor functions (primarily for internal use):

FormulaCompiler.derivativeevaluator_fd - Function
derivativeevaluator_fd(compiled, data, vars) -> FDEvaluator

Create a finite differences specialized FDEvaluator using Float64 counterfactual vectors.

Returns a concrete FDEvaluator that holds only FD infrastructure, with no AD-specific fields. Uses NumericCounterfactualVector{Float64} for type-stable counterfactual operations.

FormulaCompiler.derivativeevaluator_ad - Function
derivativeevaluator_ad(compiled, data, vars) -> ADEvaluator

Create an automatic differentiation specialized ADEvaluator using Dual counterfactual vectors.

Returns a concrete ADEvaluator that holds only AD infrastructure, with no FD-specific fields. Uses NumericCounterfactualVector{Dual{...}} for type-stable dual number operations.


Jacobian Computation

FormulaCompiler.derivative_modelrow! - Function
derivative_modelrow!(J, de::ADEvaluator, row) -> J

Primary automatic differentiation API: zero allocations via ForwardDiff.jacobian!.

Uses a cached ForwardDiff configuration to avoid allocations, replacing manual dual construction with ForwardDiff's optimized jacobian! routine.

Arguments

  • J::AbstractMatrix{Float64}: Preallocated Jacobian buffer of size (n_terms, n_vars)
  • de::ADEvaluator: AD evaluator built by derivativeevaluator(:ad, compiled, data, vars)
  • row::Int: Row index to evaluate (1-based indexing)

Returns

  • J: The same matrix passed in, now containing J[i,j] = ∂X[i]/∂vars[j] for the specified row

Performance Characteristics

  • Memory: 0 bytes allocated (cached buffers and ForwardDiff config)
  • Speed: Target ~60ns with ForwardDiff.jacobian! optimization
  • Accuracy: Machine precision derivatives via ForwardDiff dual arithmetic

Example

using FormulaCompiler, GLM, Tables

# Setup model
model = lm(@formula(y ~ x + z), df)
data = Tables.columntable(df)
compiled = compile_formula(model, data)

# Build AD evaluator
de = derivativeevaluator(:ad, compiled, data, [:x, :z])

# Zero-allocation Jacobian computation
J = Matrix{Float64}(undef, length(compiled), length(de.vars))
derivative_modelrow!(J, de, 1)  # 0 bytes allocated

derivative_modelrow!(J, de::FDEvaluator, row) -> J

Primary finite differences API - zero allocations, concrete type dispatch.

Computes the full Jacobian matrix ∂X[i]/∂vars[j] using central differences with adaptive step sizing. The signature matches the AD method (automatic_diff.jl), so the two backends can be swapped without changing call sites.

Performance Characteristics

  • Memory: 0 bytes allocated (uses pre-allocated FDEvaluator buffers)
  • Speed: ~65ns per variable with mathematical optimizations
  • Accuracy: Adaptive step sizing balances truncation/roundoff error

Mathematical Method

Central differences: ∂f/∂x ≈ [f(x+h) - f(x-h)] / (2h)

Step sizing: h = ε^(1/3) * max(1, |x|) for numerical stability
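
To make the method concrete, here is a minimal, allocating sketch of a central-difference Jacobian using the step rule quoted above. It runs on a toy row function rather than a compiled formula; `toy_row` and `fd_jacobian` are hypothetical names, and the package's zero-allocation implementation is not reproduced here.

```julia
# Central-difference Jacobian of a toy "model row" f(x, z) = [1, x, z, x*z].
# Step rule: h = eps^(1/3) * max(1, |v|), applied per variable.
toy_row(v) = [1.0, v[1], v[2], v[1] * v[2]]   # v = (x, z)

function fd_jacobian(f, v::Vector{Float64})
    n_terms = length(f(v))
    J = Matrix{Float64}(undef, n_terms, length(v))
    for j in eachindex(v)
        h = cbrt(eps(Float64)) * max(1.0, abs(v[j]))
        vp = copy(v); vp[j] += h              # perturb variable j upward
        vm = copy(v); vm[j] -= h              # perturb variable j downward
        J[:, j] = (f(vp) .- f(vm)) ./ (2h)    # column j holds ∂X[i]/∂vars[j]
    end
    return J
end

J = fd_jacobian(toy_row, [2.0, 3.0])
J_exact = [0.0 0.0; 1.0 0.0; 0.0 1.0; 3.0 2.0]   # analytic Jacobian at (2, 3)
```

Because every term of the toy row is at most bilinear in (x, z), the central difference agrees with the analytic Jacobian up to rounding error.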

Arguments

  • J::AbstractMatrix{Float64}: Pre-allocated Jacobian buffer of size (n_terms, n_vars)
  • de::FDEvaluator: Pre-built evaluator from derivativeevaluator_fd(compiled, data, vars)
  • row::Int: Row index to evaluate (1-based indexing)

Returns

  • J: The same matrix passed in, containing J[i,j] = ∂X[i]/∂vars[j]

Example

using FormulaCompiler, GLM, Tables

# Setup model and data
model = lm(@formula(y ~ x * group + log(abs(z) + 1)), df)
data = Tables.columntable(df)
compiled = compile_formula(model, data)

# Build FD evaluator
de_fd = derivativeevaluator_fd(compiled, data, [:x, :z])

# Zero-allocation finite differences
J = Matrix{Float64}(undef, length(compiled), length(de_fd.vars))
derivative_modelrow!(J, de_fd, 1)  # 0 bytes allocated

See also: derivativeevaluator_fd


Variable Identification

FormulaCompiler.continuous_variables - Function
continuous_variables(compiled, data) -> Vector{Symbol}

Identify continuous variables suitable for derivative computation from a compiled formula.

Analyzes compiled operations to distinguish between continuous variables (suitable for differentiation) and categorical variables (requiring discrete analysis). Essential for determining valid variable sets for derivative evaluators and marginal effects computation.

Arguments

  • compiled::UnifiedCompiled: Compiled formula from compile_formula(model, data)
  • data::NamedTuple: Data in column-table format (from Tables.columntable(df))

Returns

  • Vector{Symbol}: Sorted list of continuous variable names
    • Includes: Float64, Int64, Int32, Int variables used in LoadOp operations
    • Excludes: Variables appearing only in ContrastOp operations (categorical contrasts)
    • Excludes: Boolean variables (treated as categorical regardless of numeric type)

Classification Algorithm

  1. Operation analysis: Scan compiled operations for LoadOp vs ContrastOp usage
  2. Type filtering: Verify variables have Real element types in data
  3. Boolean exclusion: Remove Bool variables (categorical by convention)
  4. Categorical exclusion: Remove variables only appearing in contrast operations
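
The type-based part of this classification (steps 2 and 3) can be sketched without the package. The helper below is hypothetical and inspects only element types in a column table; it does not model the LoadOp/ContrastOp scan from steps 1 and 4, and a plain string column stands in for a categorical one.

```julia
# Steps 2 and 3 only: keep columns with a Real element type, drop Bool columns.
function real_nonbool_columns(data::NamedTuple)
    selected = Symbol[]
    for (name, col) in pairs(data)
        T = eltype(col)
        if T <: Real && T !== Bool
            push!(selected, name)
        end
    end
    return sort!(selected)   # sorted for consistent ordering across calls
end

data = (price = [1.5, 2.5], quantity = [3, 4],
        available = [true, false], category = ["A", "B"])
vars = real_nonbool_columns(data)
```

Here `vars` contains `:price` and `:quantity`, matching the outcome described for the documented example.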

Example

using FormulaCompiler, GLM, DataFrames, Tables, CategoricalArrays

# Mixed variable types
df = DataFrame(
    y = randn(1000),
    price = randn(1000),          # Float64 - continuous
    quantity = rand(1:100, 1000), # Int64 - continuous
    available = rand(Bool, 1000), # Bool - categorical
    category = categorical(rand(["A", "B", "C"], 1000))  # Categorical - categorical
)

model = lm(@formula(y ~ price + quantity + available + category), df)
compiled = compile_formula(model, Tables.columntable(df))

# Identify continuous variables
continuous_vars = continuous_variables(compiled, Tables.columntable(df))
# Returns: [:price, :quantity]

# Use for derivative evaluator construction
de_fd = derivativeevaluator_fd(compiled, Tables.columntable(df), continuous_vars)
de_ad = derivativeevaluator_ad(compiled, Tables.columntable(df), continuous_vars)

Use Cases

  • Pre-validation: Check variable suitability before building derivative evaluators
  • Automatic selection: Programmatically identify all differentiable variables
  • Error prevention: Avoid attempting derivatives on categorical variables
  • Model introspection: Understand variable roles in compiled formulas

Implementation Details

  • Scans LoadOp operations for direct variable usage (continuous indicators)
  • Identifies ContrastOp operations for categorical variable detection
  • Applies type checking to ensure Real element types in the actual data
  • Returns sorted list for consistent ordering across calls

See also: derivativeevaluator_fd, derivativeevaluator_ad, derivative_modelrow!


Computational primitives for GLM link function derivatives (used by Margins.jl for computing marginal effects on the mean response).

FormulaCompiler.supported_link_functions - Function
supported_link_functions() -> Vector{String}

Return the list of GLM link functions with implemented _dmu_deta methods.

Note: Link function support is determined by Julia's method dispatch. Any link function with a _dmu_deta method works automatically. This function provides a convenience list of commonly tested link functions.

Example

links = supported_link_functions()
println("Common GLM links: ", join(links, ", "))

Categorical Contrasts

FormulaCompiler.ContrastEvaluator - Type
ContrastEvaluator{T, Ops, S, O, NTMerged, CounterfactualTuple}

Zero-allocation evaluator for categorical and binary variable contrasts.

Provides efficient discrete marginal effects computation by pre-allocating all buffers and pre-computing categorical level mappings. Eliminates the ~2KB allocation overhead of the basic contrast_modelrow! function for batch contrast operations.

Uses typed counterfactual vectors for type-stable, zero-allocation performance.

Fields

  • compiled: Base compiled formula evaluator
  • vars: Variables available for contrast computation
  • data_counterfactual: Counterfactual data structure for variable substitution
  • counterfactuals: Tuple of typed CounterfactualVector{T} subtypes for each variable
  • y_from_buf: Pre-allocated buffer for "from" level evaluation
  • y_to_buf: Pre-allocated buffer for "to" level evaluation
  • row: Current row being processed

Performance

  • Zero allocations after construction for all contrast operations
  • Type stability via typed counterfactual vectors
  • Buffer reuse across multiple contrasts and rows
  • Type specialization for compiled formula operations

Usage

# One-time setup
evaluator = contrastevaluator(compiled, data, [:treatment, :education])
contrast_buf = Vector{Float64}(undef, length(compiled))

# Fast repeated contrasts (zero allocations)
for row in 1:n_rows
    contrast_modelrow!(contrast_buf, evaluator, row, :treatment, "Control", "Drug")
    # Process contrast_buf...
end

FormulaCompiler.contrastevaluator - Function
contrastevaluator(compiled, data, vars) -> ContrastEvaluator

Construct a ContrastEvaluator for efficient categorical and binary contrast computation.

Pre-allocates all necessary buffers and pre-computes categorical level mappings to eliminate allocations during contrast evaluation.

Arguments

  • compiled: Result from compile_formula(model, data)
  • data: Column-table data as NamedTuple
  • vars: Vector of variable symbols available for contrasts

Returns

  • ContrastEvaluator: Configured for zero-allocation contrast computation

Performance Notes

  • One-time cost: Setup involves building override structures and categorical mappings
  • Categorical optimization: Level mappings computed once, reused for all contrasts
  • Memory efficiency: Buffers sized exactly for the compiled formula

Example

# Setup for categorical contrasts
evaluator = contrastevaluator(compiled, data, [:group, :region, :binary_var])

# Zero-allocation usage
contrast_buf = Vector{Float64}(undef, length(compiled))
contrast_modelrow!(contrast_buf, evaluator, 1, :group, "Control", "Treatment")

FormulaCompiler.CategoricalLevelMap - Type
CategoricalLevelMap{Var, LevelTuple}

Stores pre-computed level mappings for a categorical variable in contrast evaluators.

Similar to ContrastOp, this struct uses type parameters for compile-time specialization while storing runtime level data as a field.

Type Parameters

  • Var::Symbol: Variable name (e.g., :group, :treatment)
  • LevelTuple: Type of the levels tuple (e.g., NTuple{3, Tuple{String, CategoricalValue{UInt32}}})

Fields

  • levels: Tuple of (level, CategoricalValue) pairs preserving natural level types

Example

# String categorical with 3 levels
CategoricalLevelMap{:group, NTuple{3, Tuple{String, CategoricalValue{UInt32}}}}(
    (("Control", catval1), ("Treatment", catval2), ("Placebo", catval3))
)

# Integer categorical with 5 levels
CategoricalLevelMap{:age_group, NTuple{5, Tuple{Int64, CategoricalValue{UInt32}}}}(
    ((1, catval1), (2, catval2), (3, catval3), (4, catval4), (5, catval5))
)

Performance

  • Zero allocations: All types concrete, fully specialized
  • Natural types: No String conversion needed for Int/Symbol levels
  • Fast lookup: Linear search through small tuple (2-10 levels typical)
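
The linear-search lookup can be pictured with a small sketch. It assumes a tuple of (level, code) pairs in which plain integers stand in for the CategoricalValue entries; `find_level` is a hypothetical helper, not a FormulaCompiler function.

```julia
# Hypothetical sketch: linear search through a small tuple of (level, code) pairs.
function find_level(levels::Tuple, target)
    for (level, code) in levels
        level == target && return code
    end
    throw(ArgumentError("level $(repr(target)) not found"))
end

levels = (("Control", 1), ("Treatment", 2), ("Placebo", 3))
code = find_level(levels, "Placebo")
```

For the handful of levels typical of a categorical variable, a scan like this avoids building a dictionary.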

FormulaCompiler.contrast_modelrow! - Function
contrast_modelrow!(Δ, evaluator, row, var, from, to) -> Δ

Compute discrete contrast using pre-allocated ContrastEvaluator (zero allocations).

Evaluates Δ = X(var=to) - X(var=from) using the evaluator's pre-allocated buffers and pre-computed categorical mappings for optimal performance.
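
As a worked illustration of Δ = X(var=to) - X(var=from), the sketch below uses a hand-written toy row for a formula with an intercept, x, a two-level treatment, and their interaction. The function `toy_row` and the dummy coding with "Control" as the reference level are assumptions made for this example only.

```julia
# Toy model row for 1 + x + treatment + x & treatment, with dummy coding
# and "Control" as the reference level of a two-level treatment.
function toy_row(x::Float64, treatment::String)
    d = treatment == "Drug" ? 1.0 : 0.0
    return [1.0, x, d, x * d]
end

x = 2.5
Δ = toy_row(x, "Drug") .- toy_row(x, "Control")   # X(var=to) - X(var=from)
```

Only the columns that involve the contrasted variable are nonzero in `Δ`; the intercept and the `x` column cancel.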

Arguments

  • Δ::AbstractVector{Float64}: Output contrast vector (modified in-place)
  • evaluator::ContrastEvaluator: Pre-configured contrast evaluator
  • row::Int: Row index to evaluate
  • var::Symbol: Variable to contrast (must be in evaluator.vars)
  • from: Reference level (baseline)
  • to: Target level (comparison)

Performance

  • Zero allocations - uses pre-allocated buffers from evaluator
  • Categorical optimization - uses pre-computed level mappings
  • Type specialization - compiled formula operations fully optimized

Error Handling

  • Validates that var exists in evaluator's variable list
  • Handles both categorical and numeric variable types
  • Provides clear error messages for invalid level specifications

Example

evaluator = contrastevaluator(compiled, data, [:treatment])
contrast_buf = Vector{Float64}(undef, length(compiled))

# Zero-allocation contrast computation
contrast_modelrow!(contrast_buf, evaluator, 1, :treatment, "Control", "Drug")
# contrast_buf now contains the discrete effect vector

FormulaCompiler.contrast_gradient! - Function
contrast_gradient!(∇β, evaluator, row, var, from, to, β, [link]) -> ∇β

Compute parameter gradients for discrete effects: ∂(discrete_effect)/∂β - zero allocations.

Computes the gradient of discrete marginal effects with respect to model parameters using the mathematical formula:

  • Linear scale (η): ∇β = ΔX = X₁ - X₀ (contrast vector)
  • Response scale (μ): ∇β = g'(η₁) × X₁ - g'(η₀) × X₀ (chain rule; g'(η) here denotes dμ/dη, the derivative of the inverse link g⁻¹)

This enables uncertainty quantification via the delta method: SE = √(∇β' Σ ∇β).
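
A minimal numeric sketch of the delta-method formula follows. The gradient and covariance values are made up for illustration and do not come from a fitted model.

```julia
using LinearAlgebra

# Illustrative values: a gradient vector and a coefficient covariance matrix.
∇β = [0.0, 0.0, 1.0, 2.5]
Σ = Matrix(Diagonal([0.04, 0.01, 0.09, 0.0016]))

se = sqrt(dot(∇β, Σ, ∇β))   # SE = √(∇β' Σ ∇β)
```

The three-argument `dot(x, A, y)` computes the quadratic form without forming the intermediate vector `Σ * ∇β`.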

Arguments

  • ∇β::AbstractVector{Float64}: Output gradient vector (modified in-place)
  • evaluator::ContrastEvaluator: Pre-configured contrast evaluator
  • row::Int: Row index to evaluate
  • var::Symbol: Variable to contrast (must be in evaluator.vars)
  • from: Reference level (baseline)
  • to: Target level (comparison)
  • β::AbstractVector{<:Real}: Model coefficients (used only for response-scale computation)
  • link: GLM link function (optional, defaults to linear scale)

Returns

  • ∇β: The same vector passed in, containing parameter gradients ∂(discrete_effect)/∂β

Performance

  • Zero allocations - uses pre-allocated buffers from evaluator
  • Link function support - handles all GLM links (Identity, Log, Logit, etc.)
  • Type flexibility - accepts any Real coefficient type, converts internally

Mathematical Method

Linear Scale (default):

discrete_effect = η₁ - η₀ = (X₁'β) - (X₀'β) = (X₁ - X₀)'β = ΔX'β
∇β = ΔX = X₁ - X₀

Response Scale (with link function):

discrete_effect = μ₁ - μ₀ = g⁻¹(η₁) - g⁻¹(η₀)
∇β = g'(η₁) × X₁ - g'(η₀) × X₀  (chain rule)
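
The two formulas can be checked numerically with a short sketch. It assumes a logit link, for which μ = 1 / (1 + exp(-η)) and dμ/dη = μ(1 - μ); the helper names `inv_logit` and `logistic_slope`, and all numeric values, are illustrative assumptions rather than package code.

```julia
using LinearAlgebra

# Assumed logit link: inverse link and its derivative dμ/dη.
inv_logit(η) = 1 / (1 + exp(-η))
logistic_slope(η) = inv_logit(η) * (1 - inv_logit(η))

β  = [0.2, -0.5, 0.8]
X0 = [1.0, 1.5, 0.0]    # model row with var = from
X1 = [1.0, 1.5, 1.0]    # model row with var = to

grad_linear = X1 .- X0  # linear scale: ∇β = ΔX

# Response scale: chain rule applied to μ₁ - μ₀.
effect(b) = inv_logit(dot(X1, b)) - inv_logit(dot(X0, b))
grad_response = logistic_slope(dot(X1, β)) .* X1 .- logistic_slope(dot(X0, β)) .* X0

# Cross-check the response-scale gradient with central differences in β.
h = 1e-6
grad_check = map(eachindex(β)) do j
    e = zeros(length(β)); e[j] = h
    (effect(β .+ e) - effect(β .- e)) / (2h)
end
```

The finite-difference values in `grad_check` agree with `grad_response` to within numerical error, which is a convenient sanity check when adding a new link.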

Example

evaluator = contrastevaluator(compiled, data, [:treatment])
∇β = Vector{Float64}(undef, length(compiled))

# Linear scale gradients (η = Xβ scale)
contrast_gradient!(∇β, evaluator, 1, :treatment, "Control", "Drug", β)

# Response scale gradients (μ = g⁻¹(η) scale)
link = GLM.LogitLink()
contrast_gradient!(∇β, evaluator, 1, :treatment, "Control", "Drug", β, link)

# Delta method standard error
se = sqrt(∇β' * vcov_matrix * ∇β)

Integration with Delta Method

Parameter gradients enable uncertainty quantification:

# Compute discrete effect + gradient simultaneously
discrete_effect = contrast_modelrow(evaluator, row, var, from, to)
contrast_gradient!(∇β, evaluator, row, var, from, to, β, link)

# Delta method confidence intervals
variance = ∇β' * vcov_matrix * ∇β
se = sqrt(variance)
ci_lower = discrete_effect - 1.96 * se
ci_upper = discrete_effect + 1.96 * se

FormulaCompiler.contrast_gradient - Function
contrast_gradient(evaluator, row, var, from, to, β, [link]) -> Vector{Float64}

Convenience version that allocates and returns the gradient vector.


Categorical Mixtures

Utilities for constructing and validating categorical mixtures used in efficient profile-based marginal effects.

FormulaCompiler.mix - Function
mix(pairs...)

Convenient constructor for CategoricalMixture from level => weight pairs. This is the main user-facing function for creating mixture specifications.

Arguments

  • pairs...: Level => weight pairs (e.g., "A" => 0.3, "B" => 0.7)

Returns

  • CategoricalMixture: Validated mixture object ready for use with FormulaCompiler

Examples

# Basic categorical mixture
group_mix = mix("Control" => 0.4, "Treatment" => 0.6)

# Educational composition
education_mix = mix("high_school" => 0.4, "college" => 0.4, "graduate" => 0.2)

# Regional distribution using symbols
region_mix = mix(:urban => 0.7, :rural => 0.3)

# Boolean mixture (30% false, 70% true)
treated_mix = mix(false => 0.3, true => 0.7)

# Works with any comparable type
age_group_mix = mix("young" => 0.25, "middle" => 0.50, "old" => 0.25)

Validation

The mix() function automatically validates:

  • At least one level => weight pair is provided
  • All weights are non-negative
  • Weights sum to 1.0 (within numerical tolerance)
  • All levels are unique
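
The four checks can be sketched in plain Julia as follows. This is a hypothetical re-implementation for illustration, not the package's code; the function name `check_mixture` and the tolerance `atol` are assumptions.

```julia
# Hypothetical sketch of the four validation rules listed above.
function check_mixture(level_weights::Pair...; atol = 1e-10)
    isempty(level_weights) &&
        throw(ArgumentError("need at least one level => weight pair"))
    levels  = [first(p) for p in level_weights]
    weights = [last(p) for p in level_weights]
    all(w -> w >= 0, weights) ||
        throw(ArgumentError("weights must be non-negative"))
    isapprox(sum(weights), 1.0; atol = atol) ||
        throw(ArgumentError("weights must sum to 1.0"))
    allunique(levels) ||
        throw(ArgumentError("levels must be unique"))
    return true
end

ok = check_mixture("Control" => 0.4, "Treatment" => 0.6)

rejected = try
    check_mixture("A" => 0.5, "B" => 0.6)   # weights sum to 1.1
    false
catch err
    err isa ArgumentError
end
```

Validating at construction time means later evaluation code can assume the weights are well formed.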

Integration with FormulaCompiler

CounterfactualVector Pattern for Categorical Mixtures

The unified row-wise architecture provides efficient single-row mixture perturbations:

using FormulaCompiler, GLM, DataFrames, Tables, Statistics

# Prepare data with mixture column
df = DataFrame(
    y = randn(1000),
    x = randn(1000),
    group = fill(mix("A" => 0.4, "B" => 0.6), 1000)  # Baseline mixture
)
data = Tables.columntable(df)

# Compile formula
model = lm(@formula(y ~ x * group), df)
compiled = compile_formula(model, data)

# Pattern 1: Single-row mixture perturbation
# Create counterfactual vector for mixture column
cf_mixture = counterfactualvector(data.group, 1)  # CategoricalMixtureCounterfactualVector

# Apply different mixture to specific row
new_mixture = mix("A" => 0.8, "B" => 0.2)  # Policy counterfactual
update_counterfactual_row!(cf_mixture, 500)  # Target row 500
update_counterfactual_replacement!(cf_mixture, new_mixture)

# Evaluate with counterfactual data
data_cf = (data..., group=cf_mixture)
output = Vector{Float64}(undef, length(compiled))
compiled(output, data_cf, 500)  # Row 500 uses new mixture, others use baseline

# Pattern 2: Population marginal effects with mixture profiles
# Assumption: the effect of interest is the change in the linear predictor X'β.
using GLM, LinearAlgebra, Statistics  # coef, dot, mean

function mixture_marginal_effects(model, data, base_mixture, alt_mixture)
    compiled = compile_formula(model, data)
    cf_mixture = counterfactualvector(data.group, 1)
    data_cf = (data..., group=cf_mixture)

    β = coef(model)
    n_rows = length(data.x)
    row_buf = Vector{Float64}(undef, length(compiled))  # holds one full model row
    baseline_effects = Vector{Float64}(undef, n_rows)
    alternative_effects = Vector{Float64}(undef, n_rows)

    for row in 1:n_rows
        update_counterfactual_row!(cf_mixture, row)

        # Baseline mixture
        update_counterfactual_replacement!(cf_mixture, base_mixture)
        compiled(row_buf, data_cf, row)
        baseline_effects[row] = dot(row_buf, β)

        # Alternative mixture
        update_counterfactual_replacement!(cf_mixture, alt_mixture)
        compiled(row_buf, data_cf, row)
        alternative_effects[row] = dot(row_buf, β)
    end

    return mean(alternative_effects .- baseline_effects)
end

# Example: Policy effect of changing group composition
base_mix = mix("A" => 0.4, "B" => 0.6)
policy_mix = mix("A" => 0.7, "B" => 0.3)
effect = mixture_marginal_effects(model, data, base_mix, policy_mix)

Reference Grid Pattern

For systematic marginal effects computation across different mixture profiles:

# Create reference grid with multiple mixture specifications
mixtures = [
    mix("A" => 1.0, "B" => 0.0),    # Pure A
    mix("A" => 0.5, "B" => 0.5),    # Balanced
    mix("A" => 0.0, "B" => 1.0)     # Pure B
]

# Evaluate effects across all mixture profiles
# (reuses model, data, and compiled from the example above)
n_rows = length(data.x)
β = coef(model)
row_buf = Vector{Float64}(undef, length(compiled))  # one full model-matrix row
effects_by_mixture = Vector{Float64}(undef, length(mixtures))
cf_mixture = counterfactualvector(data.group, 1)
data_cf = (data..., group=cf_mixture)

for (i, mixture_spec) in enumerate(mixtures)
    update_counterfactual_replacement!(cf_mixture, mixture_spec)

    # Compute the average linear predictor across all rows for this mixture
    row_effects = Vector{Float64}(undef, n_rows)
    for row in 1:n_rows
        update_counterfactual_row!(cf_mixture, row)
        compiled(row_buf, data_cf, row)
        row_effects[row] = sum(β[j] * row_buf[j] for j in eachindex(β))
    end
    effects_by_mixture[i] = mean(row_effects)
end

Performance

Mixture creation is lightweight and validation happens at construction time. The resulting CategoricalMixture objects are compiled into zero-allocation evaluators by FormulaCompiler's compilation system.

FormulaCompiler.CategoricalMixture - Type
CategoricalMixture{T}

Represents a weighted mixture of categorical levels. Use it to describe population-composition scenarios and to compute marginal effects over those compositions.

Fields

  • levels::Vector{T}: Categorical levels (strings, symbols, booleans, or other types)
  • weights::Vector{Float64}: Associated weights (must sum to 1.0)

Example

# Educational composition mixture
edu_mix = CategoricalMixture(["high_school", "college"], [0.6, 0.4])

# Using the convenient mix() constructor
treatment_mix = mix("control" => 0.4, "treatment" => 0.6)
boolean_mix = mix(false => 0.3, true => 0.7)

Validation

  • Levels and weights must have the same length
  • All weights must be non-negative
  • Weights must sum to 1.0 (within tolerance)
  • Levels must be unique

Integration with FormulaCompiler

CategoricalMixture objects are automatically detected by FormulaCompiler's compilation system and compiled into efficient zero-allocation evaluators using MixtureContrastOp.
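
One way to picture what a mixture contributes to a model-matrix row: under dummy coding, each level has a contrast row, and a mixture can be represented by the weight-averaged row. The sketch below assumes that weighted-average interpretation and uses a hand-written contrast matrix. `mixture_contrast_row`, `data_levels`, and `contrasts` are hypothetical names for illustration, not the internals of MixtureContrastOp.

```julia
# Dummy (treatment) coding for levels A, B, C with A as the reference level.
# Row i holds the indicator columns for data_levels[i].
data_levels = ["A", "B", "C"]
contrasts = [0.0 0.0
             1.0 0.0
             0.0 1.0]

# Hypothetical helper: weight-average the contrast rows of the mixture's levels.
function mixture_contrast_row(mix_levels, mix_weights, data_levels, contrasts)
    row = zeros(size(contrasts, 2))
    for (lvl, w) in zip(mix_levels, mix_weights)
        idx = findfirst(==(lvl), data_levels)
        idx === nothing && throw(ArgumentError("level $lvl not found in data levels"))
        row .+= w .* view(contrasts, idx, :)
    end
    return row
end

# 30% A (reference, all zeros) and 70% B
row = mixture_contrast_row(["A", "B"], [0.3, 0.7], data_levels, contrasts)
```

With a pure mixture such as `"C" => 1.0`, the helper returns exactly the contrast row of level C, so an ordinary categorical value is the special case of a one-level mixture.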

FormulaCompiler.MixtureWithLevels - Type
MixtureWithLevels{T}

Wrapper that pairs a mixture with the original categorical levels of the data column, giving FormulaCompiler's compilation system type-safe access to both.

Fields

  • mixture::CategoricalMixture{T}: The core mixture specification
  • original_levels::Vector{String}: Original levels from the data column

Usage

This type is used internally by FormulaCompiler's scenario system to provide type-safe mixture processing with access to both mixture specifications and original data structure.

# Usually created automatically by FormulaCompiler's scenario system
mixture = mix("A" => 0.3, "B" => 0.7)
original_levels = ["A", "B", "C"]  # From the actual data column
wrapper = MixtureWithLevels(mixture, original_levels)

# Direct property access
wrapper.mixture.levels     # Access to mixture levels
wrapper.mixture.weights    # Access to mixture weights
wrapper.original_levels    # Access to original data levels

FormulaCompiler.validate_mixture_against_data - Function
validate_mixture_against_data(mixture::CategoricalMixture, col, var::Symbol)

Validate that all levels in the mixture exist in the actual data column. Throws ArgumentError if any mixture levels are not found in the data.

Arguments

  • mixture::CategoricalMixture: The mixture specification to validate
  • col: The data column to validate against
  • var::Symbol: Variable name for error reporting

Throws

  • ArgumentError: If mixture contains levels not found in the data

Examples

using FormulaCompiler, CategoricalArrays  # categorical() comes from CategoricalArrays.jl

# Validate mixture against categorical data
data_col = categorical(["A", "B", "C", "A", "B"])
mixture = mix("A" => 0.5, "B" => 0.5)
validate_mixture_against_data(mixture, data_col, :group)  # ✓ Valid

# This throws an ArgumentError because "X" is not a level of data_col
bad_mixture = mix("A" => 0.5, "X" => 0.5)
validate_mixture_against_data(bad_mixture, data_col, :group)  # ✗ ArgumentError

This function is used internally by FormulaCompiler's scenario system to ensure mixture specifications are compatible with the actual data.

FormulaCompiler.mixture_to_scenario_value - Function
mixture_to_scenario_value(mixture::CategoricalMixture, original_col)

Convert a categorical mixture to a representative value for FormulaCompiler scenario creation. Uses weighted average encoding to provide a smooth, continuous representation.

Strategy

  • CategoricalArray: Weighted average of level indices
  • Bool: Probability of true (equivalent to current fractional Bool support)
  • Other: Weighted average of sorted unique level indices

Arguments

  • mixture::CategoricalMixture: The mixture to convert
  • original_col: The original data column for context

Returns

  • Float64: Continuous representation of the mixture

Examples

# Boolean mixture -> probability of true
bool_mix = mix(false => 0.3, true => 0.7)
mixture_to_scenario_value(bool_mix, [true, false, true]) # -> 0.7

# Categorical mixture -> weighted average of level indices
# (categorical() comes from CategoricalArrays.jl)
cat_mix = mix("A" => 0.6, "B" => 0.4)
cat_col = categorical(["A", "B", "C"])
mixture_to_scenario_value(cat_mix, cat_col) # -> 1.4 (0.6*1 + 0.4*2)

This function is used internally by FormulaCompiler's scenario system to convert mixture specifications into values that can be used with the existing override system.
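
The weighted-average strategy can be sketched without the package. The helper below is a hypothetical stand-in written from the description above (`weighted_level_index` is not a FormulaCompiler function), and it assumes levels are indexed from 1 in the order they appear in the data's level list.

```julia
# Hypothetical helper mirroring the "weighted average of level indices" strategy.
function weighted_level_index(mix_levels, mix_weights, data_levels)
    total = 0.0
    for (lvl, w) in zip(mix_levels, mix_weights)
        idx = findfirst(==(lvl), data_levels)   # 1-based position of the level
        idx === nothing && throw(ArgumentError("level $lvl not found in data levels"))
        total += w * idx
    end
    return total
end

# Categorical case: 0.6 * 1 + 0.4 * 2, approximately 1.4
cat_value = weighted_level_index(["A", "B"], [0.6, 0.4], ["A", "B", "C"])

# Bool case: treating false as 0 and true as 1, the average is the probability of true
bool_value = sum(w * Int(lvl) for (lvl, w) in zip([false, true], [0.3, 0.7]))
```

Note that the encoding depends on level order: the same weights over differently ordered levels give a different value, which is why the original column is passed in for context.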


Utilities

FormulaCompiler.not - Function
not(x)

Logical NOT operation for use in formula specifications.

Arguments

  • x::Bool: Boolean value to negate
  • x::Real: Numeric value to complement, typically a probability in [0, 1]

Returns

  • For Bool: The opposite boolean value
  • For Real: The complement (1 - x)

Example

# In a formula
model = lm(@formula(y ~ not(treatment)), df)

# For probabilities
p = 0.3
q = not(p)  # 0.7

Warning

For Real values, not assumes x lies in [0, 1]. No bounds checking is performed, so a value outside that range is returned as 1 - x without error.
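
The two behaviors map naturally onto two methods that dispatch on the argument type. The sketch below is written from the description above, with `not_sketch` as a hypothetical stand-in name so it does not shadow the package function. It also shows what the missing bounds check means in practice.

```julia
# Hypothetical stand-in for FormulaCompiler.not, based only on the documented behavior.
not_sketch(x::Bool) = !x      # logical negation
not_sketch(x::Real) = 1 - x   # complement; Bool <: Real, but the Bool method is more specific

negated = not_sketch(true)       # false
complement = not_sketch(0.25)    # 0.75
out_of_range = not_sketch(1.5)   # -0.5: outside [0, 1], and nothing rejects it
```

If inputs may fall outside [0, 1], validate them before calling, since the result is silently a value that is no longer a probability.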
