# Benchmark Protocol

## Purpose

- Provide a reproducible method to validate performance and allocation claims.
- Report results with context and conservative interpretation.

## Environment

- Julia: record `VERSION` (e.g., 1.10.x or 1.11.x)
- CPU: model, frequency, cores; `Sys.CPU_NAME` if available
- OS: name and version
- Threads: `Threads.nthreads()` and BLAS threads (`BLAS.get_num_threads()`)
- Packages: `Pkg.status()` for FormulaCompiler, GLM, MixedModels, ForwardDiff, Tables, DataFrames

## Runner Script

- A convenience runner is provided at `scripts/benchmarks.jl`.
- Examples:
  - `julia --project=. scripts/benchmarks.jl` (default subset)
  - `julia --project=. scripts/benchmarks.jl core deriv margins se` (select tasks)
- Prints per-benchmark median/min time and minimum memory; include this output with your Environment section.

## Minimal runner

- Edit options at the top of `scripts/benchmarks_simple.jl` and run: `julia --project=. scripts/benchmarks_simple.jl`
- Set `selected`, `fast`, `out`, `file`, `tag` in the script; no CLI flags needed.

## Setup

- Activate project: `Pkg.activate("..")`; `Pkg.instantiate()`
- Using: `using BenchmarkTools, FormulaCompiler, GLM, MixedModels, Tables, DataFrames, CategoricalArrays, Random`
- For marginal effects benchmarks (sections 5-6): `using Margins` (see https://github.com/emfeltham/Margins.jl)
- Data format: prefer `Tables.columntable(df)` for evaluation
- Warmup: run each function once before benchmarking
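
A minimal sketch of this setup; the simulated `df` and its column names are illustrative (they match the formula used in benchmark 1 below):

```julia
using Pkg
Pkg.activate(".."); Pkg.instantiate()

using BenchmarkTools, FormulaCompiler, GLM, Tables, DataFrames, CategoricalArrays, Random

# Illustrative data; any Tables.jl-compatible source works
Random.seed!(42)
n = 10_000
df = DataFrame(
    y = randn(n),
    x = randn(n),
    z = randn(n),
    group = categorical(rand(["a", "b", "c"], n)),
)

data = Tables.columntable(df)  # column-table layout preferred for evaluation
```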

## Conventions

- Use `@benchmark` with concrete, interpolated arguments; avoid benchmarking through non-constant globals
- Record: `minimum(time)`, `median(time)`, `minimum(memory)`
- Target allocations: FD paths 0 bytes; AD paths ≤512 bytes (ForwardDiff overhead)
- Present ranges, not single points; note environment details
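
For example, interpolate concrete arguments with `$` so BenchmarkTools measures the call itself rather than global-variable access:

```julia
using BenchmarkTools

x = rand(1000)
@benchmark sum(x)   # benchmarks through a non-constant global; misleading
@benchmark sum($x)  # interpolates the concrete array; use this form
```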

## Benchmarks

### 1. Core Row Evaluation (Compiled)

- Model: small GLM (e.g., `y ~ x * group + log1p(abs(z))` with categorical `group`)
- Steps:
  - Fit the model with GLM; build `data = Tables.columntable(df)`
  - `compiled = compile_formula(model, data)`
  - `row = Vector{Float64}(undef, length(compiled))`
  - Single row: `@benchmark $compiled($row, $data, 25)`
  - Tight loop (amortized): call inside a loop over indices; check stability
- Targets: ≈O(10^1–10^2) ns; `minimum(memory) == 0`
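
A sketch of the full sequence, continuing from the Setup data above (the row index 25 and loop bounds are arbitrary):

```julia
model = lm(@formula(y ~ x * group + log1p(abs(z))), df)
compiled = compile_formula(model, data)
row = Vector{Float64}(undef, length(compiled))

compiled(row, data, 1)  # warmup

# Single-row evaluation: expect tens to hundreds of ns and 0 bytes
b = @benchmark $compiled($row, $data, 25)
println(minimum(b).time, " ns, ", b.memory, " bytes")

# Tight loop (amortized): should also remain allocation-free
@benchmark for i in 1:1_000
    $compiled($row, $data, i)
end
```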

### 2. Allocating vs In-place Interfaces

- Compare `modelrow(model, data, i)` vs `modelrow!` with a preallocated buffer
- Targets: in-place 0 bytes; allocating shows the expected vector/matrix allocations
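
A sketch continuing from benchmark 1; the `modelrow!` argument order shown here is an assumption (check the FormulaCompiler docs):

```julia
# Allocating interface: returns a fresh vector per call
@benchmark modelrow($model, $data, 25)

# In-place interface with a preallocated buffer: expect 0 bytes
# (argument order assumed; see the package docs)
@benchmark modelrow!($row, $compiled, $data, 25)
```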

### 3. Counterfactual Overhead

- Build counterfactual data: `data_cf = merge(data, (x = fill(2.0, length(data.x)),))`
- Compare evaluation: `@benchmark $compiled($row, $data, 25)` vs `@benchmark $compiled($row, $data_cf, 25)`
- Target: identical times within noise; 0 allocations for both
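
Spelled out (the override value 2.0 is arbitrary):

```julia
# Counterfactual: replace column x with a constant, keeping all other columns
data_cf = merge(data, (x = fill(2.0, length(data.x)),))

@benchmark $compiled($row, $data, 25)     # baseline
@benchmark $compiled($row, $data_cf, 25)  # counterfactual; should match within noise
```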

### 4. Derivative Jacobian (AD and FD)

- Build evaluators: `de_ad = derivativeevaluator(:ad, compiled, data, [:x, :z])`; `de_fd = derivativeevaluator(:fd, compiled, data, [:x, :z])`
- AD Jacobian: `J = similar(rand(length(compiled), 2)); @benchmark derivative_modelrow!($J, $de_ad, 25)`
- FD Jacobian: `@benchmark derivative_modelrow!($J, $de_fd, 25)`
- FD single-column: `col = similar(rand(length(compiled))); @benchmark fd_jacobian_column!($col, $de_fd, 1, 1)`
- Targets: AD ≤512 bytes; FD 0 bytes
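
Assembled into a runnable sketch, continuing from benchmark 1 (buffers built with `Matrix{Float64}(undef, …)` in place of `similar(rand(…))`):

```julia
vars = [:x, :z]
de_ad = derivativeevaluator(:ad, compiled, data, vars)
de_fd = derivativeevaluator(:fd, compiled, data, vars)

J = Matrix{Float64}(undef, length(compiled), length(vars))
@benchmark derivative_modelrow!($J, $de_ad, 25)  # AD: expect ≤512 bytes
@benchmark derivative_modelrow!($J, $de_fd, 25)  # FD: expect 0 bytes

# Single FD Jacobian column (row 1, variable 1)
col = Vector{Float64}(undef, length(compiled))
@benchmark fd_jacobian_column!($col, $de_fd, 1, 1)
```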

### 5. Marginal Effects (η and μ) [Requires Margins.jl]

- Setup: `using Margins`; `β = coef(model)`; `g = similar(rand(2))` (assuming 2 continuous variables)
- η-scale AD: `@benchmark marginal_effects_eta!($g, $de_ad, $β, 25)`
- η-scale FD: `@benchmark marginal_effects_eta!($g, $de_fd, $β, 25)`
- μ-scale with link (e.g., Logit): `@benchmark marginal_effects_mu!($g, $de_ad, $β, 25, LogitLink())` and `@benchmark marginal_effects_mu!($g, $de_fd, $β, 25, LogitLink())`
- Targets: FD 0 bytes; AD ≤512 bytes
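
A sketch of the four calls, continuing from benchmark 4; `g` is sized to one slot per variable in `vars`, and pairing `LogitLink` (from GLM) with the Gaussian model fitted above is purely illustrative:

```julia
using Margins  # https://github.com/emfeltham/Margins.jl

β = coef(model)
g = Vector{Float64}(undef, length(vars))  # one gradient slot per variable

@benchmark marginal_effects_eta!($g, $de_ad, $β, 25)              # η-scale, AD
@benchmark marginal_effects_eta!($g, $de_fd, $β, 25)              # η-scale, FD
@benchmark marginal_effects_mu!($g, $de_ad, $β, 25, LogitLink())  # μ-scale, AD
@benchmark marginal_effects_mu!($g, $de_fd, $β, 25, LogitLink())  # μ-scale, FD
```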

### 6. Delta Method SE [Requires Margins.jl]

- Setup: `using Margins`; `gβ = similar(rand(length(β)))`; `Σ = I*1.0` (or `vcov(model)`)
- Benchmark: `@benchmark delta_method_se($gβ, $Σ)`
- Target: 0 bytes, O(10^1) ns for a small dense Σ
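
A sketch with a dense identity matrix standing in for `vcov(model)` and an illustrative gradient vector:

```julia
using LinearAlgebra  # for I

gβ = randn(length(β))                   # illustrative gradient w.r.t. β
Σ = Matrix(1.0I, length(β), length(β))  # dense stand-in; or vcov(model)

@benchmark delta_method_se($gβ, $Σ)  # expect 0 bytes, tens of ns for small dense Σ
```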

### 7. MixedModels Fixed Effects

- Fit a small LMM/GLMM; compile with fixed-effects extraction
- Benchmark compiled row evaluation as in (1)
- Target: same guarantees (0 bytes; similar timing)
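
A sketch, assuming `compile_formula` accepts a fitted `MixedModel` and extracts the fixed-effects design; the `subj` grouping column is hypothetical:

```julia
using MixedModels

df.subj = categorical(rand(1:50, nrow(df)))  # hypothetical grouping factor
lmm = fit(MixedModel, @formula(y ~ x + z + (1 | subj)), df)

data_m = Tables.columntable(df)
compiled_m = compile_formula(lmm, data_m)  # assumed fixed-effects entry point
row_m = Vector{Float64}(undef, length(compiled_m))

@benchmark $compiled_m($row_m, $data_m, 25)  # expect 0 bytes, as in (1)
```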

### 8. Scaling and Complexity

- Simple vs complex formulas; small vs larger `OutputSize`
- Report time growth and allocation behavior (allocations should remain 0 for core paths)
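
One way to sweep formula complexity on the same data (the formulas are illustrative; `abs2(z)` stands in for a quadratic term):

```julia
formulas = [
    @formula(y ~ x),
    @formula(y ~ x + group),
    @formula(y ~ x * group + log1p(abs(z)) + abs2(z)),
]

for f in formulas
    m = lm(f, df)
    c = compile_formula(m, data)
    r = Vector{Float64}(undef, length(c))
    b = @benchmark $c($r, $data, 1)
    println(length(c), " columns: ", minimum(b).time, " ns, ", b.memory, " bytes")
end
```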

### 9. Size Invariance (Per-Row)

- Use `scale_n` to run the same per-row evaluation on increasing data sizes (e.g., 10k, 100k, 1M)
- Expectation: per-row latency and allocations remain effectively constant (0 B) as `n` increases
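
If you prefer not to use the runner's `scale_n` helper, a manual sweep looks like this (the sizes and data generator are illustrative):

```julia
for n in (10_000, 100_000, 1_000_000)
    df_n = DataFrame(y = randn(n), x = randn(n), z = randn(n),
                     group = categorical(rand(["a", "b", "c"], n)))
    data_n = Tables.columntable(df_n)
    m_n = lm(@formula(y ~ x * group + log1p(abs(z))), df_n)
    c_n = compile_formula(m_n, data_n)
    r_n = Vector{Float64}(undef, length(c_n))
    b = @benchmark $c_n($r_n, $data_n, 1)
    println(n, ": ", minimum(b).time, " ns, ", b.memory, " bytes")  # should be flat in n
end
```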

## Reporting Template

- Environment summary
- For each benchmark: code snippet, median/min time, allocations, brief interpretation
- Deviations: note and investigate; check warmup, data format, thread counts

## Edge Cases

- Integer columns: verify automatic Float64 conversion in evaluator paths
- Large tuples: ensure hybrid dispatch preserves 0 allocations
- Categorical mixtures: include a case if used (see mixtures guide)
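
A quick check for the integer-column case (hypothetical data; the 0-byte assertion reflects the allocation targets above):

```julia
# Integer predictor: evaluator paths should handle the Int -> Float64 conversion
df_int = DataFrame(y = randn(1_000), x = rand(1:10, 1_000))
data_int = Tables.columntable(df_int)
m_int = lm(@formula(y ~ x), df_int)
c_int = compile_formula(m_int, data_int)
r_int = Vector{Float64}(undef, length(c_int))

b = @benchmark $c_int($r_int, $data_int, 1)
@assert b.memory == 0  # the conversion should not allocate
```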

## Interpretation

- Treat numbers as indicative; absolute values vary by system
- Prioritize allocation guarantees and scaling trends over exact nanoseconds