# Examples
Real-world examples demonstrating FormulaCompiler.jl capabilities across multiple domains.
For fundamental usage patterns, see Basic Usage. For advanced computational techniques, see Advanced Features.
## Quick Reference
Essential patterns for immediate use. All examples use synthetic data for simplicity and self-containment.
### Core Compilation Pattern

```julia
using FormulaCompiler, GLM, DataFrames, Tables
# Basic setup
df = DataFrame(x = randn(100), y = randn(100), group = rand(["A", "B"], 100))
model = lm(@formula(y ~ x * group), df)
data = Tables.columntable(df)
# Compile and evaluate
compiled = compile_formula(model, data)
output = Vector{Float64}(undef, length(compiled))
compiled(output, data, 1)  # Zero allocations
```

### Performance Validation

```julia
using BenchmarkTools
# Verify zero allocations
result = @benchmark $compiled($output, $data, 1)
@assert result.memory == 0 "Expected zero allocations"
```

### Counterfactual Analysis

```julia
# Single policy scenario using direct data modification
n_rows = length(data.x)
data_policy = merge(data, (x = fill(2.0, n_rows), group = fill("A", n_rows)))
compiled(output, data_policy, 1) # Evaluate with modified data
# Multi-scenario analysis
x_values = [-1.0, 0.0, 1.0]
group_values = ["A", "B"]
results = Matrix{Float64}(undef, 6, length(compiled)) # 3×2 = 6 scenarios
# enumerate over the scenario grid; this avoids incrementing a global
# counter inside the loop (a soft-scope pitfall in scripts)
for (scenario_idx, (group_val, x_val)) in enumerate(Iterators.product(group_values, x_values))
    data_scenario = merge(data, (x = fill(x_val, n_rows), group = fill(group_val, n_rows)))
    compiled(view(results, scenario_idx, :), data_scenario, 1)
end
```

### Marginal Effects

```julia
using Margins # Provides marginal_effects_eta! and marginal_effects_mu!
# Identify continuous variables and build evaluators
vars = continuous_variables(compiled, data) # [:x]
de_ad = derivativeevaluator(:ad, compiled, data, vars) # Automatic differentiation: higher accuracy (preferred)
de_fd = derivativeevaluator(:fd, compiled, data, vars) # Finite differences: explicit step control
β = coef(model)
# Compute marginal effects
g = Vector{Float64}(undef, length(vars))
marginal_effects_eta!(g, de_ad, β, 1) # Zero allocations, higher accuracy
marginal_effects_eta!(g, de_fd, β, 1)  # Zero allocations, explicit step control
```

### Batch Processing

```julia
# Multiple rows at once
n_rows = 50
results = Matrix{Float64}(undef, n_rows, length(compiled))
for i in 1:n_rows
    compiled(view(results, i, :), data, i)  # Zero allocations per call
end
```

## Domain Applications
Real-world applications using authentic datasets from RDatasets.jl. These examples demonstrate FormulaCompiler.jl's capabilities with data structures practitioners encounter in their domains.
### Economics: Labor Market Analysis
Wage determination analysis using real Belgian wage data:

```julia
using RDatasets, FormulaCompiler, GLM, Tables, Statistics, LinearAlgebra  # Statistics for mean/median, LinearAlgebra for dot
# Load real wage data from Belgian labor market
wages = dataset("Ecdat", "Bwages") # 1472 observations
# Columns: Wage (hourly wage), Educ (education years), Exper (experience), Sex
# Convert categorical variable properly
wages.Male = wages.Sex .== "male"
data = Tables.columntable(wages)
# Wage determination model with gender interaction
model = lm(@formula(log(Wage) ~ Educ * Male + Exper + I(Exper^2)), wages)
compiled = compile_formula(model, data)
```

#### Policy Analysis with Scenarios
Analyze gender pay gap under different policy scenarios:

```julia
# Define policy scenarios using direct data modification
mean_educ = mean(wages.Educ)
mean_exper = mean(wages.Exper)
n_rows = length(data.Educ)
output = Vector{Float64}(undef, length(compiled))
β = coef(model)
# Create scenario data
scenarios = [
    ("status_quo", data),
    ("equal_education", merge(data, (Educ = fill(mean_educ, n_rows),))),
    ("equal_experience", merge(data, (Exper = fill(mean_exper, n_rows),))),
    ("standardized_worker", merge(data, (Educ = fill(mean_educ, n_rows), Exper = fill(mean_exper, n_rows)))),
]
# Policy scenario analysis
for (scenario_name, scenario_data) in scenarios
    # Sample analysis for the first 100 workers
    wages_predicted = Float64[]
    for worker in 1:100
        compiled(output, scenario_data, worker)
        push!(wages_predicted, exp(dot(β, output)))  # Convert from log scale
    end
    mean_wage = mean(wages_predicted)
    println("$(scenario_name): Mean predicted wage = \$$(round(mean_wage, digits=2))")
end
```

#### Marginal Effects on Wages
Compute returns to education and experience:

```julia
using Margins  # Provides marginal_effects_eta!
# Build derivative evaluator for continuous variables
continuous_vars = [:Educ, :Exper] # Education and experience are continuous
de_fd = derivativeevaluator(:fd, compiled, data, continuous_vars)
# Compute marginal effects for representative workers
g_eta = Vector{Float64}(undef, length(continuous_vars))
# Male worker with median characteristics
male_median_idx = findfirst(row -> row.Male && row.Educ ≈ median(wages.Educ), eachrow(wages))
if !isnothing(male_median_idx)
    marginal_effects_eta!(g_eta, de_fd, β, male_median_idx)
    println("Male median worker - Returns to education: $(round(g_eta[1]*100, digits=2))%")
    println("Male median worker - Returns to experience: $(round(g_eta[2]*100, digits=2))%")
end
```

### Engineering: Automotive Performance Analysis
Engine performance modeling using the classic mtcars dataset:

```julia
using RDatasets, FormulaCompiler, GLM, Tables, Statistics, LinearAlgebra
# Load automotive engineering data
mtcars = dataset("datasets", "mtcars") # 32 classic cars
# Key variables: MPG (fuel efficiency), HP (horsepower), WT (weight), Cyl (cylinders)
data = Tables.columntable(mtcars)
# Fuel efficiency model with engine characteristics
model = lm(@formula(MPG ~ HP * WT + Cyl + log(HP)), mtcars)
compiled = compile_formula(model, data)
```

#### Engineering Design Scenarios
Analyze fuel efficiency under different design specifications:

```julia
# Engineering design analysis using direct data modification
base_hp = mean(mtcars.HP)
base_wt = mean(mtcars.WT)
n_rows = length(data.HP)
# Define design parameter combinations
hp_values = [base_hp * 0.8, base_hp, base_hp * 1.2] # -20%, baseline, +20% power
wt_values = [base_wt * 0.9, base_wt, base_wt * 1.1] # -10%, baseline, +10% weight
cyl_values = [4, 6, 8] # Engine configurations
# Performance analysis across design space (3×3×3 = 27 scenarios)
output = Vector{Float64}(undef, length(compiled))
β = coef(model)
scenario_results = Float64[]
scenario_descriptions = String[]
for hp_val in hp_values, wt_val in wt_values, cyl_val in cyl_values
    # Create scenario data
    data_scenario = merge(data, (
        HP = fill(hp_val, n_rows),
        WT = fill(wt_val, n_rows),
        Cyl = fill(cyl_val, n_rows),
    ))
    # Evaluate the reference car design with these parameters
    compiled(output, data_scenario, 1)
    predicted_mpg = dot(β, output)
    push!(scenario_results, predicted_mpg)
    # Track scenario description
    hp_change = round((hp_val / base_hp - 1) * 100, digits=1)
    wt_change = round((wt_val / base_wt - 1) * 100, digits=1)
    push!(scenario_descriptions, "HP: $(hp_change)%, WT: $(wt_change)%, Cyl: $(cyl_val)")
end
best_scenario_idx = argmax(scenario_results)
best_mpg = scenario_results[best_scenario_idx]
println("Best design scenario: $(scenario_descriptions[best_scenario_idx])")
println("Predicted MPG: $(round(best_mpg, digits=2))")Performance Sensitivity Analysis
Compute sensitivity to design parameters:

```julia
using Margins  # Provides marginal_effects_eta!
# Marginal effects on fuel efficiency
engineering_vars = [:HP, :WT] # Continuous engineering parameters
de_fd = derivativeevaluator(:fd, compiled, data, engineering_vars)
g = Vector{Float64}(undef, length(engineering_vars))
# Analyze sensitivity for different vehicle classes
for (car_type, indices) in [("Sports cars", findall(x -> x > 200, mtcars.HP)),
                            ("Economy cars", findall(x -> x < 100, mtcars.HP))]
    if !isempty(indices)
        representative_idx = indices[1]
        marginal_effects_eta!(g, de_fd, β, representative_idx)
        println("$car_type sensitivity:")
        println("  MPG change per HP unit: $(round(g[1], digits=3))")
        println("  MPG change per 1000 lb of weight: $(round(g[2], digits=3))")  # mtcars WT is measured in 1000 lb
    end
end
```

### Biostatistics: Cancer Survival Analysis
Clinical outcome modeling using real lung cancer data:

```julia
using RDatasets, FormulaCompiler, GLM, DataFrames, Tables, Statistics, LinearAlgebra
# Load NCCTG lung cancer clinical trial data
lung = dataset("survival", "lung") # 228 patients
# Variables: time (survival days), status (censoring), age, sex, ph_ecog (ECOG performance score)
# Remove missing values for demonstration
lung_complete = dropmissing(lung, [:age, :sex, :ph_ecog])
data = Tables.columntable(lung_complete)
# Survival time model (using log transformation for demonstration)
# In practice, dedicated survival-analysis methods would be used
model = lm(@formula(log(time + 1) ~ age + sex + ph_ecog + age * sex), lung_complete)
compiled = compile_formula(model, data)
```

#### Clinical Scenarios
Analyze survival outcomes under different patient profiles:

```julia
# Clinical analysis using direct data modification
n_rows = length(data.age)
output = Vector{Float64}(undef, length(compiled))
β = coef(model)
# Define clinical scenarios
clinical_scenarios = [
    ("young_male", merge(data, (age = fill(50, n_rows), sex = fill(1, n_rows)))),
    ("old_male", merge(data, (age = fill(70, n_rows), sex = fill(1, n_rows)))),
    ("young_female", merge(data, (age = fill(50, n_rows), sex = fill(2, n_rows)))),
    ("old_female", merge(data, (age = fill(70, n_rows), sex = fill(2, n_rows)))),
    ("high_performance", merge(data, (ph_ecog = fill(0, n_rows),))),
    ("poor_performance", merge(data, (ph_ecog = fill(2, n_rows),))),
]
# Clinical outcome analysis
for (profile_name, scenario_data) in clinical_scenarios
    compiled(output, scenario_data, 1)
    predicted_log_survival = dot(β, output)
    predicted_days = exp(predicted_log_survival) - 1
    println("$(profile_name): Predicted survival = $(round(predicted_days, digits=0)) days")
end
```

#### Clinical Risk Factors
Quantify impact of patient characteristics on outcomes:

```julia
using Margins  # Provides marginal_effects_eta!
# Marginal effects for continuous clinical variables
clinical_vars = [:age] # Age is the main continuous predictor
de_fd = derivativeevaluator(:fd, compiled, data, clinical_vars)
g = Vector{Float64}(undef, length(clinical_vars))
# Risk assessment for different patient groups
for (group, sex_val) in [("Male patients", 1), ("Female patients", 2)]
    group_indices = findall(x -> x == sex_val, lung_complete.sex)
    if !isempty(group_indices)
        representative_idx = group_indices[div(length(group_indices), 2)]  # Median patient
        marginal_effects_eta!(g, de_fd, β, representative_idx)
        age_effect = g[1]                   # Effect of one year of age on log survival time
        pct_per_year = exp(age_effect) - 1  # Convert from log scale to proportional change
        println("$group - Age effect: $(round(pct_per_year*100, digits=2))% per year")
    end
end
```

### Social Sciences: Educational Outcomes
University admission analysis using UC Berkeley data:

```julia
using RDatasets, FormulaCompiler, GLM, DataFrames, Tables, Statistics, LinearAlgebra

# Load UC Berkeley admission data (famous dataset for Simpson's paradox)
ucb = dataset("datasets", "UCBAdmissions")
# This is aggregate data, so we'll expand it for modeling
# Create individual-level data from aggregate counts
# Collect per-cell DataFrames, then concatenate once; push! mutates the vector,
# so there is no rebinding of a global inside the loop
cases = DataFrame[]
for row in eachrow(ucb)
    n_cases = Int(row.Freq)
    push!(cases, DataFrame(
        Admitted = fill(row.Admit == "Admitted", n_cases),
        Gender = fill(row.Gender, n_cases),
        Department = fill(row.Dept, n_cases),
    ))
end
individual_data = reduce(vcat, cases)
# Convert to appropriate types
individual_data.Male = individual_data.Gender .== "Male"
data = Tables.columntable(individual_data)
# Admission probability model
model = glm(@formula(Admitted ~ Male * Department), individual_data, Binomial(), LogitLink())
compiled = compile_formula(model, data)
```

#### Educational Policy Analysis
Analyze admission scenarios across departments:

```julia
# Policy analysis using direct data modification
n_rows = length(data.Male)
output = Vector{Float64}(undef, length(compiled))
β = coef(model)
# Define policy scenarios
policy_scenarios = [
    ("all_male", merge(data, (Male = fill(true, n_rows),))),      # Everyone evaluated as male
    ("all_female", merge(data, (Male = fill(false, n_rows),))),   # Everyone evaluated as female
    ("dept_a_focus", merge(data, (Department = fill("A", n_rows),))),
    ("dept_f_focus", merge(data, (Department = fill("F", n_rows),))),
]
# Policy impact analysis
for (policy_name, scenario_data) in policy_scenarios
    admission_probs = Float64[]
    # Sample 100 cases for analysis
    sample_size = min(100, n_rows)
    for i in 1:sample_size
        compiled(output, scenario_data, i)
        linear_pred = dot(β, output)
        prob = 1 / (1 + exp(-linear_pred))  # Inverse logit
        push!(admission_probs, prob)
    end
    mean_prob = mean(admission_probs)
    println("$(policy_name): Mean admission probability = $(round(mean_prob*100, digits=1))%")
end
```

## Advanced Computational Patterns
High-performance applications combining multiple FormulaCompiler.jl features with authentic datasets.
### Large-Scale Monte Carlo Simulation
Bootstrap confidence intervals for economic policy effects:

```julia
using RDatasets, FormulaCompiler, GLM, DataFrames, Tables, Statistics, LinearAlgebra, Random
# Use wage data for bootstrap analysis
wages = dataset("Ecdat", "Bwages")
wages.Male = wages.Sex .== "male"
n_obs = nrow(wages)
# Policy model
base_model = lm(@formula(log(Wage) ~ Educ * Male + Exper), wages)
base_data = Tables.columntable(wages)
compiled = compile_formula(base_model, base_data)
# Bootstrap function for policy effect estimation
function bootstrap_policy_effect(n_bootstrap=1000)
    Random.seed!(123)  # Reproducible results
    policy_effects = Float64[]
    output = Vector{Float64}(undef, length(compiled))
    β = coef(base_model)
    # The policy scenario does not change across bootstrap draws, so build it once
    n_rows = length(base_data.Educ)
    policy_data = merge(base_data, (Educ = fill(mean(wages.Educ), n_rows),))
    for b in 1:n_bootstrap
        # Bootstrap sample of row indices (coefficients are held fixed, so this
        # captures sampling variation in the evaluation rows only)
        boot_indices = rand(1:n_obs, n_obs)
        status_quo_wages = Float64[]
        policy_wages = Float64[]
        for idx in boot_indices[1:100]  # Sample subset for speed
            # Status quo
            compiled(output, base_data, idx)
            push!(status_quo_wages, exp(dot(β, output)))
            # Policy scenario
            compiled(output, policy_data, idx)
            push!(policy_wages, exp(dot(β, output)))
        end
        # Policy effect (proportional change)
        effect = mean(policy_wages) / mean(status_quo_wages) - 1
        push!(policy_effects, effect)
    end
    return policy_effects
end
# Run bootstrap analysis
println("Running bootstrap analysis...")
policy_effects = bootstrap_policy_effect(500)
# Results
mean_effect = mean(policy_effects)
ci_lower = quantile(policy_effects, 0.025)
ci_upper = quantile(policy_effects, 0.975)
println("Policy effect: $(round(mean_effect*100, digits=2))%")
println("95% CI: [$(round(ci_lower*100, digits=2))%, $(round(ci_upper*100, digits=2))%]")Cross-Validation with Real Data
Model validation using automotive performance data:

```julia
using RDatasets, FormulaCompiler, GLM, DataFrames, Tables, Statistics, LinearAlgebra, Random
# Load and prepare mtcars data
mtcars = dataset("datasets", "mtcars")
n_cars = nrow(mtcars)
data = Tables.columntable(mtcars)
function cross_validate_performance(k_folds=5)
    Random.seed!(456)
    fold_indices = rand(1:k_folds, n_cars)
    fold_errors = Float64[]
    output = Vector{Float64}(undef, 0)  # Resized once the first model is compiled
    for fold in 1:k_folds
        # Split data
        train_idx = findall(x -> x != fold, fold_indices)
        test_idx = findall(x -> x == fold, fold_indices)
        # Fit model on training data
        train_data = mtcars[train_idx, :]
        model = lm(@formula(MPG ~ HP + WT + Cyl), train_data)
        # Compile against the full data for test-set evaluation
        compiled = compile_formula(model, data)
        if isempty(output)
            output = Vector{Float64}(undef, length(compiled))
        end
        β = coef(model)
        # Predict on test set
        test_errors = Float64[]
        for test_car in test_idx
            compiled(output, data, test_car)
            predicted = dot(β, output)
            actual = mtcars.MPG[test_car]
            push!(test_errors, (predicted - actual)^2)
        end
        push!(fold_errors, mean(test_errors))
    end
    return sqrt(mean(fold_errors))  # RMSE
end
# Run cross-validation
cv_rmse = cross_validate_performance(5)
println("Cross-validation RMSE: $(round(cv_rmse, digits=2)) MPG")Parallel Scenario Analysis
Distributed policy analysis across multiple cores:

```julia
using Distributed, SharedArrays, RDatasets, FormulaCompiler, GLM, Tables, Statistics, LinearAlgebra
# For demonstration - would typically use addprocs() to add workers
# addprocs(2)
# @everywhere using FormulaCompiler, RDatasets, GLM, Tables
function parallel_policy_analysis()
    wages = dataset("Ecdat", "Bwages")
    wages.Male = wages.Sex .== "male"
    data = Tables.columntable(wages)
    model = lm(@formula(log(Wage) ~ Educ * Male + Exper), wages)
    compiled = compile_formula(model, data)
    # Define policy grid parameters
    educ_levels = [10, 12, 14, 16, 18, 20]  # Education levels
    exper_levels = [0, 5, 10, 15, 20, 25]   # Experience levels
    male_levels = [false, true]             # Gender
    n_rows = length(data.Educ)
    n_scenarios = length(educ_levels) * length(exper_levels) * length(male_levels)  # 6×6×2 = 72
    n_workers = min(100, n_rows)  # Sample size for analysis
    # Results storage
    results = SharedArray{Float64}(n_scenarios)
    # Hoist buffers out of the loop so each evaluation stays allocation-free
    output = Vector{Float64}(undef, length(compiled))
    β = coef(model)
    # Serial loop shown here; see the distributed sketch after this example
    scenario_idx = 1
    for educ_val in educ_levels, exper_val in exper_levels, male_val in male_levels
        # Create scenario data
        scenario_data = merge(data, (
            Educ = fill(educ_val, n_rows),
            Exper = fill(exper_val, n_rows),
            Male = fill(male_val, n_rows),
        ))
        wage_predictions = Float64[]
        for worker in 1:n_workers
            compiled(output, scenario_data, worker)
            push!(wage_predictions, exp(dot(β, output)))
        end
        results[scenario_idx] = mean(wage_predictions)
        scenario_idx += 1
    end
    return results, educ_levels, exper_levels, male_levels  # Results plus grid parameters
end
# Run parallel analysis
println("Running comprehensive policy analysis...")
results, educ_levels, exper_levels, male_levels = parallel_policy_analysis()
# Find optimal policy
best_idx = argmax(results)
best_wage = results[best_idx]
# Decode scenario index to parameters
n_exper = length(exper_levels)
n_male = length(male_levels)
educ_idx = div(best_idx - 1, n_exper * n_male) + 1
exper_idx = div(mod(best_idx - 1, n_exper * n_male), n_male) + 1
male_idx = mod(best_idx - 1, n_male) + 1
println("Optimal policy scenario:")
println(" Education: $(educ_levels[educ_idx]) years")
println(" Experience: $(exper_levels[exper_idx]) years")
println(" Gender: $(male_levels[male_idx] ? "Male" : "Female")")
println(" Mean predicted wage: \$$(round(best_wage, digits=2))")Performance Notes
## Performance Notes

All examples demonstrate FormulaCompiler.jl's key performance characteristics:
- **Zero-allocation core evaluation**: `compiled(output, data, row)` calls allocate zero bytes
- **Memory-efficient scenarios**: the override system uses constant memory regardless of data size
- **Backend selection**: choose between automatic differentiation (`:ad`, higher accuracy) and finite differences (`:fd`, explicit step control) via `derivativeevaluator(:ad/:fd, ...)`; see the verification sketch after this list
- **Scalable patterns**: per-row evaluation cost remains constant regardless of dataset size
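As a quick check of the allocation and backend claims, the minimal sketch below benchmarks one marginal-effects call per backend; it assumes the `de_ad`, `de_fd`, `β`, and `g` objects from the Quick Reference are still in scope.

```julia
using BenchmarkTools

# Each derivative backend should report zero bytes allocated per call
for de in (de_ad, de_fd)
    trial = @benchmark marginal_effects_eta!($g, $de, $β, 1)
    @assert trial.memory == 0 "Expected zero allocations for $(typeof(de))"
end
```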
For detailed performance optimization techniques, see the Performance Guide.
## Further Reading
- Basic Usage - Fundamental patterns and validation techniques
- Scenario Analysis - Comprehensive coverage of the override system
- Advanced Features - Derivative computation and high-performance patterns
- API Reference - Complete function documentation