Population Grouping Framework

Comprehensive hierarchical analysis for stratified marginal effects

Conceptual Foundation

Margins.jl implements a population-based grouping framework that computes average marginal effects (AME) and average adjusted predictions (AAP) within stratified subgroups of the observed data.

Core Design Principles

Population-Based Analysis

All operations maintain population-averaging semantics: effects are computed by averaging across the actual or modified population, not by evaluating at synthetic representative points.
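To make the distinction concrete, the sketch below contrasts an averaged effect with the effect evaluated at one representative point (the sample mean) for a logistic curve. The coefficients, the data, and the helper names `logistic` and `dpdx` are illustrative assumptions and are not part of Margins.jl.

```julia
using Statistics

# Logistic response and its derivative with respect to x (illustrative model only).
logistic(eta) = 1 / (1 + exp(-eta))
dpdx(x, b0, b1) = b1 * logistic(b0 + b1 * x) * (1 - logistic(b0 + b1 * x))

x = [-2.0, 0.0, 2.0]                 # made-up covariate values
b0, b1 = 0.0, 1.0                    # made-up coefficients

ame = mean(dpdx.(x, b0, b1))         # population approach: average the per-row derivatives
at_mean = dpdx(mean(x), b0, b1)      # representative-point approach: derivative at the mean of x
```

For a nonlinear model the two quantities generally differ, which is why the choice of averaging semantics matters.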

Orthogonal Parameters

Three independent dimensions combine multiplicatively:

vars: Which variables to compute marginal effects for
groups: How to stratify the analysis (data structure)
scenarios: What counterfactual scenarios to consider (data modification)

Single Fundamental Operation

All grouping reduces to: stratify data into subgroups, compute population margins within each subgroup.
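The operation can be sketched in plain Julia as "collect row indices per group, then average within each group". The per-observation effects and the helper `stratified_average` are hypothetical stand-ins; this is an illustration of the idea, not the package's implementation.

```julia
using Statistics

# Hypothetical per-observation marginal effects and a grouping column.
unit_effects = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
education    = ["HS", "HS", "College", "College", "College", "HS"]

# Stratify row indices by group label, then average within each stratum.
function stratified_average(effects, labels)
    rows_by_group = Dict{eltype(labels), Vector{Int}}()
    for (i, g) in enumerate(labels)
        push!(get!(rows_by_group, g, Int[]), i)
    end
    return Dict(g => mean(effects[rows]) for (g, rows) in rows_by_group)
end

ame_by_group = stratified_average(unit_effects, education)
```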

Basic Grouping Patterns

Simple Categorical Grouping

Compute effects separately within each category of a grouping variable:

using Margins, DataFrames, GLM

# Effects by education level
education_effects = population_margins(model, data; 
                                     type=:effects, 
                                     groups=:education)

# Results: separate effects for each education category
DataFrame(education_effects)

Cross-Tabulated Grouping

Analyze effects across combinations of multiple categorical variables:

# Effects by education × gender combinations
demographic_effects = population_margins(model, data;
                                        type=:effects,
                                        groups=[:education, :gender])

# Results: effects for (HS,Male), (HS,Female), (College,Male), (College,Female), etc.

Advanced Hierarchical Grouping

Nested Grouping with `=>` Operator

The => operator creates hierarchical nesting where the right side is computed within each level of the left side:

# Region first, then education within each region
nested_effects = population_margins(model, data;
                                  type=:effects,
                                  groups=:region => :education)

# Results: (North,HS), (North,College), (South,HS), (South,College)

Deep Hierarchical Nesting

Multiple levels of nesting support complex organizational structures:

# Three-level hierarchy: country → region → education
deep_hierarchy = population_margins(model, data;
                                  type=:effects,
                                  groups=:country => (:region => :education))

# Four-level hierarchy: sector → company → department → position
organizational = population_margins(model, data;
                                  type=:effects, 
                                  groups=:sector => (:company => (:department => :position)))
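One way to read a nested specification is as an ordered path of grouping columns, outermost first. The helper `flatten_groups` below is a hypothetical illustration of that reading and makes no claim about how Margins.jl represents groups internally. Binning specifications such as `(:income, 4)` are kept whole.

```julia
# Hypothetical helper: turn a nested grouping spec into an ordered list of grouping terms.
flatten_groups(spec::Symbol) = Any[spec]
flatten_groups(spec::Tuple) = Any[spec]   # binning spec such as (:income, 4) stays whole
flatten_groups(spec::AbstractVector) = reduce(vcat, map(flatten_groups, spec); init=Any[])
flatten_groups(spec::Pair) = vcat(flatten_groups(spec.first), flatten_groups(spec.second))

three_level = flatten_groups(:country => (:region => :education))
mixed       = flatten_groups(:region => [:education, (:income, 4)])
```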

Parallel Grouping Within Hierarchy

Complex patterns combining hierarchical and cross-tabulated structures:

# Region first, then education×gender cross-tab within each region
parallel_nested = population_margins(model, data;
                                   type=:effects,
                                   groups=(:region => [:education, :gender]))

# Region first, then education × income-quartile combinations within each region
mixed_parallel = population_margins(model, data;
                                  type=:effects,
                                  groups=(:region => [:education, (:income, 4)]))

Continuous Variable Binning

Quantile-Based Binning

Automatic quantile binning, with bins labeled using standard statistical terms (Q1-Q4 for quartiles, T1-T3 for tertiles, P1-P5 for quintiles):

# Quartile analysis (Q1, Q2, Q3, Q4)
income_quartiles = population_margins(model, data;
                                    type=:effects,
                                    groups=(:income, 4))

# Tertile analysis (T1, T2, T3) 
score_tertiles = population_margins(model, data;
                                  type=:effects,
                                  groups=(:test_score, 3))

# Quintile analysis (P1, P2, P3, P4, P5)
wealth_quintiles = population_margins(model, data;
                                    type=:effects,
                                    groups=(:wealth, 5))
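As a rough illustration of quantile binning, the sketch below assigns each value to a bin using `Statistics.quantile`. The function `quantile_bins` and its tie-handling (a value equal to a cut point stays in the lower bin) are assumptions made for this example; the package's exact boundary rules may differ.

```julia
using Statistics

# Assign each value to one of n quantile bins (1 = lowest).
function quantile_bins(x::AbstractVector{<:Real}, n::Integer)
    # Interior cut points, e.g. the 25th, 50th, and 75th percentiles for n = 4.
    cuts = quantile(x, (1:n-1) ./ n)
    # A value's bin is one plus the number of cut points it strictly exceeds.
    return [1 + count(c -> v > c, cuts) for v in x]
end

income = [10, 20, 30, 40, 50, 60, 70, 80]   # made-up data
bins   = quantile_bins(income, 4)
labels = ["Q$b" for b in bins]
```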

Custom Threshold Binning

Policy-relevant thresholds using mathematical interval notation:

# Income brackets for tax policy analysis
tax_brackets = population_margins(model, data;
                                type=:effects,
                                groups=(:income, [25000, 50000, 75000]))

# Results: ["< 25000", "[25000, 50000)", "[50000, 75000)", ">= 75000"]

# Poverty line analysis
poverty_analysis = population_margins(model, data;
                                    type=:effects,
                                    groups=(:income, [federal_poverty_line]))

# Results: ["< 12880", ">= 12880"] (with federal_poverty_line = 12880, the 2021 federal poverty guideline for a one-person household)
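The interval labels shown above can be reproduced with a few lines of plain Julia. `threshold_labels` and `bin_index` are hypothetical helpers written for this illustration; they assume intervals closed on the left, as the labels suggest.

```julia
# Build interval labels from a vector of thresholds.
function threshold_labels(thresholds::AbstractVector{<:Real})
    t = sort(thresholds)
    labels = ["< $(t[1])"]
    for i in 1:length(t)-1
        push!(labels, "[$(t[i]), $(t[i+1]))")
    end
    push!(labels, ">= $(t[end])")
    return labels
end

# Index of the bin containing v: a value equal to a threshold falls in the upper bin.
bin_index(v, thresholds) = 1 + count(t -> v >= t, thresholds)

brackets = threshold_labels([25000, 50000, 75000])
```

With these helpers, `brackets[bin_index(income_value, [25000, 50000, 75000])]` gives the label for a single observation.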

Mixed Categorical and Continuous Grouping

Combine categorical variables with binned continuous variables:

# Education levels × income quartiles
education_income = population_margins(model, data;
                                    type=:effects,
                                    groups=[:education, (:income, 4)])

# Results: (HS,Q1), (HS,Q2), (HS,Q3), (HS,Q4), (College,Q1), etc.

# Geographic region × age quintiles × gender
complex_demographics = population_margins(model, data;
                                        type=:effects,
                                        groups=[:region, (:age, 5), :gender])

Counterfactual Scenario Analysis

See Population Scenarios for detailed semantics and implementation notes on scenarios in population analysis.

Policy Scenario Framework

The scenarios parameter modifies variable values for the entire population, creating counterfactual analyses:

# Binary treatment analysis
treatment_effects = population_margins(model, data;
                                     type=:effects,
                                     scenarios=(treatment=[0, 1],))

# Multi-level policy scenarios
policy_scenarios = population_margins(model, data;
                                    type=:effects,
                                    scenarios=(policy_level=["none", "moderate", "aggressive"],))

Multi-Variable Scenarios

Cartesian product expansion for complex policy analysis:

# Treatment × policy combinations
comprehensive_policy = population_margins(model, data;
                                        type=:effects,
                                        scenarios=(treatment=[0, 1],
                                                   policy=["current", "reform"]))

# Results: 4 scenarios (2×2 combinations)

# Three-dimensional policy space
complex_scenarios = population_margins(model, data;
                                     type=:effects,
                                     scenarios=(treatment=[0, 1],
                                                funding=[0.8, 1.0, 1.2],
                                                regulation=["light", "standard", "strict"]))

# Results: 18 scenarios (2×3×3 combinations)
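The Cartesian expansion itself can be sketched with `Iterators.product` from Base. The function `expand_scenarios` is a hypothetical helper, not a Margins.jl function; it only shows how a specification with 2, 3, and 3 candidate values yields 18 combinations.

```julia
# Hypothetical scenario specification: each variable maps to its candidate values.
scenario_spec = (treatment = [0, 1],
                 funding = [0.8, 1.0, 1.2],
                 regulation = ["light", "standard", "strict"])

# Expand into every combination; each element is a NamedTuple with one value per variable.
function expand_scenarios(spec::NamedTuple)
    vars = keys(spec)
    combos = vec(collect(Iterators.product(values(spec)...)))
    return [NamedTuple{vars}(combo) for combo in combos]
end

scenarios = expand_scenarios(scenario_spec)
n_scenarios = length(scenarios)
```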

Combined Groups and Scenarios

Comprehensive Policy Analysis

Groups and scenarios combine multiplicatively for complete analytical coverage:

# Demographics × policy scenarios
full_analysis = population_margins(model, data;
                                 type=:effects,
                                 groups=[:education, :region],
                                 scenarios=(treatment=[0, 1],))

# Results: Each education×region combination under both treatment scenarios
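A minimal sketch of how groups and scenarios interact: within each subgroup, the scenario variable is overridden for every row while the other covariates keep their observed values, and the results are averaged. For simplicity it averages predictions rather than effects. The function `predict_outcome` is a made-up stand-in for a fitted model, and the data are invented.

```julia
using Statistics

# Hypothetical prediction function standing in for a fitted model.
predict_outcome(treatment, age) = 1.0 + 2.0 * treatment + 0.1 * age

age       = [20.0, 30.0, 40.0, 50.0]
education = ["HS", "HS", "College", "College"]

results = Dict{Tuple{String, Int}, Float64}()
for g in unique(education), t in (0, 1)
    rows = findall(==(g), education)
    # Override treatment for everyone in the subgroup; age keeps its observed values.
    results[(g, t)] = mean(predict_outcome.(t, age[rows]))
end
```

Two groups and two scenario values produce four entries, one per (group, scenario) pair.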

Advanced Applications

# Healthcare policy evaluation
healthcare_comprehensive = population_margins(health_model, health_data;
    type=:effects,
    groups=(:state => (:urban_rural => [:insurance_type, (:income, 3)])),
    scenarios=(aca_expansion=[0, 1], medicaid_funding=[0.8, 1.2])
)

# Results: State × Urban/Rural × (Insurance×Income-Tertiles) × ACA×Medicaid scenarios
# Total combinations: 4 states × 2 urban/rural × 12 insurance×income × 4 policy scenarios = 384 results
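Before running an analysis of this size, it can help to estimate the number of result cells. The sketch below reproduces the arithmetic from the comment above; the level counts are assumptions (for instance, 12 is taken to be 4 insurance types × 3 income tertiles), and real data may contain fewer observed combinations.

```julia
# Assumed level counts, taken from the comment above.
group_levels    = [4, 2, 4 * 3]   # states, urban/rural, insurance types × income tertiles
scenario_levels = [2, 2]          # aca_expansion values, medicaid_funding values

n_group_cells = prod(group_levels)
n_scenarios   = prod(scenario_levels)
n_results     = n_group_cells * n_scenarios
```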

Important: Skip Rule for Statistical Validity

Critical Constraint: For population analysis, computing the effect of a variable while simultaneously holding it fixed (via scenarios) or using it to define subgroups (via groups) is contradictory and statistically meaningless.

The Skip Rule

To preserve statistical correctness and interpretability, population_margins() automatically skips variables that appear in vars if they also appear in groups or scenarios.

# Example: x appears in both vars and scenarios
result = population_margins(model, data;
    type=:effects,
    vars=[:x, :z],           # Request effects for x and z
    scenarios=(x=[0, 1],)    # But fix x at specific values
)
# Result: Only z effect is computed. x is skipped because it's in scenarios.
# The package silently handles this to avoid statistical errors.
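Because the rule is applied silently, it can be useful to check a request before running it. The helper `apply_skip_rule` below is hypothetical and only mirrors the rule as described here: a requested variable is dropped if it is also a grouping column or a scenario variable. It assumes grouping columns are supplied as a flat vector of symbols.

```julia
# Hypothetical pre-check mirroring the skip rule described above.
function apply_skip_rule(vars::Vector{Symbol}, group_vars::Vector{Symbol}, scenarios::NamedTuple)
    blocked = union(group_vars, collect(keys(scenarios)))
    kept    = setdiff(vars, blocked)     # variables whose effects would be computed
    skipped = intersect(vars, blocked)   # variables that would be dropped
    return (kept = kept, skipped = skipped)
end

by_scenario = apply_skip_rule([:x, :z], Symbol[], (x = [0, 1],))
by_group    = apply_skip_rule([:education, :z], [:education], (treatment = [0, 1],))
```

A non-empty `skipped` field signals that part of the request would be ignored.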

Why This Rule Exists

Conceptual Problem:

Marginal effect asks: "What happens when x changes naturally?"
Scenario/Group says: "Hold x fixed at specific values" or "Stratify by x levels"
These two concepts are mutually exclusive

Examples of Invalid Requests:

# INVALID: "What's the effect of income while holding income fixed?"
population_margins(model, data;
    vars=[:income],            # Effect of income changing
    scenarios=(income=[30000, 50000],)  # But income is fixed
)
# → income is skipped from vars

# INVALID: "What's the effect of education within education groups?"
population_margins(model, data;
    vars=[:education],         # Effect of education changing
    groups=:education          # But stratified by education levels
)
# → education is skipped from vars

Practical Alternatives

Alternative 1: Profile Analysis (for Stata users)

If you want Stata-style dydx(x) over(x) (derivative of x at different values of x), use profile analysis:

# Instead of: population_margins(model, data; vars=[:x], groups=:x)  # INVALID
# Use profile margins:
result = profile_margins(model, data, cartesian_grid(x=[10, 20, 30, 40]);
    type=:effects,
    vars=[:x]
)
# Computes marginal effect of x AT each specific value of x

Alternative 2: Effects Within Strata

If you want effects within strata of x, group by a different variable or compute effects of other variables:

# GOOD: Effects of z within education groups
result = population_margins(model, data;
    type=:effects,
    vars=[:z],              # Effect of z (not education)
    groups=:education       # Stratified by education
)

# GOOD: Effects within income quintiles
result = population_margins(model, data;
    type=:effects,
    vars=[:treatment],      # Effect of treatment (not income)
    groups=(:income, 5)     # Within income quintiles
)

Alternative 3: Counterfactual Predictions

If you want to see how outcomes change as x varies, use predictions with scenarios:

# Instead of: population_margins(model, data; vars=[:x], scenarios=(x=[...],))
# Use predictions:
result = population_margins(model, data;
    type=:predictions,         # Not effects!
    scenarios=(x=[10, 20, 30, 40],)
)
# Shows predicted outcomes at each value of x

User Notification

Current Behavior: The skip rule operates silently - variables are removed from computation without warning.

How to Check: Compare requested vars against result:

result = population_margins(model, data;
    vars=[:x, :z],
    scenarios=(x=[0, 1],)
)
df = DataFrame(result)
unique(df.variable)  # Will only show "z" (x was skipped)

Performance

Computational Complexity

Population grouping scales linearly in the number of observations: each observation falls in exactly one subgroup, so k subgroups of about n/k rows each cost O(n) in total:

using BenchmarkTools

# Simple grouping: O(n/k) per group for k groups
@btime population_margins($model, $data; groups=:education)

# Complex hierarchical grouping: O(n/k) per final subgroup
@btime population_margins($model, $data; groups=(:region => (:education => :gender)))

# With scenarios: same O(n/k) complexity repeated for each scenario
@btime population_margins($model, $data; groups=:education, scenarios=(treatment=[0, 1],))

Memory Efficiency

The grouping framework avoids data duplication through efficient indexing:

Subgroup filtering: Uses DataFrame indexing, not data copying
Scenario modification: Temporary overrides without permanent data changes
Result aggregation: Minimal memory footprint for result compilation
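The index-based idea can be illustrated with Base Julia views, which read through to the original array instead of copying it. This is a generic sketch with invented data and does not describe how Margins.jl is implemented.

```julia
income = [10.0, 20.0, 30.0, 40.0, 50.0]
region = ["North", "South", "North", "South", "North"]

north_rows   = findall(==("North"), region)   # row indices for one subgroup
north_income = view(income, north_rows)       # a view onto `income`; no data is copied

shares_memory = parent(north_income) === income
```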

Large Dataset Considerations

# For datasets >100k observations with many groups
# Consider selective analysis of key variables
key_analysis = population_margins(model, large_data;
                                type=:effects,
                                vars=[:primary_outcome],  # Limit variables
                                groups=(:income, 4))      # Manageable grouping

# Complex patterns still feasible for large n
complex_large = population_margins(model, large_data;
                                 type=:effects,
                                 groups=(:region => [:education, (:income, 4)]))

Best Practices

When to Use Different Grouping Patterns

Simple Grouping (groups=:var):

Single dimension analysis
Clear categorical divisions
Straightforward interpretation needs

Cross-Tabulation (groups=[:var1, :var2]):

Interaction effects important
Policy targets multiple demographics simultaneously
Comprehensive coverage needed

Hierarchical Grouping (groups=:var1 => :var2):

Natural organizational structure exists
Context matters (e.g., regions have different education systems)
Nested decision-making processes

Continuous Binning (groups=(:var, n) or groups=(:var, [thresholds])):

Policy-relevant thresholds exist
Distribution-based analysis needed
Quantile-based interpretation valuable

Avoiding Common Pitfalls

Combination Explosion

# Dangerous: could create 1000s of combinations
# groups=[:var1, :var2, :var3, (:var4, 10), (:var5, 5)]

# Better: use hierarchical structure
groups=:var1 => [:var2, (:var4, 4)]

Empty Subgroups

# The framework automatically detects and errors on empty subgroups
# to maintain statistical validity

Skip Rule Reference

See the dedicated section "Important: Skip Rule for Statistical Validity" above for complete documentation on how population_margins() handles variables that appear in both vars and groups/scenarios.

Interpretation Complexity

# For presentation, consider simpler patterns:
presentation_analysis = population_margins(model, data;
                                         groups=:education,
                                         scenarios=(policy=[0, 1],))

# For comprehensive analysis, use full complexity:
research_analysis = population_margins(model, data;
                                     groups=(:region => [:education, (:income, 4)]),
                                     scenarios=(policy=[0, 1], funding=[0.8, 1.2]))

The population grouping framework enables sophisticated econometric analysis while maintaining computational efficiency and statistical rigor. For related details on scenarios and reference grids, see Reference Grids; for performance optimization, see Performance Guide.