Chapter 3: Describe and infer data
Note
The operations we practiced are part of a standard industry pattern called Split-Apply-Combine. It tackles complex data problems by decomposing them: split the data into groups, apply a computation to each group, and combine the results.
In the tidyverse, this strategy is mapped to specific, highly optimized functions:
- group_by(): The Split operator. It partitions the dataset into logical subsets.
- summarise(): The Apply & Combine operator. It reduces many rows into a single row of statistical indicators per group.
- across(): The Scaling operator. It ensures that the exact same statistical treatment is applied consistently across multiple columns.

library(dplyr)
# Standard descriptive pipeline
data %>%
  group_by(category) %>%
  summarise(
    n = n(),
    across(
      c(income, age),
      list(avg = ~mean(.x, na.rm = TRUE), med = ~median(.x, na.rm = TRUE))
    )
  )

Why this matters for statisticians?
Standardization is the bedrock of valid statistical reporting.
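As a minimal sketch of what this standardization looks like in practice, the snippet below runs the descriptive pipeline on a small made-up table. The data frame `toy` and its values are hypothetical; only the column names follow the pipeline above.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the descriptive pipeline
toy <- tibble(
  category = c("A", "A", "B", "B", "B"),
  income   = c(100, 300, 200, NA, 400),
  age      = c(20, 40, 30, 50, 70)
)

res <- toy %>%
  group_by(category) %>%          # Split
  summarise(                      # Apply & Combine
    n = n(),
    across(                       # same treatment for every listed column
      c(income, age),
      list(avg = ~mean(.x, na.rm = TRUE), med = ~median(.x, na.rm = TRUE))
    )
  )

res
names(res)  # across() names its outputs <column>_<function>
```

Because across() applies one list of functions to every selected column, each indicator is computed the same way everywhere, which is the consistency the sentence above refers to.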
summarise()
The summarise() function is the primary tool for data reduction. It transforms a detailed table into a concise summary of statistical indicators.
Combine it with group_by() to provide comparative insights across categories.

| Category | R Functions |
|---|---|
| Central Tendency | mean(), median() |
| Dispersion | sd(), var(), IQR(), min(), max() |
| Position | quantile(x, probs = 0.75), first(), last() |
| Count | n() (observations), n_distinct() (unique values) |
The power of summarise() lies in its ability to combine multiple estimators within a single reduction step. You can use standard R functions, library-specific tools, or even user-defined functions.
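A short sketch of mixing estimators in one reduction step, assuming a made-up table `toy` and a hypothetical user-defined function `cv()` (coefficient of variation). Neither comes from the course data.

```r
library(dplyr)

# Hypothetical user-defined estimator: coefficient of variation (sd / mean)
cv <- function(x, na.rm = TRUE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

toy <- tibble(
  group  = c("A", "A", "A", "B", "B", "B"),
  income = c(10, 20, 30, 100, 100, 100)
)

res <- toy %>%
  group_by(group) %>%
  summarise(
    n         = n(),                              # count
    avg       = mean(income),                     # central tendency
    q75       = quantile(income, probs = 0.75),   # position
    cv_income = cv(income)                        # user-defined function
  )

res
```

Any function that returns a single value per group can be used this way.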
| Statistic | R Function | Use Case |
|---|---|---|
| Mean | mean() | Central tendency (sensitive to outliers) |
| Median | median() | Robust central tendency |
| Standard Deviation | sd() | Dispersion/Spread |
| Extrema | min(), max() | Range of the distribution |
| Position | first(), last() | Boundary observations |
| Count | n() | Sample size (\(n\)) |
| Uniqueness | n_distinct() | Cardinality check |
| Summation | sum(), cumsum() | Total and cumulative flows (cumsum() returns one value per row, so it belongs in mutate() rather than summarise()) |
library(dplyr)

# Summary of the entire table
data %>%
  summarise(
    avg_income = mean(income, na.rm = TRUE),
    total_obs = n()
  )

# Strategic aggregation by group
data %>%
  group_by(education_level) %>%
  summarise(
    n = n(),
    med_age = median(age, na.rm = TRUE),
    max_seniority = max(seniority, na.rm = TRUE)
  )

Tip: Missing Values
In R, most statistical functions return NA if a single missing value is present. Always include na.rm = TRUE to ensure your summary is computed on the available data.
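A minimal base R illustration of this behaviour, using a made-up vector:

```r
x <- c(12, NA, 18)

na_result <- mean(x)                 # NA: one missing value propagates
ok_result <- mean(x, na.rm = TRUE)   # 15: computed on the two available values
n_used    <- sum(!is.na(x))          # 2: number of values actually used

na_result
ok_result
n_used
```

Note that n() counts all rows, including those with missing values, so reporting the number of non-missing values alongside the estimate keeps the summary honest.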
mutate() vs summarise()
Both functions create new variables, but they operate on different “dimensions” of your data.
| Function | Action | Resulting row count |
|---|---|---|
| mutate() | Transformation | Remains unchanged (\(n = N\)). |
| summarise() | Aggregation | Reduced to one row per group (\(n = G\)). |
A simple rule of thumb:
- Use mutate() if you want to keep the individual granularity.
- Use summarise() if you want to move to a higher level of analysis.
Statisticians frequently use group_by() %>% mutate() to perform standardization or relative positioning without losing the individual granularity.
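The sketch below contrasts the two behaviours on a made-up table. The data frame `toy` and the derived column names (`z_income`, `share`) are hypothetical.

```r
library(dplyr)

toy <- tibble(
  region = c("North", "North", "South", "South"),
  income = c(100, 300, 200, 600)
)

# summarise(): one row per group, individual rows are dropped
by_region <- toy %>%
  group_by(region) %>%
  summarise(mean_income = mean(income))

# mutate(): every row is kept, group statistics are broadcast to each row
standardised <- toy %>%
  group_by(region) %>%
  mutate(
    mean_income = mean(income),
    z_income    = (income - mean(income)) / sd(income),  # within-group z-score
    share       = income / sum(income)                   # relative position
  ) %>%
  ungroup()

nrow(by_region)     # 2 rows: one per region
nrow(standardised)  # 4 rows: same as the input
```

Calling ungroup() at the end avoids carrying the grouping into later steps by accident.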
- summarise(): You calculate the mean of the group. The individual rows are gone.
- mutate(): You calculate the mean of the group and broadcast it back to every individual in that group.

Standard functions (mean(), sd()) assume equal probability. On weighted data, they produce biased results.

Hmisc
The Hmisc package provides a simple toolkit for calculating standard weighted indicators (mean, variance, quantile, rank).
- dplyr: These functions fit naturally within a summarise() pipeline.
- wtd. prefix: Function names start with wtd. (weighted).
- weights argument: Unlike standard functions, you must explicitly pass your weight variable.

Hmisc
With with(): ideal for quick, one-off calculations on a specific variable.
library(Hmisc)

# Basic weighted statistics (t is the data frame, y the variable, p the weight)
with(t, wtd.mean(y, weights = p))                                       # Mean
with(t, sqrt(wtd.var(y, weights = p)))                                  # Standard deviation (square root of the weighted variance)
with(t, wtd.quantile(y, weights = p, probs = 0.5, type = 'quantile'))   # Median

Best for producing comparative tables across different population groups.
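For comparative tables by group, the same Hmisc functions can be called inside summarise(). This is a sketch on made-up data: the table `toy` and its columns `region`, `y` and `p` (the weight) are hypothetical. The Hmisc:: prefix is used instead of library(Hmisc) because Hmisc defines its own summarize(), which would otherwise mask the dplyr function of the same name.

```r
library(dplyr)

toy <- tibble(
  region = c("North", "North", "South", "South"),
  y      = c(10, 20, 30, 50),
  p      = c(1, 3, 2, 2)    # sampling weights (hypothetical)
)

res <- toy %>%
  group_by(region) %>%
  summarise(
    n_sample = n(),                                               # rows in the sample
    pop_size = sum(p),                                            # sum of weights
    mean_unw = mean(y),                                           # ignores the weights
    mean_w   = Hmisc::wtd.mean(y, weights = p),                   # weighted mean
    median_w = Hmisc::wtd.quantile(y, weights = p, probs = 0.5)   # weighted median
  )

res
```

In the North group the weighted mean (17.5) differs from the unweighted one (15) because the second observation carries three times the weight of the first.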
Any estimate based on survey data is surrounded by uncertainty. Sampling theory provides the mathematical framework to quantify this margin of error.
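The survey package is the industry standard in R. It handles both simple and complex sampling plans.

In practice, calculating the true variance can be difficult due to data availability: the file may provide the sampling design variables (such as strata), or only replicate weights.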
Why use the survey package?
It is specifically designed to handle both approaches. Whether you have the stratification variables or replicate weights, survey ensures your standard errors and confidence intervals are scientifically valid.
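A minimal sketch of the design-based approach, using the `api` example data shipped with the survey package rather than any dataset from this course. In `apistrat`, `stype` is the stratum, `pw` the sampling weight and `fpc` the stratum population size.

```r
library(survey)

# Example data shipped with the survey package (California school scores)
data(api)

# Declare the stratified design once; estimators then use it automatically
design <- svydesign(
  ids     = ~1,       # no clustering
  strata  = ~stype,   # stratification variable
  weights = ~pw,      # sampling weights
  fpc     = ~fpc,     # finite population correction
  data    = apistrat
)

est <- svymean(~api00, design)  # weighted mean with a design-based standard error
ci  <- confint(est)             # 95% confidence interval by default

est
ci
```

The point estimate equals the ordinary weighted mean. What the design declaration adds is a standard error that reflects the stratification.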
In a professional or national statistical context, estimating variance must reflect the entire data processing chain, not just the initial sampling.
General tools often overlook two critical steps that significantly impact the final precision of indicators:
gustave
To address these specificities, Insee (the French National Statistical Institute) developed the gustave package. It is now the internal standard for official French statistics.
Its main function is qvar(), which allows for variance estimation tailored to the sophisticated treatments applied to national surveys.

Survey estimation is primarily about using weights correctly to reflect the national population. For most descriptive tasks, the Hmisc package is more than enough.
However, if you need to calculate precision (e.g., to build confidence intervals for a report on poverty or health), you face two practical scenarios:
Scenario 1: Using replicate weights. If your dataset includes replicate weights (which summarize the entire sampling process), the survey package is your best tool. This is common in many international datasets.
Scenario 2: Analytical variance estimation. If you need to estimate variance analytically based on the specific sampling design (as is the standard for Insee surveys in France or complex national surveys), the gustave package is the most suitable choice.
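A hedged sketch of Scenario 1, again on the `api` example data from the survey package. Since that file does not ship replicate weights, they are derived here from the declared design with as.svrepdesign(). When a dataset already contains replicate weight columns, svrepdesign() would be used instead, with arguments that depend on how the file names and constructs them. No gustave example is given because its inputs depend on the survey's own design files.

```r
library(survey)

data(api)

# Start from the stratified design, then derive jackknife replicate weights
design <- svydesign(
  ids = ~1, strata = ~stype, weights = ~pw, fpc = ~fpc, data = apistrat
)
rep_design <- as.svrepdesign(design)

# Same estimator call; the variance now comes from the replicates
est_rep <- svymean(~api00, rep_design)

est_rep
confint(est_rep)
```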
- Use Hmisc for quick weighted means and medians.
- For precision estimates, use survey or gustave.

Introduction to quantitative methods with , Lesotho Bureau of Statistics