Chapter 7. Inductive Arguments and Statistics
§4 Statistical Pitfalls
Simpson’s Paradox and Data Manipulation
Statistics are often presented as “hard facts,” yet the way data is aggregated, sliced, and presented can fundamentally change the conclusion a “Reasonable Person” might draw. One of the most counter-intuitive and academically significant phenomena in this field is Simpson’s Paradox.
4.1 Defining Simpson’s Paradox
Simpson’s Paradox occurs when a trend that appears in each of several groups of data disappears or reverses when the groups are combined. It shows that an association observed within subgroups can be misrepresented in the aggregate when a “lurking variable” (a confounding factor) is distributed unevenly across the groups being compared.
- The Mathematical Root: The paradox arises from the way weighted averages work. An aggregate rate is an average of the subgroup rates weighted by subgroup size, so when the categories being compared are spread unevenly across the subgroups, the larger subgroups dominate each aggregate and can mask or reverse the pattern seen within every subgroup.
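The weighted-average effect can be made concrete with a small sketch. The counts below are invented purely for illustration (they come from no real study), and the names `data`, `rate`, and `pooled_rate` are hypothetical. Option A does better than option B inside each group, yet B looks better once the groups are pooled, because most of A's trials fall in the group where success is rare.

```python
# Hypothetical counts, invented for illustration only.
# Layout: group -> option -> (successes, trials)
data = {
    "easy_cases": {"A": (9, 10), "B": (80, 100)},
    "hard_cases": {"A": (30, 100), "B": (2, 10)},
}


def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials


def pooled_rate(option):
    """Success rate for one option after combining all groups."""
    successes = sum(data[group][option][0] for group in data)
    trials = sum(data[group][option][1] for group in data)
    return rate(successes, trials)


# Within each group, compare the two options.
for group, options in data.items():
    a = rate(*options["A"])
    b = rate(*options["B"])
    print(f"{group}: A = {a:.0%}, B = {b:.0%}")

# After pooling, the ordering reverses.
print(f"pooled: A = {pooled_rate('A'):.1%}, B = {pooled_rate('B'):.1%}")
```

A is tried mostly on hard cases and B mostly on easy ones, so each pooled rate is pulled toward a different subgroup. The reversal comes from that imbalance rather than from any change in how well either option performs.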
4.2 The Classic Case: UC Berkeley (1973)
The most famous academic example of Simpson’s Paradox occurred in an investigation into gender bias in graduate admissions at the University of California, Berkeley.
- The Global Data: The aggregate data showed that men were admitted at a significantly higher rate than women, which suggested systemic gender discrimination.
- The Local Data: When researchers examined individual departments, the pattern did not hold. Most departments showed no significant disadvantage for women, and in many of them women actually had a higher admission rate than men.
- The “Lurking Variable”: The paradox was explained by the choice of department. Women tended to apply to highly competitive departments with low overall admission rates (like English), while men tended to apply to less competitive departments with higher admission rates (like Engineering).
- The Conclusion: The aggregate gap did not reflect biased decision-making within the departments; it reflected a difference in application patterns.
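One way to check whether application patterns alone can produce such a gap is to recompute each group's admission rate as if both groups had applied to departments in the same proportions (a technique usually called direct standardization). The sketch below uses two hypothetical departments, X and Y, with invented counts; these are not the Berkeley figures, and the names `applications`, `raw_rate`, and `standardized_rate` are assumptions made for illustration.

```python
# Hypothetical counts, NOT the actual Berkeley data.
# Layout: group -> department -> (applicants, admitted)
applications = {
    "men": {"X": (800, 480), "Y": (200, 40)},
    "women": {"X": (200, 130), "Y": (800, 176)},
}
departments = ["X", "Y"]


def raw_rate(group):
    """Aggregate admission rate, ignoring department."""
    applied = sum(applications[group][d][0] for d in departments)
    admitted = sum(applications[group][d][1] for d in departments)
    return admitted / applied


# Share of all applicants (both groups) going to each department.
total_by_dept = {
    d: sum(applications[g][d][0] for g in applications) for d in departments
}
grand_total = sum(total_by_dept.values())


def standardized_rate(group):
    """Admission rate if this group had applied in the shared proportions."""
    return sum(
        (total_by_dept[d] / grand_total)
        * (applications[group][d][1] / applications[group][d][0])
        for d in departments
    )


for group in applications:
    print(
        f"{group}: raw = {raw_rate(group):.1%}, "
        f"standardized = {standardized_rate(group):.1%}"
    )
```

In this invented example the raw rates favor men, while the standardized rates slightly favor women. The aggregate gap therefore comes entirely from where each group applied.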
4.3 Other Statistical Pitfalls
Beyond Simpson’s Paradox, critical thinkers must be wary of several common ways statistics are used to mislead:
A. The “Average” Trap
The term “average” is often used ambiguously. When someone says “the average salary,” they could mean:
- Mean: The sum of all values divided by the number of values (heavily influenced by outliers/extreme wealth).
- Median: The middle value in a list (better represents the “typical” person).
- Mode: The most frequent value.
A critical thinker always asks: “Which average are you using, and why?”
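The three averages can diverge sharply on the same data. The sketch below uses Python's standard `statistics` module on an invented list of seven salaries, one of which is an extreme outlier. The figures are hypothetical and chosen only to make the contrast visible.

```python
import statistics

# Hypothetical salaries: six modest earners and one extreme outlier.
salaries = [30_000, 32_000, 32_000, 35_000, 38_000, 40_000, 500_000]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted list
mode_salary = statistics.mode(salaries)      # most frequent value

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
print(f"mode:   {mode_salary:,.0f}")
```

Here the mean is 101,000, although six of the seven people earn 40,000 or less. The median (35,000) and the mode (32,000) are much closer to what a typical person in the list earns.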
B. Truncated Graphs (Visual Deception)
In media and advertising, graphs are often manipulated to exaggerate small differences. By “truncating” the y-axis (for example, starting the scale at 40% instead of 0%), a difference of one percentage point can be made to look like a massive gap.
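The size of the visual distortion can be estimated without drawing anything: compare the heights of two bars as they would appear for a given axis starting point. The sketch below assumes two hypothetical values, 45% and 46%, and the function name `drawn_height_ratio` is invented for illustration.

```python
def drawn_height_ratio(smaller, larger, axis_start):
    """Ratio of the drawn bar heights when the y-axis begins at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis_start must lie below both values")
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical values that differ by one percentage point.
value_a, value_b = 45.0, 46.0

honest = drawn_height_ratio(value_a, value_b, axis_start=0.0)
truncated = drawn_height_ratio(value_a, value_b, axis_start=40.0)

print(f"axis from 0%:  taller bar is {honest:.2f}x the shorter one")
print(f"axis from 40%: taller bar is {truncated:.2f}x the shorter one")
```

With the axis starting at zero, the taller bar is about 2% taller than the shorter one. With the axis starting at 40%, it is drawn 20% taller, although the underlying numbers are unchanged.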
C. Correlation vs. Causation
As discussed under the Post Hoc fallacy (Chapter 4), statistics can show that two things occur together (correlation) without proving that one causes the other. Philosophers of science use Mill’s Methods to test whether a “third variable” is actually responsible for both events.
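A small simulation can show how a third variable produces a correlation between two quantities that have no direct link. This is a sketch of statistical control, not an implementation of Mill’s Methods. The scenario is hypothetical: temperature drives both ice cream sales and drownings, all coefficients are invented, and `pearson` and `residuals` are helper names introduced here.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 500

# Hypothetical model: temperature influences both outcomes independently.
temperature = [rng.uniform(0, 30) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What remains of ys after subtracting a least-squares line fit on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


raw = pearson(ice_cream, drownings)
controlled = pearson(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)

print(f"raw correlation:                  {raw:.2f}")
print(f"after controlling for temperature: {controlled:.2f}")
```

The raw correlation is strong because both series rise with temperature. Once the part of each series explained by temperature is removed, the remaining correlation is close to zero, which is what one would expect if neither variable affects the other.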
§4 Summary Table: Navigating Data Deception
| Pitfall | Description | Critical Thinking Check |
| --- | --- | --- |
| Simpson’s Paradox | Trends reverse when data is aggregated. | “Does this trend hold true when we look at individual sub-groups?” |
| Mean vs. Median | Using the “Mean” to hide inequality. | “Are there extreme outliers skewing this average?” |
| Truncated Graphs | Exaggerating differences visually. | “Check the Y-axis: Does it start at zero?” |
| Spurious Correlation | Meaningless statistical coincidences. | “Is there a plausible causal mechanism connecting these two?” |