Hypothesis Testing Logic
A hypothesis test is a structured way to answer "could this result just be a fluke?" The logic mirrors a courtroom: the null hypothesis is presumed innocent, and we only convict — reject it — when the evidence would be genuinely surprising if it were true.
The four moves
- State two hypotheses. The null (H₀) is the boring "nothing's going on" claim (no effect, no difference). The alternative (H₁) is what you suspect instead.
- Assume H₀ is true. This gives you a null distribution — the range of results you'd expect from chance alone if there really were no effect.
- Measure surprise. Find where your actual result falls on that distribution. The p-value is the probability of getting a result this extreme or more, just by chance, if H₀ were true.
- Decide. If the p-value is below a pre-set threshold α (usually 0.05), the result is "too surprising to be chance" — reject H₀. Otherwise, fail to reject it.
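The four moves above can be sketched in a few lines of Python. This is a minimal illustration, not a general-purpose test: the coin-flip scenario (60 heads in 100 flips), the function name `binomial_two_sided_p`, and the variable names are all assumptions made for the example. It uses an exact binomial null distribution and counts every outcome at least as far from the expected count as the observed one, which is a reasonable two-sided definition when the null probability is 0.5.

```python
from math import comb

def binomial_two_sided_p(heads: int, flips: int, p_null: float = 0.5) -> float:
    """Exact two-sided p-value under the null: total probability of every
    outcome at least as far from the expected count as the observed one."""
    expected = flips * p_null
    observed_distance = abs(heads - expected)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - expected) >= observed_distance:
            # Probability of exactly k heads if H0 (a fair coin) were true.
            total += comb(flips, k) * p_null**k * (1 - p_null) ** (flips - k)
    return total

# Move 1: H0 says the coin is fair (p = 0.5); H1 says it is not.
# Move 2: assuming H0, the number of heads follows Binomial(100, 0.5).
alpha = 0.05                 # threshold chosen before looking at the data
heads, flips = 60, 100       # hypothetical observed result

# Move 3: measure surprise.
p_value = binomial_two_sided_p(heads, flips)

# Move 4: decide.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

With these made-up numbers the p-value comes out at roughly 0.057, just above α = 0.05, so the sketch fails to reject H₀ even though 60 heads looks lopsided.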
🎮 P-Value Explorer
The bell is the null distribution — what chance alone would produce. Slide your observed result and watch how much of the curve is "this extreme or worse" (the shaded p-value).
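For readers without the interactive widget, the same idea can be approximated in code. The sketch below assumes the null distribution is a standard normal curve and that "this extreme or worse" means both tails; the function name `two_sided_p` is hypothetical. It relies on the identity that the two-tailed area beyond |z| equals `erfc(|z| / sqrt(2))`.

```python
from math import erfc, sqrt

def two_sided_p(z: float) -> float:
    """Area in both tails of a standard normal curve beyond |z|."""
    return erfc(abs(z) / sqrt(2))

# Slide the observed statistic outward and watch the shaded area shrink.
for z in (0.0, 1.0, 1.96, 3.0):
    print(f"z = {z:4.2f}  ->  p = {two_sided_p(z):.4f}")
```

At z = 0 the whole curve counts as "this extreme or worse" (p = 1), at z = 1.96 the shaded area is about 0.05, and by z = 3 it has dropped to about 0.003.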
What a p-value is — and isn't
Slide the statistic out toward the tails and the p-value shrinks: extreme results are unlikely under chance, so they count as evidence against H₀. But the p-value is slippery, and two misreadings are everywhere:
A p-value is NOT the probability that H₀ is true. It's the probability of data this extreme or more extreme, assuming H₀ is true. Those are different questions. p = 0.03 does not mean "3% chance there's no effect."
"Fail to reject" is NOT "proven true." A courtroom's "not guilty" doesn't mean "definitely innocent" — just "not enough evidence." Likewise, a non-significant result means we couldn't rule out chance, not that H₀ is correct.
Two ways to be wrong
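- Type I error (false positive): rejecting H₀ when it's actually true. Its rate is capped at α: set α = 0.05 and, in cases where H₀ really is true, you'll wrongly "find" an effect about 5% of the time by chance (exactly 5% for continuous test statistics, somewhat less for discrete ones).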
- Type II error (false negative): failing to detect a real effect. Lowering α (demanding stronger evidence) guards against Type I errors but, at a fixed sample size, makes Type II errors more likely. This is a genuine trade-off, explored in effect size & power.
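A small simulation makes both error rates concrete. This is a sketch under stated assumptions, not a recipe: it uses a two-sided z-test with a known standard deviation of 1, samples of 25 observations, and a hypothetical true effect of 0.5 for the "real effect" scenario. The function names `simulate_z_stats` and `rejection_rate` are invented for the example.

```python
import random
from statistics import NormalDist, fmean

def simulate_z_stats(true_mean, n=25, sigma=1.0, trials=4000, seed=0):
    """Run many experiments; return each one's z-statistic for testing H0: mean = 0."""
    rng = random.Random(seed)
    standard_error = sigma / n ** 0.5
    return [
        fmean(rng.gauss(true_mean, sigma) for _ in range(n)) / standard_error
        for _ in range(trials)
    ]

def rejection_rate(z_stats, alpha):
    """Fraction of experiments whose two-sided z-test rejects H0 at level alpha."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return sum(abs(z) > critical for z in z_stats) / len(z_stats)

null_world = simulate_z_stats(true_mean=0.0, seed=1)    # H0 is true
effect_world = simulate_z_stats(true_mean=0.5, seed=2)  # a real effect exists

results = {}
for alpha in (0.05, 0.01):
    type_1 = rejection_rate(null_world, alpha)          # false positives
    type_2 = 1 - rejection_rate(effect_world, alpha)    # missed real effects
    results[alpha] = (type_1, type_2)
    print(f"alpha = {alpha:.2f}: Type I rate = {type_1:.3f}, Type II rate = {type_2:.3f}")
```

In the world where H₀ is true, the Type I rate should land close to whichever α was chosen. In the world with a real effect, tightening α from 0.05 to 0.01 reuses the same simulated experiments but rejects fewer of them, so the Type II rate goes up.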
Why it matters: every test you'll meet — t-tests, ANOVA, chi-square, regression — runs this exact playbook. Master the logic once here, and the rest is just swapping in different null distributions.