Section 1.10

Hypothesis Testing Logic

A hypothesis test is a structured way to answer "could this result just be a fluke?" The logic mirrors a courtroom: the null hypothesis is presumed innocent, and we only convict — reject it — when the evidence would be genuinely surprising if it were true.

The four moves

1. State two hypotheses. The null (H₀) is the boring "nothing's going on" claim (no effect, no difference). The alternative (H₁) is what you suspect instead.
2. Assume H₀ is true. This gives you a null distribution: the range of results you'd expect from chance alone if there really were no effect.
3. Measure surprise. Find where your actual result falls on that distribution. The p-value is the probability of getting a result at least this extreme, just by chance, if H₀ were true.
4. Decide. If the p-value is below a pre-set threshold α (usually 0.05), the result is "too surprising to be chance," so reject H₀. Otherwise, fail to reject it.
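
To make the four moves concrete, here is a minimal sketch using a coin-flip example. The data (60 heads in 100 flips), the helper names binomial_null and two_sided_p_value, and the choice of an exact binomial null distribution are all assumptions made for illustration; they are not part of the lesson itself.

```python
from math import comb

def binomial_null(n, p=0.5):
    """Null distribution: P(k heads in n flips) for every k, assuming H0."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def two_sided_p_value(null_probs, observed, expected):
    """Total null probability of outcomes at least as far from
    `expected` as `observed` is, in either direction."""
    distance = abs(observed - expected)
    return sum(prob for k, prob in enumerate(null_probs)
               if abs(k - expected) >= distance)

# Move 1: H0 says the coin is fair (p = 0.5); H1 says it is not.
n_flips, heads, alpha = 100, 60, 0.05  # hypothetical data and threshold

# Move 2: assume H0 and build the null distribution.
null_probs = binomial_null(n_flips, 0.5)

# Move 3: measure surprise.
p_value = two_sided_p_value(null_probs, heads, expected=n_flips * 0.5)

# Move 4: decide.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

With these made-up numbers the p-value should land a little above 0.05, so the test fails to reject H₀ even though 60 heads feels lopsided. Change heads to 61 or more and the decision flips.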

🎮 P-Value Explorer

The bell is the null distribution — what chance alone would produce. Slide your observed result and watch how much of the curve is "this extreme or worse" (the shaded p-value).

Example readout (the selected alternative is not shown): observed statistic z = +1.50, p-value = 0.134, threshold α = 0.05, decision: fail to reject H₀.
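
The explorer's shaded area can be approximated in a few lines. This is a sketch that assumes the bell is a standard normal distribution; the function name p_value_from_z and the three alternative labels are hypothetical, not taken from the widget.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """Area of the standard normal curve that is 'this extreme or worse'."""
    cdf = NormalDist().cdf  # standard normal: mean 0, standard deviation 1
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))  # both tails
    if alternative == "greater":
        return 1 - cdf(z)             # upper tail only
    if alternative == "less":
        return cdf(z)                 # lower tail only
    raise ValueError(f"unknown alternative: {alternative!r}")

for alt in ("two-sided", "greater", "less"):
    print(f"z = +1.50, {alt}: p = {p_value_from_z(1.5, alt):.3f}")
```

Under this assumption the two-sided calculation gives p = 0.134 for z = +1.50, which matches the readout, and the upper-tail alternative gives half of that.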

What a p-value is — and isn't

Slide the statistic out toward the tails and the p-value shrinks: extreme results are unlikely under chance, so they count as evidence against H₀. But the p-value is slippery, and two misreadings are everywhere:

A p-value is NOT the probability that H₀ is true. It's the probability of data at least this extreme, assuming H₀ is true. Those are different questions. p = 0.03 does not mean "3% chance there's no effect."
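
A quick calculation shows how far apart the two questions can be. Every number below is an assumption chosen for illustration: that 90% of the hypotheses being tested are true nulls, and that real effects are detected 80% of the time. The function name is hypothetical too.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """Share of significant results that come from true nulls."""
    false_positives = alpha * prior_null       # true nulls that get rejected
    true_positives = power * (1 - prior_null)  # real effects that get detected
    return false_positives / (false_positives + true_positives)

share = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.8)
print(f"Share of significant results where H0 is true: {share:.2f}")
```

Under these assumptions, about 36% of the "significant" results come from true nulls, even though every one of them cleared α = 0.05. The p-value alone can't tell you that figure, because it depends on how common real effects were to begin with.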

"Fail to reject" is NOT "proven true." A courtroom's "not guilty" doesn't mean "definitely innocent" — just "not enough evidence." Likewise, a non-significant result means we couldn't rule out chance, not that H₀ is correct.

Two ways to be wrong

Type I error (false positive): rejecting H₀ when it's actually true. When H₀ is true and the test's assumptions hold, this happens at rate α: set α = 0.05 and about 5% of tests run on true nulls will wrongly "find" an effect by chance.
Type II error (false negative): failing to detect a real effect. For a fixed sample size and effect size, lowering α (demanding stronger evidence) guards against Type I errors but makes Type II errors more likely. That's a genuine trade-off, explored in effect size & power.
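
The trade-off can be seen in a small simulation. This sketch assumes a two-sided z-test whose statistic is normal with standard deviation 1, and a real effect that shifts the statistic by 2.5; the shift, the trial count, the seed, and the function name rejection_rate are all arbitrary choices for illustration.

```python
import random
from statistics import NormalDist

def rejection_rate(true_shift, alpha, trials=20_000, seed=0):
    """Fraction of simulated two-sided z-tests that reject H0."""
    rng = random.Random(seed)                     # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for |z|
    rejections = sum(abs(rng.gauss(true_shift, 1.0)) > z_crit
                     for _ in range(trials))
    return rejections / trials

results = {}
for alpha in (0.05, 0.01):
    type_1 = rejection_rate(true_shift=0.0, alpha=alpha)      # H0 is true
    type_2 = 1 - rejection_rate(true_shift=2.5, alpha=alpha)  # effect is real
    results[alpha] = (type_1, type_2)
    print(f"alpha = {alpha}: Type I rate ~ {type_1:.3f}, "
          f"Type II rate ~ {type_2:.3f}")
```

The Type I rate should come out close to whichever α you set, while the Type II rate climbs when α drops from 0.05 to 0.01: stricter evidence standards mean more real effects get missed.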

Why it matters: every test you'll meet — t-tests, ANOVA, chi-square, regression — runs this exact playbook. Master the logic once here, and the rest is just swapping in different null distributions.