Section 1.13

Effect Size & Power

A p-value answers one narrow question: how surprising would data like yours be if there were no effect at all? It says nothing about how big the effect is, or whether your study was even capable of finding it. Those two questions, answered by effect size and power, are what separate a meaningful result from a misleading one.

Effect size: how big, not just whether

With a huge sample, a trivially small difference can be "statistically significant." With a tiny sample, a huge difference can miss significance. Significance is tangled up with sample size — so we need a measure of the effect that isn't. That's effect size. For a difference between two means, the standard one is Cohen's d: the gap between the means measured in standard deviations.

d = (mean₁ − mean₂) / pooled standard deviation

Rough conventions: d ≈ 0.2 is small, 0.5 medium, 0.8 large. It's the same currency as a z-score — a standardized distance — so it's comparable across studies and scales.
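To make the formula concrete, here is a minimal sketch that computes Cohen's d from two samples. It assumes the denominator is the pooled standard deviation of the two groups (a common choice, though not the only one). The function names `cohens_d` and `effect_label`, the cutoffs used for labelling, and the sample numbers are all illustrative assumptions, not part of the lesson.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d, assuming the pooled standard deviation as the denominator."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance is the sample variance (n - 1 in the denominator).
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


def effect_label(d):
    """Rough label from the 0.2 / 0.5 / 0.8 conventions (cutoffs are an assumption)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Made-up scores, purely for illustration.
treatment = [12, 14, 16, 18, 20]
control = [10, 12, 14, 16, 18]

d = cohens_d(treatment, control)
print(f"d = {d:.2f} ({effect_label(d)})")
```

Because d is expressed in standard deviation units, multiplying every score by a constant (say, converting units) leaves it unchanged, which is what makes it comparable across scales.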

🎮 Effect Size & Power Explorer

Two groups, one real difference. The curves show the effect size (their overlap); the readout shows your power — the chance a study with this n actually detects the effect at α = .05.

[Interactive explorer: sliders for effect size (Cohen's d, default 0.50) and sample size per group (default n = 30); readouts for Cohen's d, effect label (medium at the defaults), power, and Type II error (miss).]

Power: could you even detect it?

Power is the probability that your study correctly rejects the null when a real effect exists: your chance of not missing it. The convention is to aim for at least 80%. Play with the sliders and two levers emerge (a third, the significance threshold α, is fixed at .05 in this explorer):

Bigger effect → more power. Slide d up: the curves pull apart, overlap shrinks, and a real difference becomes easy to catch. Tiny effects are genuinely hard to detect and need lots of data.

Bigger sample → more power. Slide n up and power climbs even with d fixed. More data sharpens the estimate, so smaller effects become detectable.
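Both levers can be seen in a few lines of code. The sketch below is not necessarily how the explorer computes its readout; it assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation rather than the exact t distribution. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value, about 1.96 at alpha = .05
    shift = d * sqrt(n_per_group / 2)   # expected test statistic under the real effect
    # Probability of landing beyond either critical value when the effect is real.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Vary one lever at a time: first the effect size, then the sample size.
for d, n in [(0.5, 30), (0.8, 30), (0.5, 64)]:
    power = approx_power(d, n)
    print(f"d={d}, n={n}: power ~ {power:.2f}, Type II error ~ {1 - power:.2f}")
```

At small n the exact t-based power is slightly lower than this approximation suggests, so treat the numbers as a guide rather than a final answer.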

Why underpowered studies are dangerous

If power is only 40%, you'll miss a real effect more often than you find it — and a "non-significant" result tells you almost nothing. Worse, the few significant results that do squeak through tend to overestimate the effect. This is why researchers run a power analysis before collecting data: pick the effect size you care about, the power you want (say 80%), and solve for the sample size you need.
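The "solve for the sample size" step can be sketched the same way. This assumes a two-sided, two-sample comparison with equal groups and the standard normal-approximation formula n = 2 × ((z for α/2 + z for power) / d)²; dedicated power-analysis tools use the t distribution and may return a slightly larger n. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # guards against false positives
    z_power = z.inv_cdf(power)          # guards against misses
    # Round up: a fractional participant cannot be recruited.
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group for 80% power")
```

Under these assumptions a medium effect (d = 0.5) needs roughly 63 people per group, while a small effect (d = 0.2) needs several hundred, which is why tiny effects demand so much data.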

Why it matters: reporting an effect size alongside the p-value, and planning for adequate power, goes a long way toward research that replicates rather than research that doesn't. Significance is the start of the story, never the whole of it.