Hypothesis tests and confidence intervals for means, proportions, ANOVA, χ² tables and variances: test statistic, df, p-value, critical value, CI, effect size and a decision vs α, with the rejection region charted.
Statistical inference is the bridge between a sample (what you measure) and the population (what you want to draw conclusions about). Instead of claiming an exact value, you quantify how much evidence there is and how much uncertainty remains, with two complementary tools:
Hypothesis test: do the data contradict a prior claim (H0)?
Confidence interval: which range of values is plausible for the parameter?
The test statistic
Almost every test boils down to one idea: measure how many standard errors separate what you observed from what H0 predicts.
$$\text{statistic} = \frac{\text{estimate} - \text{value under } H_0}{\text{standard error}}$$
The same recipe covers proportions: the statistic is $z = (\hat{p} - p_0)/\sqrt{p_0(1 - p_0)/n}$, with the standard error computed under H0 (which is why it uses $p_0$ rather than $\hat{p}$).
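As a minimal sketch of the proportion statistic above, the following uses only Python's standard library. The function name `proportion_z_test` and the counts (60 successes in 100 trials, p0 = 0.5) are made up for illustration, and the p-value shown is the two-tailed one.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """One-sample proportion z-test (two-tailed), with the SE computed under H0."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)      # uses p0, not p_hat
    z = (p_hat - p0) / se0             # standard errors between estimate and H0 value
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
print(f"z = {z:.3f}, p = {p_value:.4f}")   # z = 2.000, p = 0.0455
```

The observed proportion sits two standard errors above the value claimed by H0, which is enough to reject at α = 0.05 but not at α = 0.01.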
For a mean with unknown σ you use Student's t with ν=n−1 degrees of freedom:
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
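The one-sample t statistic can be computed by hand in a few lines. This is a sketch with an invented sample and an invented reference value μ0 = 4; the helper name `one_sample_t` is an assumption, not part of any library.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return the t statistic and its degrees of freedom (n - 1)."""
    n = len(sample)
    se = stdev(sample) / sqrt(n)       # stdev() uses the n - 1 denominator
    t = (mean(sample) - mu0) / se
    return t, n - 1

# Hypothetical sample; H0: mu = 4
sample = [2, 4, 4, 4, 5, 5, 7, 9]
t_stat, df = one_sample_t(sample, mu0=4)
print(f"t = {t_stat:.3f}, df = {df}")  # t = 1.323, df = 7
```

Turning this statistic into a p-value needs the t distribution with 7 degrees of freedom, which the standard library does not provide; a t table or a statistics package would supply it.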
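For two means, Welch's method (the recommended default) does not assume equal variances and adjusts the degrees of freedom:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \qquad \nu \approx \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$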
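Welch's statistic and the Welch-Satterthwaite approximation for the degrees of freedom can be sketched as follows. The two groups are invented data and `welch_t` is a hypothetical helper name; the point is that ν is generally not an integer and falls between min(n1, n2) − 1 and n1 + n2 − 2.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic with Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)          # s1^2 / n1
    vb = variance(b) / len(b)          # s2^2 / n2
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical independent groups with different sizes and spreads
group_a = [2, 4, 4, 4, 5, 5, 7, 9]
group_b = [1, 2, 3, 3, 4, 5]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```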
Every test contrasts two scenarios: the Null Hypothesis (H0), which assumes "no effect" or "no difference", and the Alternative Hypothesis (Ha), which represents what we want to demonstrate. The shape of Ha determines how we read the statistic:
Two-tailed (≠): We look for differences in either direction. $p = 2P(T \ge |t|)$.
One-tailed (< or >): We look for a directional difference. $p = P(T \le t)$ when Ha uses <, and $p = P(T \ge t)$ when Ha uses >.
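The three ways of reading the statistic translate directly into code. This sketch assumes a z statistic, so the reference distribution is the standard normal from the standard library; for a t statistic the same logic applies with the t distribution's CDF instead. The function name and the tail labels are illustrative choices.

```python
from statistics import NormalDist

def p_value(stat, tail):
    """p-value of a z statistic; tail is 'two-sided', 'greater' (Ha: >) or 'less' (Ha: <)."""
    cdf = NormalDist().cdf
    if tail == "two-sided":
        return 2 * (1 - cdf(abs(stat)))   # both tails beyond |stat|
    if tail == "greater":
        return 1 - cdf(stat)              # upper tail only
    if tail == "less":
        return cdf(stat)                  # lower tail only
    raise ValueError(f"unknown tail: {tail!r}")

for tail in ("two-sided", "greater", "less"):
    print(f"{tail:>9}: p = {p_value(1.96, tail):.4f}")
```

The same statistic gives very different p-values depending on Ha, which is why the direction of the alternative has to be fixed before looking at the data.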
The p-value is the probability of seeing a statistic at least as extreme as the one computed if H0 were true (the shaded area in the chart). The decision rule is direct: reject H0 when p<α. This is exactly equivalent to comparing the statistic with the critical value.
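The equivalence between the two decision rules can be checked numerically. The sketch below assumes a two-tailed z test and a handful of arbitrary statistic values; `decide` is a hypothetical helper that returns both decisions side by side.

```python
from statistics import NormalDist

def decide(z, alpha=0.05):
    """Return (reject by p-value, reject by critical value) for a two-tailed z test."""
    std_normal = NormalDist()
    p = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # boundary of the rejection region
    return p < alpha, abs(z) > z_crit

for z in (0.5, 1.9, 2.0, 3.1):
    by_p, by_crit = decide(z)
    print(f"z = {z:<4} reject by p-value: {by_p}, by critical value: {by_crit}")
```

Both columns always agree, because the critical value is by construction the statistic whose p-value equals α.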
Type I and Type II Errors
The significance level α is not arbitrary: it is the maximum tolerated probability of a Type I Error (false positive, rejecting H0 when it is true). Reducing α (e.g., to 0.01) makes the test more stringent but, for a fixed sample size, increases the risk of a Type II Error (β): failing to detect a real effect (false negative). The complement 1−β is known as the power of the test.
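The trade-off between α and β can be made concrete with a power calculation. This is a sketch under simplifying assumptions that are not in the text: a one-sided z test for a mean with known σ, for which the power is 1 − Φ(z_{1−α} − δ√n/σ), where δ is the true shift. The values δ = 0.5, σ = 1 and n = 25 are invented.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(effect, sigma, n, alpha):
    """Power of a one-sided z test (Ha: >) when the true mean shift is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = effect * sqrt(n) / sigma         # where the statistic is centred under Ha
    return 1 - std_normal.cdf(z_crit - shift)

# Hypothetical scenario: true shift 0.5, sigma 1, n = 25
for alpha in (0.10, 0.05, 0.01):
    power = power_one_sided_z(effect=0.5, sigma=1.0, n=25, alpha=alpha)
    print(f"alpha = {alpha:.2f}  power = {power:.3f}  beta = {1 - power:.3f}")
```

With everything else fixed, each reduction in α lowers the power and raises β; increasing n is the usual way to recover the lost power.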
Beyond means: χ² and F
Not everything is compared by subtracting. When what accumulates are squared deviations, the statistic can no longer be negative and its sampling distribution stops being symmetric:
χ² (chi-square): compares observed counts with expected ones ($\chi^2 = \sum (O - E)^2 / E$) or a sample variance with a reference one ($\chi^2 = (n - 1)s^2/\sigma_0^2$).
F: compares two variances as a ratio. ANOVA uses this idea to compare 3+ means: if the groups differ, the variation between groups exceeds the variation within them (F=MSB/MSW).
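Both χ² statistics are short sums or ratios. The sketch below uses invented die-roll counts and an invented variance comparison; the function names are hypothetical. It computes the statistics only, since the χ² p-value needs a distribution that the standard library does not include.

```python
def chi2_goodness_of_fit(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_variance(sample_var, n, sigma0_sq):
    """(n - 1) s^2 / sigma0^2 for testing one variance against a reference."""
    return (n - 1) * sample_var / sigma0_sq

# Hypothetical data: 120 rolls of a die, fair die expected (df = 6 - 1 = 5)
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6
chi2_fit = chi2_goodness_of_fit(observed, expected)
print(f"goodness of fit: chi2 = {chi2_fit:.2f}")          # chi2 = 2.90

# Hypothetical data: s^2 = 4.0 from n = 11 observations, reference variance 2.0 (df = 10)
chi2_var = chi2_variance(sample_var=4.0, n=11, sigma0_sq=2.0)
print(f"variance test:   chi2 = {chi2_var:.2f}")          # chi2 = 20.00
```

Every term in the sum is a squared deviation, so neither statistic can be negative.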
That is why ANOVA and the χ² tests on counts are right-tailed: only a large statistic signals disagreement with H0. Variance tests do admit two tails, but since χ² and F are not symmetric, the two critical values are not mirror images of each other.
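For ANOVA, the F ratio can be built from the between-group and within-group sums of squares. This is a sketch with three small invented groups; `one_way_anova_f` is a hypothetical helper, and it returns the statistic together with its two degrees of freedom (k − 1 and N − k).

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = MSB / MSW for a one-way ANOVA, plus (df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    msb = ss_between / (k - 1)           # variation between group means
    msw = ss_within / (n_total - k)      # variation inside the groups
    return msb / msw, (k - 1, n_total - k)

# Hypothetical groups: the third one is clearly shifted
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, dfs = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}, df = {dfs}")   # F = 13.00, df = (2, 6)
```

If all group means coincided, the between-group sum of squares would be zero and F would be 0; only values well above 1 count as evidence against H0, which is why the test looks at the right tail alone.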
The confidence interval
A 100(1−α)% CI gives the range of values compatible with the data. For a mean:
$$\bar{x} \pm t_{1-\alpha/2,\,\nu}\,\frac{s}{\sqrt{n}}$$
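The interval for a mean can be sketched with the standard library if the t critical value is supplied from a table. Here the sample is invented, `mean_ci` is a hypothetical helper, and 2.365 is the tabulated 0.975 quantile of the t distribution with ν = 7 (95% confidence).

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(sample, t_crit):
    """Confidence interval x_bar ± t_crit * s / sqrt(n); t_crit comes from a t table."""
    n = len(sample)
    margin = t_crit * stdev(sample) / sqrt(n)
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical sample of n = 8 observations, so nu = 7
sample = [2, 4, 4, 4, 5, 5, 7, 9]
T_CRIT_975_DF7 = 2.365                 # t quantile for 1 - alpha/2 = 0.975, nu = 7
low, high = mean_ci(sample, T_CRIT_975_DF7)
print(f"95% CI: ({low:.2f}, {high:.2f})")   # 95% CI: (3.21, 6.79)
```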
There is a useful duality: in a two-tailed test, rejecting H0 at level α is the same as the H0 value falling outside the 100(1−α)% CI.
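The duality can be verified by running the test and building the interval for several candidate H0 values. To stay within the standard library, this sketch assumes a z test with known σ; the sample mean 10.2, σ = 1.5 and n = 36 are invented.

```python
from math import sqrt
from statistics import NormalDist

def z_test_and_ci(x_bar, sigma, n, mu0, alpha=0.05):
    """Return (H0 rejected?, mu0 outside the 100(1 - alpha)% CI?) for a two-tailed z test."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    reject = abs((x_bar - mu0) / se) > z_crit
    low, high = x_bar - z_crit * se, x_bar + z_crit * se
    outside = not (low <= mu0 <= high)
    return reject, outside

# Hypothetical summary data; try several values of mu0
for mu0 in (9.0, 9.5, 10.0, 10.5, 11.0):
    reject, outside = z_test_and_ci(x_bar=10.2, sigma=1.5, n=36, mu0=mu0)
    print(f"mu0 = {mu0:>4}: reject H0 = {reject}, outside CI = {outside}")
```

The two answers match for every μ0, so the interval can be read as the set of H0 values the test would not reject.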
Significance ≠ effect size
A small p-value says the effect is detectable, not that it is large. That is why we also report the effect size (Cohen's d, $d = (\bar{x} - \mu_0)/s$): how many standard deviations the difference spans, independent of n.
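A short numerical illustration of the difference: with the same sample mean and standard deviation, Cohen's d stays fixed while the t statistic grows like √n. The summary values (mean 100.5 against μ0 = 100, s = 5) are invented for the example.

```python
from math import sqrt

def cohens_d(x_bar, mu0, s):
    """Effect size: difference in units of the sample standard deviation."""
    return (x_bar - mu0) / s

def t_statistic(x_bar, mu0, s, n):
    """Test statistic: difference in units of the standard error."""
    return (x_bar - mu0) / (s / sqrt(n))

# Hypothetical summary statistics, evaluated at three sample sizes
for n in (25, 400, 10_000):
    d = cohens_d(x_bar=100.5, mu0=100, s=5)
    t = t_statistic(x_bar=100.5, mu0=100, s=5, n=n)
    print(f"n = {n:>6}  d = {d:.2f}  t = {t:.1f}")
```

The effect is the same small tenth of a standard deviation in all three rows; only the ability to detect it changes with n.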
Each family has its own: Cohen's h for proportions, $\eta^2$ in ANOVA (the fraction of variation explained by the groups), w and Cramér's V for count tables. They all answer the same question: does the effect matter, beyond being statistically detectable?
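The other effect sizes follow their standard textbook definitions, which the text names but does not spell out: h = 2·arcsin√p1 − 2·arcsin√p2, η² = SS_between / SS_total, w = √(χ²/n) and V = √(χ²/(n·min(r − 1, c − 1))). A sketch with invented inputs:

```python
from math import asin, sqrt

def cohens_h(p1, p2):
    """Difference between two proportions on the arcsine scale."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def eta_squared(ss_between, ss_total):
    """Fraction of the total variation explained by the groups."""
    return ss_between / ss_total

def cohens_w(chi2, n):
    """Effect size for a chi-square test on counts."""
    return sqrt(chi2 / n)

def cramers_v(chi2, n, n_rows, n_cols):
    """Cohen's w rescaled by table size, so it lies between 0 and 1."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical inputs, for illustration only
print(f"h    = {cohens_h(0.6, 0.5):.3f}")
print(f"eta2 = {eta_squared(ss_between=26, ss_total=32):.4f}")
print(f"w    = {cohens_w(chi2=10, n=100):.3f}")
print(f"V    = {cramers_v(chi2=10, n=100, n_rows=3, n_cols=3):.3f}")
```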
What the p-value does (and doesn't) say
The p-value is not the probability that H0 is true, nor the probability of being wrong. It is how unusual the data are assuming H0. We also never "accept" H0: when p≥α there simply is not enough evidence to reject it.
Which test to use?
| Question | Test |
| --- | --- |
| One mean vs. a value | 1-sample t |
| Two independent groups | 2-sample t (Welch) |
| Before vs. after (same subjects) | Paired t |
| σ known / large n | z |
| One or two proportions | Proportion z |
| 3+ means at once | ANOVA (F) |
| Counts per category | χ² (fit / indep.) |
| One variance / two variances | χ² / F |
Key Assumptions
For the p-value to be valid, the data must meet certain conditions:
Independence: Observations must not be correlated (fundamental for all tests).
Normality: Tests like the t assume a normal population, though with large samples (n≥30 is a common rule of thumb) the Central Limit Theorem relaxes this.
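Sample size: For proportions and counts, the normal and χ² approximations require at least 5 expected successes and 5 expected failures (proportions) or an expected frequency of at least 5 in every cell (counts).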
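The sample-size condition is easy to check before running a test. This sketch takes the threshold of 5 from the list above and, as an assumption, applies it to expected failures as well as expected successes; both helper names are hypothetical.

```python
def proportion_counts_ok(n, p0, minimum=5):
    """Expected successes n*p0 and expected failures n*(1 - p0) both reach the minimum."""
    return n * p0 >= minimum and n * (1 - p0) >= minimum

def expected_cells_ok(expected, minimum=5):
    """Every expected frequency in a count table reaches the minimum."""
    return all(e >= minimum for e in expected)

print(proportion_counts_ok(n=100, p0=0.5))    # True: 50 and 50 expected
print(proportion_counts_ok(n=20, p0=0.1))     # False: only 2 expected successes
print(expected_cells_ok([20, 20, 20, 3]))     # False: one cell expects 3
```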
Common critical values
| α (two-tailed) | z* |
| --- | --- |
| 0.10 | 1.645 |
| 0.05 | 1.960 |
| 0.02 | 2.326 |
| 0.01 | 2.576 |
With the t distribution the critical value is a bit larger (heavier tails) and approaches these values as ν grows.
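The z* values listed above are the 1 − α/2 quantiles of the standard normal distribution, so they can be reproduced with the standard library. The helper name `z_critical` is an illustrative choice; t critical values would need a t table or a statistics package.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-tailed critical value: the 1 - alpha/2 quantile of the standard normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.02, 0.01):
    print(f"alpha = {alpha:.2f}  z* = {z_critical(alpha):.3f}")
# alpha = 0.10  z* = 1.645
# alpha = 0.05  z* = 1.960
# alpha = 0.02  z* = 2.326
# alpha = 0.01  z* = 2.576
```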