Statistical Inference

Hypothesis tests and confidence intervals for means, proportions, ANOVA, χ² count tables and variances. For each test, the calculator reports the test statistic, degrees of freedom, p-value, critical value, confidence interval, effect size and the decision against α, and charts the rejection region.

Example result (one-sample t test; inputs: mean x̄, std. deviation s, size n and mean under H₀ μ₀; two-tailed alternative, α = 0.05)

[Chart: observed statistic, critical value and rejection region]

Reject H₀: p-value = 0.01424 vs. α = 0.05

Statistic: 2.608
p-value: 0.01424
Point estimate: 5
Standard error: 0.3834
Degrees of freedom: 29
Critical value: ±2.045
95% CI: [4.216, 5.784]
Effect size (Cohen's d): 0.4762

Fundamentals & Explanation

What is statistical inference?

It is the bridge between a sample (what you measure) and the population (what you want to conclude). Instead of claiming an exact value, you quantify how much evidence there is and how much uncertainty remains, with two complementary tools:

Hypothesis test: do the data contradict a prior claim ($H_0$)?
Confidence interval: which range of values is plausible for the parameter?

The test statistic

Almost every test boils down to one idea: measure how many standard errors separate what you observed from what $H_0$ predicts.

$$\text{statistic}=\dfrac{\text{estimate}-\text{value under }H_0}{\text{standard error}}$$

The same recipe covers proportions: the statistic is $z=(\hat{p}-p_0)/\sqrt{p_0(1-p_0)/n}$, with the standard error computed under $H_0$ (which is why it uses $p_0$ rather than $\hat{p}$).
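A minimal Python sketch of this recipe for one proportion, using only the standard library (`statistics.NormalDist` supplies the normal CDF). The function name `proportion_z_test` and the 60-out-of-100 data are hypothetical, and the p-value returned is the two-tailed one:

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes: int, n: int, p0: float) -> tuple[float, float]:
    """One-proportion z test; returns (z, two-tailed p-value)."""
    p_hat = successes / n
    se0 = sqrt(p0 * (1 - p0) / n)       # standard error computed under H0 (uses p0, not p_hat)
    z = (p_hat - p0) / se0              # (estimate - value under H0) / standard error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
z, p = proportion_z_test(60, 100, 0.5)
print(f"z = {z:.3f}, p = {p:.4f}")  # z = 2.000, p = 0.0455
```

Note that `se0` is built from `p0`, mirroring the formula above.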

For a mean with unknown $\sigma$, you use Student's t with $\nu=n-1$ degrees of freedom:

$$t=\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}}$$
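The one-sample t statistic takes a few lines of Python. This sketch assumes the inputs x̄ = 5, n = 30, s = 2.1 and μ₀ = 4. The last two are not shown on the page: they were back-calculated from the reported standard error (0.3834) and statistic (2.608), so treat them as illustrative:

```python
from math import sqrt


def one_sample_t(x_bar: float, s: float, n: int, mu0: float) -> tuple[float, float, int]:
    """Return (t, standard error, degrees of freedom) for a one-sample t test."""
    se = s / sqrt(n)          # estimated standard error of the mean
    t = (x_bar - mu0) / se    # number of standard errors between x_bar and mu0
    return t, se, n - 1


# Assumed inputs consistent with the example result at the top of the page
t, se, df = one_sample_t(x_bar=5, s=2.1, n=30, mu0=4)
print(f"t = {t:.3f}, SE = {se:.4f}, df = {df}")  # t = 2.608, SE = 0.3834, df = 29
```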

For two means, Welch's method (the recommended default) does not assume equal variances and estimates the degrees of freedom with the Welch–Satterthwaite formula:

$$t=\dfrac{(\bar{x}_1-\bar{x}_2)-\Delta_0}{\sqrt{s_1^2/n_1+s_2^2/n_2}}$$

$$\nu=\dfrac{\left(s_1^2/n_1+s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}}$$
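A direct transcription of both Welch formulas, offered as a sketch (the function `welch_t` and the summary statistics are hypothetical). Note that ν is generally not an integer:

```python
from math import sqrt


def welch_t(m1: float, s1: float, n1: int, m2: float, s2: float, n2: int,
            delta0: float = 0.0) -> tuple[float, float]:
    """Welch's two-sample t: returns (t, Welch-Satterthwaite degrees of freedom)."""
    v1, v2 = s1**2 / n1, s2**2 / n2                 # squared standard error of each mean
    t = ((m1 - m2) - delta0) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Hypothetical summaries with unequal spreads and unequal sizes
t, df = welch_t(m1=10.0, s1=2.0, n1=12, m2=8.5, s2=4.0, n2=20)
print(f"t = {t:.3f}, df = {df:.1f}")
```

When both groups have the same size and the same sample variance, ν reduces to $n_1+n_2-2$; otherwise it falls between $\min(n_1,n_2)-1$ and $n_1+n_2-2$.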

Hypotheses, p-value and decision

Every test contrasts two scenarios: the Null Hypothesis ($H_0$), which assumes "no effect" or "no difference", and the Alternative Hypothesis ($H_a$), which represents what we want to demonstrate. The shape of $H_a$ determines how we read the statistic:

Two-tailed ($\neq$): We look for differences in either direction. $p = 2\,P(T \ge |t|)$.
One-tailed ($<$ or $>$): We look for a difference in one stated direction. $p = P(T \le t)$ for $<$ and $p = P(T \ge t)$ for $>$.
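The three rules can be sketched as one Python function. It assumes a null distribution that is symmetric around zero (standard normal by default, via `statistics.NormalDist`). The labels `"two-sided"`, `"less"` and `"greater"` are naming chosen here for illustration:

```python
from statistics import NormalDist


def p_value(stat: float, alternative: str, cdf=NormalDist().cdf) -> float:
    """p-value for a statistic whose null distribution is symmetric around 0.

    alternative: "two-sided" (H_a: !=), "less" (H_a: <) or "greater" (H_a: >).
    `cdf` defaults to the standard normal CDF.
    """
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(stat)))   # 2 P(T >= |t|)
    if alternative == "less":
        return cdf(stat)                  # P(T <= t)
    if alternative == "greater":
        return 1 - cdf(stat)              # P(T >= t)
    raise ValueError(f"unknown alternative: {alternative!r}")


for alt in ("two-sided", "less", "greater"):
    print(alt, round(p_value(1.96, alt), 4))
```

For z = 1.96 this gives about 0.05 (two-sided), 0.975 (less) and 0.025 (greater).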

The p-value is the probability, computed assuming $H_0$ is true, of obtaining a statistic at least as extreme as the one observed (the shaded area in the chart). The decision rule is direct: reject $H_0$ when $p < \alpha$. This is exactly equivalent to checking whether the statistic falls beyond the critical value, inside the rejection region.
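To see that the two decision rules agree, the following standard-library sketch reproduces the example result at the top of the page (again assuming s = 2.1 and μ₀ = 4). Python's standard library has no Student t distribution, so the hypothetical helpers `t_cdf` and `t_critical` integrate the t density numerically with Simpson's rule and invert it by bisection. In practice a statistics library would do this for you:

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x >= 0 else 0.5 - area


def t_critical(alpha: float, df: float) -> float:
    """Two-tailed critical value t_{1-alpha/2, df}, found by bisection on t_cdf."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha, df = 0.05, 29
t = (5 - 4) / (2.1 / sqrt(30))   # example statistic (s = 2.1 and mu0 = 4 assumed)
p = 2 * (1 - t_cdf(abs(t), df))  # two-tailed p-value, 2 P(T >= |t|)
crit = t_critical(alpha, df)

print(f"t = {t:.3f}, p = {p:.5f}, critical value = ±{crit:.3f}")
print("reject (p < alpha):", p < alpha)
print("reject (|t| > critical value):", abs(t) > crit)
```

Both checks reach the same decision. The p-value lands close to the reported 0.01424, and the critical value lands close to ±2.045.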

Type I and Type II Errors

The significance level $\alpha$ has a precise meaning: it is the maximum tolerated probability of a Type I Error (false positive: rejecting $H_0$ when it is true). Reducing $\alpha$ (e.g., to 0.01) makes the test more stringent but, for a fixed sample size, increases the risk of a Type II Error ($\beta$): failing to detect a real effect (false negative). The complement $1-\beta$ is known as the power of the test.
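The trade-off can be made concrete with a normal-theory power calculation. This is a sketch for a right-tailed z test with known σ, an assumption made to keep the formula simple. The effect size and n are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def power_one_sided_z(effect: float, sigma: float, n: int, alpha: float) -> float:
    """Power of a right-tailed z test of H0: mu = mu0 when the true mean is mu0 + effect."""
    z_crit = Z.inv_cdf(1 - alpha)         # reject H0 when z > z_crit
    shift = effect / (sigma / sqrt(n))    # mean of the z statistic under H_a
    return 1 - Z.cdf(z_crit - shift)      # P(z > z_crit | H_a) = 1 - beta


# Hypothetical scenario: true effect of half a standard deviation, n = 25
for alpha in (0.05, 0.01):
    power = power_one_sided_z(effect=0.5, sigma=1.0, n=25, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.3f}, beta = {1 - power:.3f}")
```

With these numbers, the power drops from about 0.80 at α = 0.05 to about 0.57 at α = 0.01, so β rises accordingly.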

Beyond means: χ² and F

Not every comparison is a difference. When the statistic accumulates squared deviations, it can no longer be negative and its sampling distribution is no longer symmetric:

χ² (chi-square): compares observed counts with expected ones ($\chi^2=\textstyle\sum (O-E)^2/E$) or a sample variance with a reference one ($\chi^2=(n-1)s^2/\sigma_0^2$).
F: compares two variances as a ratio. ANOVA uses this idea to compare 3+ means: if the groups differ, the variation between groups exceeds the variation within them ($F=\mathrm{MSB}/\mathrm{MSW}$).
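Plain-Python sketches of the three statistics just listed follow. The function names and toy data are illustrative. The functions return only the statistic (and, for ANOVA, its degrees of freedom); turning them into p-values requires the χ² or F distribution from a statistics library:

```python
def chi_square_gof(observed: list[float], expected: list[float]) -> float:
    """Pearson chi-square: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_variance(s: float, n: int, sigma0: float) -> float:
    """One-sample variance statistic (n - 1) s^2 / sigma0^2, with n - 1 df."""
    return (n - 1) * s**2 / sigma0**2


def anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """One-way ANOVA: returns (F = MSB / MSW, df_between, df_within)."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: spread of the group means around the grand mean
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of each value around its own group mean
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n - k
    return (ssb / df_b) / (ssw / df_w), df_b, df_w


print(chi_square_gof([10, 20, 30], [20, 20, 20]))    # 10.0
print(chi_square_variance(s=2.5, n=16, sigma0=2.0))  # 23.4375
print(anova_f([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))    # (27.0, 2, 6)
```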

That is why ANOVA and the χ² tests on counts are right-tailed: only a large statistic signals disagreement with $H_0$. Variance tests do admit two tails, but since χ² and F are not symmetric, the two critical values are not mirror images of each other.

The confidence interval

A $100(1-\alpha)\%$ CI gives the range of values compatible with the data. For a mean:

$$\bar{x}\;\pm\;t_{1-\alpha/2,\,\nu}\,\dfrac{s}{\sqrt{n}}$$

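There is a useful duality: in a two-tailed test, rejecting $H_0$ at level $\alpha$ is equivalent to the $H_0$ value falling outside the $100(1-\alpha)\%$ CI. For the t test of a mean the correspondence is exact; for a proportion, where the test uses $p_0$ in its standard error, it can be approximate.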
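The interval and the duality can be checked numerically. This sketch reuses the example values (with the assumed s = 2.1 and μ₀ = 4) and takes the critical value 2.045 as given rather than computing it:

```python
from math import sqrt


def mean_ci(x_bar: float, s: float, n: int, t_crit: float) -> tuple[float, float]:
    """x_bar ± t_crit * s / sqrt(n), where t_crit = t_{1-alpha/2, n-1} is supplied by the caller."""
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin


x_bar, s, n, mu0, t_crit = 5.0, 2.1, 30, 4.0, 2.045
lo, hi = mean_ci(x_bar, s, n, t_crit)
t = (x_bar - mu0) / (s / sqrt(n))
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # 95% CI: [4.216, 5.784]

# Duality: mu0 outside the CI  <=>  |t| beyond the critical value
print(not (lo <= mu0 <= hi), abs(t) > t_crit)
```

The equivalence holds algebraically: $\mu_0$ lies outside $\bar{x} \pm t^\* \cdot SE$ exactly when $|\bar{x}-\mu_0| > t^\* \cdot SE$, that is, when $|t| > t^\*$.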

Significance ≠ effect size

A small p-value says the effect is detectable, not that it is large. That is why we also report the effect size (Cohen's $d$, $d=(\bar{x}-\mu_0)/s$): how many standard deviations the difference spans. Unlike the test statistic, which grows with $\sqrt{n}$, $d$ does not grow with the sample size.
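For one sample, $t = d\sqrt{n}$, so the contrast between significance and effect size shows up as soon as n changes. A small sketch using the example's (partly assumed) summaries:

```python
from math import sqrt


def cohens_d(x_bar: float, mu0: float, s: float) -> float:
    """Cohen's d for one sample: the difference measured in standard deviations."""
    return (x_bar - mu0) / s


def t_stat(x_bar: float, mu0: float, s: float, n: int) -> float:
    """The t statistic equals d * sqrt(n), so it grows with the sample size."""
    return (x_bar - mu0) / (s / sqrt(n))


# Same summaries at two sample sizes: d stays the same, t doubles when n is quadrupled
for n in (30, 120):
    print(n, round(cohens_d(5, 4, 2.1), 4), round(t_stat(5, 4, 2.1, n), 3))
```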

Each family of tests has its own measure: Cohen's $h$ for proportions, $\eta^2$ in ANOVA (the fraction of variation explained by the groups), $w$ and Cramér's $V$ for count tables. They all answer the same question: does the effect matter, beyond being statistically detectable?
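Minimal sketches of the other effect sizes named above follow. The formulas are the standard ones ($h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, $\eta^2 = SS_B/SS_T$, $w = \sqrt{\chi^2/n}$, $V = w/\sqrt{\min(r,c)-1}$). The function names and toy inputs are illustrative:

```python
from math import asin, sqrt


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


def eta_squared(groups: list[list[float]]) -> float:
    """eta^2 = SSB / SST: share of the total variation explained by the groups."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    sst = sum((x - grand) ** 2 for x in values)
    return ssb / sst


def chi_square_w_and_v(table: list[list[float]]) -> tuple[float, float]:
    """Cohen's w and Cramér's V for an r x c table of counts (Pearson chi-square, no correction)."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n  # expected count under independence
            chi2 += (observed - expected) ** 2 / expected
    w = sqrt(chi2 / n)
    v = w / sqrt(min(len(table), len(table[0])) - 1)
    return w, v


print(round(cohens_h(0.6, 0.5), 3))
print(eta_squared([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
print(chi_square_w_and_v([[20, 15, 5], [10, 15, 25]]))
```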

What the p-value does (and doesn't) say

The p-value is not the probability that $H_0$ is true, nor the probability of being wrong. It measures how unusual the data would be if $H_0$ were true. We also never "accept" $H_0$: when $p \ge \alpha$, there simply is not enough evidence to reject it.
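A quick simulation illustrates the point. When $H_0$ is true by construction, p-values below α still occur, in about a fraction α of the samples. This sketch uses a z test with known σ = 1 so that only the standard library is needed. The sample size, trial count and seed are arbitrary choices:

```python
import random
from math import sqrt
from statistics import NormalDist

Z = NormalDist()
random.seed(1)  # fixed seed so the run is reproducible

alpha, n, trials = 0.05, 20, 20_000
rejections = 0
for _ in range(trials):
    # H0 is true: data come from N(0, 1) and we test mu0 = 0 with sigma = 1 known
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = (sum(sample) / n - 0.0) / (1.0 / sqrt(n))
    p = 2 * (1 - Z.cdf(abs(z)))
    if p < alpha:
        rejections += 1

print(f"share of p < {alpha}: {rejections / trials:.3f}")  # close to alpha, not zero
```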

Which test to use?

Question	Test
One mean vs. a value	1-sample t
Two independent groups	2-sample t (Welch)
Before vs. after (same subjects)	Paired t
$\sigma$ known / large n	z
One or two proportions	Proportion z
3+ means at once	ANOVA (F)
Counts per category	χ² (fit / indep.)
One variance / two variances	χ² / F

Key Assumptions

For the p-value to be valid, the data must meet certain conditions:

Independence: Observations must be independent of one another (fundamental for all tests).
Normality: The t tests assume a normal population, though with large samples (a common rule of thumb is $n \ge 30$) the Central Limit Theorem relaxes this.
Sample size: For proportions, the usual rule of thumb is at least 5 expected successes and 5 expected failures ($np_0 \ge 5$ and $n(1-p_0) \ge 5$). For χ² tests on counts, each expected frequency should be at least 5.
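The count conditions can be written as simple checks. Both functions are hypothetical helpers that encode the rule of thumb of 5 stated above. Some texts use a stricter threshold, which is why `minimum` is a parameter:

```python
def proportion_condition_ok(n: int, p0: float, minimum: float = 5) -> bool:
    """Rule of thumb for the proportion z test: n*p0 and n*(1 - p0) both >= minimum."""
    return n * p0 >= minimum and n * (1 - p0) >= minimum


def expected_counts_ok(expected: list[float], minimum: float = 5) -> bool:
    """Rule of thumb for chi-square tests: every expected frequency >= minimum."""
    return all(e >= minimum for e in expected)


print(proportion_condition_ok(n=40, p0=0.1))  # only 4 expected successes, so False
print(expected_counts_ok([12.0, 8.5, 6.2]))   # True
```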

Common critical values

$\alpha$ (two-tailed)	$z^*$
0.10	1.645
0.05	1.960
0.02	2.326
0.01	2.576

With the t distribution the critical value is a bit larger (heavier tails) and approaches these values as $\nu$ grows.
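The $z^*$ column can be regenerated from the inverse normal CDF, $z^* = \Phi^{-1}(1-\alpha/2)$. For example, using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

# Two-tailed critical value: z* = inverse standard normal CDF at 1 - alpha/2
for alpha in (0.10, 0.05, 0.02, 0.01):
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f}: z* = {z_star:.3f}")
```

The printed values (1.645, 1.960, 2.326 and 2.576) match the table.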
