Hypothesis Testing & NHST: Definition, Steps, Tips, Examples

What Is Hypothesis Testing?

Hypothesis testing is a statistical method used to determine whether there is enough evidence in sample data to draw conclusions about a population. Instead of collecting data from an entire population, you take a sample and test whether the evidence supports or contradicts an assumption about that population.

In everyday terms: you have a hunch, you collect data, and you ask whether the data are consistent with “nothing is going on” or whether they strain credulity enough to suggest something real is happening.

For example, if a company claims that its website averages 50 visitors per day, hypothesis testing can be applied to past visitor data to judge whether the data are consistent with that claim or point to a different average.

In the social and biomedical sciences, the stakes are higher. There, null hypothesis significance testing (NHST) is used to ask questions like:

  • Does a new antidepressant actually reduce symptoms more than a placebo?
  • Do boys and girls differ in academic self-efficacy?
  • Does social isolation increase mortality risk?

What Is Null Hypothesis Significance Testing (NHST)?

Null Hypothesis Significance Testing is the specific, formalised framework that underlies the vast majority of statistical tests published in scientific journals. While “hypothesis testing” is a broad term covering many philosophies, NHST refers to one particular procedure: you begin by assuming the null hypothesis is true, collect data, compute a test statistic, and then ask how probable your observed result would be under that assumption.

The word null is key. It does not mean “zero” in a casual sense — it means the hypothesis of no effect, no difference, no relationship. Everything in NHST is organised around building a case against this default position.

The Four Core Components of NHST

Component | Role in NHST
Null hypothesis (H₀) | The default claim of “no effect” that the data are tested against
Alternative hypothesis (H₁) | The research claim the investigator hopes the data will support
Test statistic | A number summarising how far the sample result is from what H₀ predicts
P value | The probability of observing a result this extreme or more extreme, given that H₀ is true

How NHST Differs from Simply “Testing a Hypothesis”

It is worth distinguishing NHST from the broader scientific notion of hypothesis testing. Scientists test hypotheses all the time through prediction, experimentation, and observation. NHST is specifically a probabilistic decision rule applied to sample data. Its output is not a verdict on whether a theory is correct; it is a statement about whether the data are unusual enough, under a particular null model, to warrant further scrutiny. A p value below 0.05 is not a discovery; it is a signal that the null hypothesis struggles to explain your data.

Why “Null Hypothesis” and Not Just “Hypothesis”?

The null is set up as a straw man precisely because it is specific enough to be tested probabilistically. You cannot directly prove that a drug works: there are infinitely many ways it could work, at varying magnitudes. But you can ask a narrow, testable question: is the observed improvement consistent with pure chance? If the answer is “barely,” you reject the null. The null is presumed true by default, so the burden of proof sits with the researcher, who must accumulate enough evidence against it.

The Two Schools of Thought Behind NHST

There are two classical schools of thought on how best to use the p-value: the Fisher school and the Neyman-Pearson school. There is also a Bayesian way to interpret the p-value, but that presents a whole other set of dilemmas.  

School | Core Idea | How Decision Is Made
Fisher | P value = continuous measure of evidence against H₀ | Smaller p = stronger evidence; no fixed threshold
Neyman-Pearson | Pre-specify α; control long-run error rates | Reject or don’t reject based on α threshold
Bayesian | Update prior beliefs with new data | Posterior probability, Bayes factors

Modern practice, especially in journals, blends Fisher and Neyman-Pearson, often awkwardly. Understanding which framework you are working in matters enormously for interpretation.

Core Concepts and Key Terms

What are Null and Alternative Hypotheses?

The null hypothesis (H₀) is a statement of “no difference,” “no association,” or “no treatment effect.” The alternative hypothesis (Hₐ, also written H₁) is a statement of “difference,” “association,” or “treatment effect.” H₀ is assumed to be true until the data provide sufficient evidence against it. Hₐ is the hypothesis the researcher hopes to bolster.

Term | Symbol | Plain-English Meaning
Null hypothesis | H₀ | “Nothing is going on; any observed difference is chance”
Alternative hypothesis | H₁ / Hₐ | “Something real is happening”
Significance level | α | The false-positive rate you are willing to tolerate
P value | p | Probability of observing these data (or more extreme) if H₀ were true
Test statistic | Z, t, χ², F | How many standard errors your sample result sits from H₀
Critical value | (none) | The test-statistic threshold that demarcates “reject” from “fail to reject”
Degrees of freedom | df | A count tied to sample size; used to find the correct reference distribution

What is the p value?

The p value answers the question: “If the null hypothesis were true, what is the probability of observing the current data or data that are more extreme?” Note that the p value is NOT the probability that the null hypothesis (or any other hypothesis) is right or wrong. In fact, it is calculated on the assumption that the null hypothesis is right!

This distinction is crucial and perpetually misunderstood. More on it in the misconceptions section below.
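To make the definition concrete, here is a minimal Python sketch that estimates a p value by simulation. The scenario (60 heads in 100 coin flips, tested against a fair-coin null) and the names `simulate_heads` and `N_SIMULATIONS` are illustrative assumptions, not taken from this article's examples.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

N_FLIPS = 100           # hypothetical experiment: 100 coin flips
OBSERVED_HEADS = 60     # hypothetical observed result
N_SIMULATIONS = 20_000  # number of datasets generated under H0


def simulate_heads(n_flips):
    """Count heads in n_flips fair-coin flips, i.e. data generated under H0."""
    # Each of the n_flips random bits is one flip; a 1 bit counts as heads.
    return bin(random.getrandbits(n_flips)).count("1")


# One-sided p value: share of null-generated datasets that are at least as
# extreme as the observed one (60 or more heads).
as_extreme = sum(
    simulate_heads(N_FLIPS) >= OBSERVED_HEADS for _ in range(N_SIMULATIONS)
)
p_value = as_extreme / N_SIMULATIONS
print(f"Estimated one-sided p value: {p_value:.4f}")
```

Every simulated dataset here is produced with H₀ switched on, which is why the resulting number cannot be read as the probability that H₀ is true.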

Significance Level (α)

The significance level (α) is the threshold the p value must fall at or below before we reject H₀; it encodes how much risk of a false alarm we are willing to accept. By convention, α is usually set to 0.05 (5%). Choosing α = 0.05 means accepting a 5% chance of rejecting the null hypothesis when it is in fact true, i.e., a false alarm.

In biomedical contexts where a wrong decision could harm patients, researchers often set α = 0.01 or even 0.001.
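The meaning of α as a long-run false-alarm rate can be checked by simulation. The sketch below is only an illustration under assumed settings: it repeatedly draws samples from a population where H₀ is true (mean 0, known SD 1), runs a two-tailed z-test each time, and counts how often H₀ is wrongly rejected. The sample size, number of experiments, and function name `rejects_null` are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

random.seed(42)  # fixed seed so the sketch is reproducible

ALPHA = 0.05
N_EXPERIMENTS = 5_000
SAMPLE_SIZE = 30
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed critical value


def rejects_null():
    """Run one experiment in which H0 (mean = 0, SD = 1) is actually true."""
    sample = [random.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    standard_error = 1 / SAMPLE_SIZE ** 0.5  # population SD assumed known (1)
    z = fmean(sample) / standard_error
    return abs(z) > Z_CRIT


false_positive_rate = (
    sum(rejects_null() for _ in range(N_EXPERIMENTS)) / N_EXPERIMENTS
)
print(f"Share of true nulls rejected: {false_positive_rate:.3f}")
```

The observed share should land close to 0.05; rerunning with `ALPHA = 0.01` would bring it down to roughly one rejection in a hundred.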

One-Tailed vs. Two-Tailed Tests

A one-tailed test is used when we expect a change in only one direction: either up or down, but not both. A two-tailed test is used when we want to see if there is a difference in either direction, higher or lower.  

Test Type | When to Use | Example Hypothesis
Right-tailed | Expecting an increase | H₁: μ > 50
Left-tailed | Expecting a decrease | H₁: μ < 50
Two-tailed | Any difference, direction unknown | H₁: μ ≠ 50

Social science example: A sociologist testing whether immigrants score differently (not just higher or lower) on a civic knowledge test compared to native-born citizens would use a two-tailed test, since the direction of difference is theoretically uncertain.

Biomedical example: A pharmacologist testing whether a new antihypertensive lowers blood pressure (not raises it) would use a one-tailed (left-tailed) test.
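The choice of tail changes the p value even when the test statistic is identical. The following sketch uses a standard normal reference distribution and a hypothetical z statistic of −1.8; the helper name `p_value_from_z` is made up for this illustration.

```python
from statistics import NormalDist


def p_value_from_z(z, tail):
    """Return the p value for a z statistic under the chosen alternative."""
    std_normal = NormalDist()
    if tail == "left":    # H1: parameter is smaller than the null value
        return std_normal.cdf(z)
    if tail == "right":   # H1: parameter is larger than the null value
        return 1 - std_normal.cdf(z)
    if tail == "two":     # H1: parameter differs in either direction
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right', or 'two'")


z = -1.8  # hypothetical test statistic
p_left = p_value_from_z(z, "left")
p_right = p_value_from_z(z, "right")
p_two = p_value_from_z(z, "two")
print(f"left: {p_left:.4f}  right: {p_right:.4f}  two-tailed: {p_two:.4f}")
```

With this statistic, the left-tailed p value falls below 0.05 while the two-tailed p value does not, which is why the direction of the hypothesis has to be fixed before the data are seen.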

Types of Statistical Tests in NHST

Choosing the wrong test is one of the most common errors in applied research. The decision depends on the type of data (continuous vs. categorical), the number of groups, and whether population variance is known.

Test | Data Type | Groups | When to Use | Example
Z-test | Continuous | 1 or 2 | Large sample (n > 30), known population SD | Comparing national exam mean to a known standard
One-sample t-test | Continuous | 1 | Small sample, unknown population SD | Testing if a clinic’s mean wait time differs from 30 min
Independent samples t-test | Continuous | 2 | Comparing means of two unrelated groups | Depression scores in therapy group vs. control
Paired t-test | Continuous | 2 (related) | Same subjects measured twice | Blood pressure before vs. after drug
Chi-square test | Categorical | 2+ | Association between categorical variables | Gender vs. vaccine hesitancy (Yes/No)
ANOVA | Continuous | 3+ | Comparing means of ≥3 groups | Anxiety scores across 3 therapy modalities
One-tailed tests | Any | Any | Directional hypothesis is pre-specified | New drug expected to reduce tumour size

Step-by-Step Guide to Hypothesis Testing

Every NHST follows the same logical structure. Below is the canonical seven-step procedure:

  • Step 1: State the hypotheses. Define H₀ and H₁ in precise, testable terms before looking at the data.
  • Step 2: Choose the significance level (α). Pre-specify α, usually 0.05. Changing it after seeing results invalidates the test.
  • Step 3: Select the appropriate statistical test. Match the test to your data structure (see table above).
  • Step 4: Collect and organize the data. Gather a representative sample. Poor data quality produces misleading p values regardless of the test.
  • Step 5: Compute the test statistic. Calculate how far your sample result lies from what H₀ predicts, in units of standard error.
  • Step 6: Determine the p value and make a decision. If p-value ≤ α → reject H₀. If p-value > α → insufficient evidence to reject H₀, which is not proof that H₀ is true.  
  • Step 7: Interpret results in plain language. Report the effect size, direction of difference, and p value. State the conclusion in the context of the original research question.
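The seven steps can be traced in a few lines of Python. This sketch reuses the website-visitors claim (mean of 50 per day); the sample summary (36 days, mean 53.5) and the known population SD of 10 are hypothetical numbers chosen only to make the arithmetic easy to follow, and a z-test is used for simplicity.

```python
from statistics import NormalDist

# Step 1: hypotheses. H0: mean daily visitors = 50; H1: mean differs from 50.
NULL_MEAN = 50.0

# Step 2: significance level, fixed before looking at the data.
ALPHA = 0.05

# Step 3: test choice. A two-tailed one-sample z-test, which assumes the
# population SD is known (an assumption made here for simplicity).
POPULATION_SD = 10.0  # hypothetical

# Step 4: data. Hypothetical summary of 36 days of visitor counts.
n_days = 36
sample_mean = 53.5

# Step 5: test statistic, in units of standard error.
standard_error = POPULATION_SD / n_days ** 0.5
z = (sample_mean - NULL_MEAN) / standard_error

# Step 6: two-tailed p value and decision.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_null = p_value <= ALPHA

# Step 7: plain-language report with effect size and direction.
decision = "reject H0" if reject_null else "fail to reject H0"
print(
    f"Mean was {sample_mean - NULL_MEAN:+.1f} visitors/day relative to the claim "
    f"(z = {z:.2f}, p = {p_value:.4f}): {decision}."
)
```

If the population SD were unknown, as is usual in practice, a one-sample t-test would be the matching choice from the table of tests.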

Worked Examples

Biomedical Example: Does a New Drug Lower Blood Pressure?

A pharmaceutical team recruits 10 hypertensive patients and measures systolic blood pressure before and after a 4-week course of a new antihypertensive.

  • H₀: The drug has no effect on blood pressure (mean difference = 0)
  • H₁: The drug reduces blood pressure (mean difference < 0)
  • Test: Paired t-test (same patients measured twice)

Using a paired t-test, with before-treatment values averaging around 122 mmHg and after-treatment values around 117 mmHg, and taking each difference as after minus before, the t-statistic is approximately -9. With degrees of freedom = 9, the two-sided p-value is approximately 0.0000085 (about 0.0000043 for the one-sided alternative stated above): far below the significance threshold of 0.05. The researchers reject the null hypothesis. There is statistically significant evidence that mean systolic blood pressure was lower after treatment than before.
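The p value for this example can be checked from the reported t statistic and degrees of freedom alone. The sketch below is a standard-library approximation, not the method a statistics package would use: it integrates the Student t density numerically with Simpson's rule, and the function name `t_upper_tail` is invented for this illustration. It suggests that a figure of roughly 0.0000085 corresponds to the two-sided tail area, with the one-sided value being half of that.

```python
import math


def t_upper_tail(t, df, n_intervals=2000):
    """Approximate P(T >= t) for Student's t with df degrees of freedom (t > 0).

    Substituting u = 1/x maps the infinite tail [t, inf) onto [0, 1/t],
    which Simpson's rule can integrate directly.
    """
    if t <= 0:
        raise ValueError("this sketch only handles t > 0")
    const = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )

    def integrand(u):
        # t density at x = 1/u multiplied by |dx/du| = 1/u**2, simplified
        # algebraically so that it stays finite at u = 0.
        return (
            const
            * df ** ((df + 1) / 2)
            * u ** (df - 1)
            / (df * u * u + 1) ** ((df + 1) / 2)
        )

    upper = 1 / t
    h = upper / n_intervals  # n_intervals must be even for Simpson's rule
    total = integrand(0.0) + integrand(upper)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return total * h / 3


# By symmetry, P(T <= -9) equals P(T >= 9).
p_one_sided = t_upper_tail(9.0, df=9)
p_two_sided = 2 * p_one_sided
print(f"one-sided p = {p_one_sided:.2e}, two-sided p = {p_two_sided:.2e}")
```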

Social Science Example: Gender and Voting Preferences

A political scientist wants to know whether gender and voting preference (Candidate A vs. Candidate B) are related in a random sample of 400 voters.

  • H₀: Gender and voting preference are independent
  • H₁: Gender and voting preference are associated
  • Test: Chi-square test of independence

Gender | Votes for A | Votes for B | Total
Men | 95 | 105 | 200
Women | 130 | 70 | 200
Total | 225 | 175 | 400

For this table, the chi-square statistic (without a continuity correction) is about 12.44 with 1 degree of freedom, giving p ≈ 0.0004 < 0.05, so H₀ is rejected. The researcher concludes there is a statistically significant association between gender and candidate preference in this sample.
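The chi-square statistic for the voting table can be computed directly from the four observed counts. The sketch below uses only the Python standard library and applies no continuity correction; some software applies one by default for 2×2 tables, which would give a slightly larger p value. For these counts it yields a statistic near 12.4 and a p value near 0.0004.

```python
import math

# Observed counts from the table: rows are men, women; columns are A, B.
observed = [[95, 105], [130, 70]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # Expected count if gender and preference were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

# With 1 degree of freedom, chi-square is the square of a standard normal,
# so P(chi2 >= x) = erfc(sqrt(x / 2)). This shortcut is valid only for df = 1.
p_value = math.erfc(math.sqrt(chi_square / 2))
print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}, p = {p_value:.5f}")
```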

Social Science Example: Comparing Anxiety Across Three Therapy Modalities

A clinical psychologist recruits 90 patients with generalized anxiety disorder and randomly assigns them to cognitive-behavioural therapy (CBT), mindfulness-based therapy (MBT), or a waitlist control. Post-treatment anxiety scores are compared.

  • H₀: Mean anxiety scores are equal across all three groups (μ₁ = μ₂ = μ₃)
  • H₁: At least one group mean differs
  • Test: One-way ANOVA (three groups, continuous outcome)

If F(2, 87) = 8.4, then p ≈ 0.0005 < 0.05 and H₀ is rejected. Post-hoc tests (e.g., Tukey’s HSD) then identify which pairs of groups differ significantly.
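The degrees of freedom and the p value in this example can be reproduced with a short sketch. It relies on a closed-form tail probability that holds only when the numerator degrees of freedom equal 2, as they do with three groups; for other designs a statistics library would be needed. The function name `f_upper_tail_df1_2` is made up for this illustration.

```python
def f_upper_tail_df1_2(f_stat, df2):
    """P(F >= f_stat) for an F distribution with (2, df2) degrees of freedom.

    Valid only for numerator df = 2, where the tail has a closed form.
    """
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


n_total, n_groups = 90, 3
df_between = n_groups - 1       # 2
df_within = n_total - n_groups  # 87

p_value = f_upper_tail_df1_2(8.4, df_within)
print(f"F({df_between}, {df_within}) = 8.4, p = {p_value:.5f}")
```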

Biomedical Example: Exact Binomial Test for Treatment Efficacy

Suppose a treatment has an expected success rate of 0.25. A researcher claims she has a new treatment with improved efficacy and tests it in 3 patients. If all 3 patients respond, P = Pr(X ≥ 3) = Pr(X = 3) = 0.25³ ≈ 0.0156. This would be rare if the true success rate were only 25%, so at α = 0.05 the evidence against H₀ is deemed significant. If only 2 of 3 respond, P = Pr(X ≥ 2) = 0.1406 + 0.0156 = 0.1562. This observation is not unusual under H₀, so the evidence is deemed non-significant.
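The two tail probabilities in this example follow from the binomial formula and can be verified in a few lines. The helper name `binomial_upper_tail` is an illustrative choice.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) when X counts successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


NULL_SUCCESS_RATE = 0.25  # success rate under H0
N_PATIENTS = 3

p_all_three = binomial_upper_tail(3, N_PATIENTS, NULL_SUCCESS_RATE)
p_two_or_more = binomial_upper_tail(2, N_PATIENTS, NULL_SUCCESS_RATE)
print(f"3 of 3 respond: p = {p_all_three:.4f}")
print(f"at least 2 of 3 respond: p = {p_two_or_more:.4f}")
```

The p value sums the probability of the observed count and of every more extreme count, which is why the 2-of-3 case adds the 3-of-3 probability as well.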

What are Type I and Type II Errors?

Every NHST decision carries two possible error types. Understanding them is essential for designing studies and interpreting results responsibly.

A Type I error occurs when we reject the null hypothesis although that hypothesis was true. A Type II error occurs when we fail to reject the null hypothesis even though it is false.  

Decision | H₀ is Actually True | H₀ is Actually False
Reject H₀ | ❌ Type I Error (False Positive): rate = α | ✅ Correct (True Positive)
Fail to Reject H₀ | ✅ Correct (True Negative) | ❌ Type II Error (False Negative): rate = β

  • Type I error (α): Concluding a new antidepressant works when it actually doesn’t, leading to unnecessary prescriptions and costs.
  • Type II error (β): Concluding a drug doesn’t work when it actually does, a missed therapeutic opportunity.
  • Statistical power (1 − β): The probability of correctly detecting a real effect. Studies are typically planned for power of 0.80, meaning researchers accept a 20% chance of missing a real effect of the assumed size.

The trade-off: For a fixed sample size and effect size, lowering α to reduce false positives increases β (more false negatives), and vice versa. The most direct way to reduce both simultaneously is to increase the sample size.
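The trade-off between α and β can be seen numerically. The sketch below computes the power of a one-sided, one-sample z-test; the effect (a 5-unit shift), the SD of 15, and the sample sizes are hypothetical values, and `power_one_sided_z` is a name invented for this illustration.

```python
from statistics import NormalDist


def power_one_sided_z(effect, sd, n, alpha):
    """Power of a one-sided, one-sample z-test to detect a true shift `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)  # rejection threshold under H0
    shift = effect * n ** 0.5 / sd          # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)


# Hypothetical study: true effect of 5 units, SD of 15.
power_baseline = power_one_sided_z(effect=5, sd=15, n=50, alpha=0.05)
power_strict_alpha = power_one_sided_z(effect=5, sd=15, n=50, alpha=0.01)
power_strict_alpha_bigger_n = power_one_sided_z(effect=5, sd=15, n=100, alpha=0.01)

for label, power in [
    ("alpha=0.05, n=50", power_baseline),
    ("alpha=0.01, n=50", power_strict_alpha),
    ("alpha=0.01, n=100", power_strict_alpha_bigger_n),
]:
    print(f"{label}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Under these assumptions, tightening α from 0.05 to 0.01 drops power from about 0.76 to about 0.51, and doubling the sample size brings it back up to about 0.84.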

Common Misconceptions About P Values

The interpretation of p values is a minefield. Even the man who introduced them as a formal research tool, the statistician and geneticist R.A. Fisher, could not explain exactly their inferential meaning. He proposed a rather informal system for using them, but he never described straightforwardly what they meant from an inferential standpoint.

Here are the most dangerous misconceptions, with corrections:

Misconception | What People Think | What Is Actually True
“p < 0.05 means the result is important” | A small p means a big, important effect | P values say nothing about effect size or practical importance
“p = 0.04 proves the alternative hypothesis” | H₀ is false; H₁ is true | We only conclude the data are unlikely under H₀; we don’t confirm H₁
“p = 0.06 means no effect exists” | Failing to reject H₀ proves it is true | Absence of evidence ≠ evidence of absence
“p is the probability H₀ is true” | p = P(H₀ is true | data) | p = P(data this extreme | H₀ is true): very different
“p < 0.05 is always the right threshold” | 0.05 is a universal law of nature | α is a convention; the right threshold depends on the stakes
“Replication is guaranteed by a small p” | The finding will reappear in future studies | A single p value makes no guarantee about reproducibility

Importance of Hypothesis Testing in Research

Hypothesis testing provides a structured, unbiased way to evaluate claims rather than relying solely on assumptions or intuition. It helps compare groups, treatments, or strategies to determine whether the differences between them are statistically meaningful. It does not eliminate uncertainty, but it helps measure and manage it using tools such as significance levels, p-values, and error rates.  

Its importance across disciplines is wide-ranging:

  • Clinical medicine: Determining whether a new drug, surgical procedure, or public health intervention actually improves outcomes before widespread adoption.
  • Public health: Testing whether a vaccination campaign reduces disease incidence; whether air pollution exposure is associated with respiratory illness rates.
  • Sociology and psychology: Examining whether socioeconomic status predicts educational attainment; whether implicit bias training changes hiring decisions.
  • Epidemiology: Evaluating risk factors for non-communicable diseases (e.g., does smoking independently predict lung cancer after controlling for confounders?).
  • Policy research: Measuring whether a minimum wage increase affects employment rates in affected regions.

Limitations of Hypothesis Testing

NHST has attracted intense criticism over the past few decades, especially in light of the replication crisis in psychology and biomedicine. Key limitations include:

  • Binary thinking: Forcing a rich continuum of evidence into “significant” or “not significant” loses information and encourages all-or-nothing interpretation.
  • P-hacking and researcher degrees of freedom: Flexible data collection, analysis choices, and selective reporting inflate the false-positive rate far above the nominal α.
  • The file-drawer problem and publication bias: Studies that fail to reject H₀ are less likely to be published, biasing the published literature toward positive findings.
  • Conflation of statistical and practical significance: A study of 100,000 patients may find that a drug lowers blood pressure by 0.5 mmHg with p < 0.0001: statistically overwhelming, clinically irrelevant.
  • Data quality dependence: The accuracy of the results depends on the quality of the data. Poor-quality or inaccurate data can lead to incorrect conclusions.  
  • Context limitations: Hypothesis testing doesn’t always consider the bigger picture, which can oversimplify results and lead to incomplete insights.  
  • Assumption violations: Many standard parametric tests assume approximately normally distributed data, independent observations, and equal variances across groups. Violations can distort p values.
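The gap between statistical and practical significance in the 100,000-patient example can be illustrated numerically. In the sketch below, the split into two arms of 50,000 patients and the SD of 15 mmHg are assumptions added for the illustration; only the 0.5 mmHg difference comes from the example in the list.

```python
from statistics import NormalDist

n_per_arm = 50_000     # assumed: 100,000 patients split evenly into two arms
mean_difference = 0.5  # mmHg, from the example
sd = 15.0              # mmHg, assumed common SD of blood pressure

# Two-sample z-test for a difference in means with equal arm sizes.
standard_error = sd * (2 / n_per_arm) ** 0.5
z = mean_difference / standard_error
p_value = 2 * (1 - NormalDist().cdf(z))

# Standardised effect size (Cohen's d): difference in SD units.
cohens_d = mean_difference / sd

print(f"z = {z:.2f}, p = {p_value:.1e}, Cohen's d = {cohens_d:.3f}")
```

Under these assumptions the p value is far below 0.0001, while the standardised effect is a small fraction of the conventional "small" benchmark of d = 0.2.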

Best practices to mitigate these limitations:

  • Pre-register hypotheses and analysis plans (e.g., on OSF or ClinicalTrials.gov)
  • Report effect sizes and confidence intervals alongside p values
  • Use sufficiently powered studies (plan for ≥80% power)
  • Replicate findings before drawing firm conclusions
  • Consider Bayesian approaches or equivalence testing where appropriate

Key Takeaways

  • NHST is inferential: It generalizes from a sample to a population with a calculated degree of uncertainty; it never proves anything with certainty.
  • H₀ is the default: It states “no effect, no difference, no relationship.” Researchers try to accumulate evidence against it.
  • The p value is not what most people think: It is the probability of observing data this extreme assuming H₀ is true, not the probability that H₀ is true.
  • α is a pre-specified threshold: Commonly 0.05, but context-dependent. Lower α → fewer false positives, more false negatives.
  • Two error types exist: Type I (false positive, rate = α) and Type II (false negative, rate = β). They trade off against each other.
  • Test selection matters: Z-test, t-test, chi-square, ANOVA, and others are designed for specific data structures. Using the wrong test produces invalid results.
  • Statistical significance ≠ practical importance: A tiny p value in a large study may reflect a trivial effect.
  • “Fail to reject H₀” ≠ “H₀ is true”: Non-significant results often come from underpowered studies; they are not proof of no effect.
  • Replication and pre-registration are essential complements to a single p value.
  • Effect sizes and confidence intervals should always accompany p values for a complete picture.

Frequently Asked Questions (FAQs)

What is statistical power, and why does it matter?

Statistical power (1 − β) is the probability that a test will correctly detect a true effect when one exists. A study with 50% power has only a coin-flip chance of finding a real effect. Low power wastes resources and produces unreliable findings. Power depends on sample size, effect size, and α. Most disciplines target at least 80% power during study design, requiring formal power calculations before data collection.
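A formal power calculation can be as simple as the sketch below, which estimates the sample size per group for a two-sided comparison of two independent means. It uses the normal approximation, so dedicated software based on the t distribution will typically return slightly larger numbers; the function name `n_per_group` is an illustrative choice.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group to detect a standardised difference d.

    Normal-approximation formula for a two-sided, two-sample comparison:
    n = 2 * (z_{1 - alpha/2} + z_{power})**2 / d**2, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Smaller effects demand far larger samples, which is why the expected effect size has to be justified before data collection rather than after.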

How is a confidence interval related to a hypothesis test?

A 95% confidence interval (CI) and a two-tailed test at α = 0.05 convey equivalent information: if the CI excludes the null value (e.g., zero for a mean difference), the corresponding p value will be below 0.05. CIs are often preferred because they communicate both the direction and the magnitude of the effect, not just whether it passed a threshold. Reporting both the p value and the CI is considered best practice.
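The correspondence between a confidence interval and a two-tailed test can be sketched with a z-based calculation. The estimates and standard error below are hypothetical, and `z_test_and_ci` is a name made up for this illustration.

```python
from statistics import NormalDist


def z_test_and_ci(estimate, standard_error, null_value=0.0, alpha=0.05):
    """Return the two-tailed p value and the (1 - alpha) confidence interval."""
    std_normal = NormalDist()
    z = (estimate - null_value) / standard_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    margin = std_normal.inv_cdf(1 - alpha / 2) * standard_error
    return p_value, (estimate - margin, estimate + margin)


# Two hypothetical mean differences with the same standard error.
for estimate in (-4.0, -2.0):
    p_value, (low, high) = z_test_and_ci(estimate, standard_error=1.5)
    excludes_zero = not (low <= 0.0 <= high)
    print(
        f"estimate = {estimate}: p = {p_value:.3f}, "
        f"95% CI = ({low:.2f}, {high:.2f}), excludes 0: {excludes_zero}"
    )
```

In both cases the interval excludes zero exactly when p falls below 0.05, but the interval also shows how large or small the effect could plausibly be.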

What does it mean to “pre-register” a study?

Pre-registration means publicly documenting your hypotheses, data collection plan, and analysis strategy before collecting data, typically through platforms like ClinicalTrials.gov (biomedical) or the Open Science Framework (social sciences). This prevents researchers from unconsciously adjusting their hypotheses or analysis methods after seeing results (HARKing: Hypothesising After Results are Known), which inflates the false-positive rate and undermines reproducibility.

When should I use a non-parametric test instead of a t-test or ANOVA?

Parametric tests like t-tests and ANOVA assume the data are approximately normally distributed. When sample sizes are small and data are strongly skewed, heavily bounded (e.g., Likert scales with small n), or contain extreme outliers, non-parametric alternatives are more appropriate. Common examples include the Mann-Whitney U test (instead of independent t-test), Wilcoxon signed-rank test (instead of paired t-test), and Kruskal-Wallis test (instead of one-way ANOVA). Non-parametric tests sacrifice some statistical power in exchange for fewer distributional assumptions.

What is effect size, and which measures are commonly reported?

Effect size quantifies the magnitude of a difference or association, independent of sample size. Common measures include Cohen’s d (standardised mean difference; d = 0.2 small, 0.5 medium, 0.8 large), Pearson’s r (correlation), η² (eta-squared, for ANOVA), and odds ratios or relative risks (for categorical outcomes in clinical research). Reporting effect sizes alongside p values allows readers to judge whether a statistically significant finding is also practically or clinically meaningful.
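As an illustration of one of these measures, the sketch below computes Cohen's d for two independent groups using the pooled standard deviation. The scores are hypothetical and the function name `cohens_d` is an illustrative choice.

```python
from statistics import fmean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference (a minus b) using the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_variance = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (fmean(group_a) - fmean(group_b)) / pooled_variance ** 0.5


# Hypothetical post-treatment anxiety scores (lower is better).
therapy = [12, 15, 11, 14, 13]
control = [16, 18, 15, 17, 19]

d = cohens_d(therapy, control)
print(f"Cohen's d = {d:.2f}")
```

A negative value here means the therapy group scored lower than the control group; the magnitude is read against the small, medium, and large benchmarks.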

What is the difference between a one-sample and two-sample test?

A one-sample test compares a single group’s mean (or proportion) to a known or hypothesised population value. Example: testing whether the mean birth weight in a hospital differs from the national standard of 3.2 kg.

A two-sample test compares the means (or proportions) of two independent groups. Example: testing whether mean depression scores differ between patients receiving CBT and those receiving pharmacotherapy. When the two sets of measurements come from the same individuals at different times (e.g., pre- and post-intervention), a paired test is used instead.

Why has NHST been criticised, and what are the proposed alternatives?

Critics argue that the rigid p < 0.05 threshold encourages dichotomous thinking, incentivises p-hacking, and obscures effect sizes. The American Statistical Association issued a statement on p values in 2016, and a 2019 editorial in its journal The American Statistician urged researchers to move beyond “statistically significant.” Proposed alternatives and complements include:

  • Bayesian inference (expressing results as updated probability distributions),
  • estimation-based approaches (reporting effect sizes and CIs without binary cutoffs),
  • equivalence testing (demonstrating that an effect is small enough to be unimportant),
  • and false discovery rate (FDR) control in large-scale genomic or neuroimaging studies.

Many journals now require effect sizes and CIs in addition to p values.

Can hypothesis testing be used with observational data, or only with experiments?

NHST applies to both experimental and observational data, but the conclusions that can be drawn differ. Randomised controlled trials (RCTs) allow causal inference: if the test rejects H₀, the intervention is likely the cause. In observational studies (e.g., survey data, cohort studies), NHST can detect associations but cannot establish causation because of potential confounding. A statistically significant association between coffee consumption and reduced Parkinson’s disease risk, for instance, does not by itself prove that coffee is protective because unmeasured lifestyle confounders may explain the association.
