What is a P Value in Statistics and Hypothesis Testing? Definition, Calculations, Reporting Tips

Key Takeaways

  • What it measures: A p value is the probability of seeing results at least as extreme as yours if the null hypothesis were true; it is not the probability that your hypothesis is correct.
  • Significance is not importance: A statistically significant result, commonly p less than 0.05, does not automatically mean the finding is meaningful in practice. Effect size and confidence intervals reveal whether a result matters.
  • Two ways to be wrong: Type I errors (false positives) and Type II errors (false negatives) represent the two ways a hypothesis test can fail, and researchers manage the tradeoff between them through sample size, alpha level, and statistical power.
  • Reporting rules vary: APA and AMA style guides have specific rules for formatting p values, while MLA and Chicago defer to discipline conventions; following the correct one improves the credibility of published research.

Glossary of Key Terms

The table below defines the core terms used throughout this guide, so you can refer back to them as needed.

| Term | Definition |
| --- | --- |
| P value | The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. |
| Null hypothesis (H0) | The default claim that there is no effect or no difference between groups. |
| Alternative hypothesis (Ha) | The claim that there is an effect or a difference, which the researcher is testing for. |
| Alpha level | The threshold, commonly 0.05, below which a p value is considered statistically significant. |
| Statistical significance | A label applied when a p value falls below the chosen alpha level, indicating the observed result would be unlikely if the null hypothesis were true. |
| Effect size | A standardized measure of the magnitude of a difference or relationship, independent of sample size. |
| Type I error | Rejecting a true null hypothesis; a false positive finding. |
| Type II error | Failing to reject a false null hypothesis; a false negative finding. |
| Statistical power | The probability that a test correctly detects an effect when one truly exists. |
| Confidence interval | A range of plausible values for a population parameter, calculated from sample data. |
| Test statistic | A standardized value, such as t, z, F, or chi square, calculated from sample data during a hypothesis test. |
| Inferential statistics | The branch of statistics concerned with drawing conclusions about a population from a sample. |

What Is a P Value?

A p value is the probability of obtaining results at least as extreme as the observed results, assuming the null hypothesis is true. It does not tell you the probability that your hypothesis is correct.

The p value comes from inferential statistics, the branch of statistics used to draw conclusions about a larger population based on a sample of data. When researchers run a hypothesis test, they calculate a test statistic from their sample and then determine how likely it would be to see a result that extreme, or more extreme, purely by chance if there were truly no effect in the population.

A small p value suggests the observed data would be unusual under the null hypothesis, which leads researchers to reject that null hypothesis in favor of the alternative. A large p value suggests the observed data are consistent with the null hypothesis, so there is not enough evidence to reject it.

  • It is a conditional probability: calculated assuming the null hypothesis is true, not a statement about how likely the null or alternative hypothesis is.
  • It ranges from 0 to 1: values closer to 0 indicate stronger evidence against the null hypothesis.
  • It depends on sample size: larger samples can produce small p values even for tiny, practically unimportant effects.
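To make the definition concrete, here is a minimal sketch that computes a one-sided p value exactly for a coin-flip experiment. The data (8 heads in 10 flips) and the function name `binomial_p_value` are hypothetical, chosen only for illustration; the null hypothesis is that the coin is fair.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical data: 8 heads in 10 flips of a coin assumed fair under H0.
p = binomial_p_value(8, 10)
print(f"p = {p:.4f}")  # p = 0.0547
```

The result is the probability of 8, 9, or 10 heads from a fair coin. It describes the data under the null hypothesis and says nothing about the probability that the coin is actually fair.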

Understanding Hypothesis Testing

Hypothesis testing is the formal procedure researchers use to decide whether sample data provide enough evidence to support a claim about a population. The p value is the key output of this procedure, but it only makes sense within the broader testing framework.

What Is the Difference Between the Null and Alternative Hypothesis?

The null hypothesis (H0) states there is no effect or no difference, while the alternative hypothesis (Ha) states that an effect or difference does exist. A hypothesis test is designed to weigh evidence against the null.

Researchers never prove the null hypothesis true; they either reject it in favor of the alternative or fail to reject it because there is insufficient evidence. This distinction matters: failing to reject H0 is not the same as proving H0 is correct.

Steps in a Hypothesis Test

  1. State the null and alternative hypotheses clearly, based on the research question.
  2. Choose an alpha level, commonly 0.05, that defines how much risk of a false positive is acceptable.
  3. Select an appropriate statistical test, such as a t test, chi square test, or analysis of variance.
  4. Collect data and calculate the test statistic from the sample.
  5. Determine the p value associated with that test statistic.
  6. Compare the p value to the alpha level and decide whether to reject the null hypothesis.
  7. Report the result along with the effect size and confidence interval for full context.
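The steps above can be sketched in code. This example assumes a two-sided one-sample z test with a known population standard deviation, which is a simplification (a t test is more common when the standard deviation is estimated from the sample). The scores, the null mean of 100, and the standard deviation of 15 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu_null, sigma, alpha=0.05):
    """Two-sided one-sample z test, assuming the population SD (sigma) is known."""
    n = len(sample)
    sample_mean = mean(sample)
    se = sigma / sqrt(n)                                # standard error of the mean
    z = (sample_mean - mu_null) / se                    # step 4: test statistic
    p_value = 2 * NormalDist().cdf(-abs(z))             # step 5: two-sided p value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)
    return {
        "z": z,
        "p_value": p_value,
        "reject_null": p_value < alpha,                    # step 6: decision
        "effect_size_d": (sample_mean - mu_null) / sigma,  # step 7: effect size
        "ci": ci,                                          # step 7: confidence interval
    }

# Steps 1-3: H0 says the population mean is 100; alpha = 0.05; z test chosen.
scores = [108, 112, 97, 105, 119, 101, 110, 104, 99, 115]  # hypothetical data
result = one_sample_z_test(scores, mu_null=100, sigma=15)
for name, value in result.items():
    print(name, value)
```

With these made-up numbers the sample mean is 107, yet the test does not reject the null hypothesis at alpha = 0.05, which illustrates that a visible difference in a small sample is not necessarily statistically significant.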

How Is a P Value Calculated?

A p value is calculated by comparing an observed test statistic to a known probability distribution that describes what results would look like if the null hypothesis were true. The exact formula depends on the statistical test being used.

In general terms, the process works like this: the sample data are summarized into a single test statistic, that statistic is located on the relevant probability distribution such as the t distribution, z distribution, F distribution, or chi square distribution, and the p value is the area under that distribution beyond the observed value.

  • T test: used to compare means between one or two groups; produces a t statistic evaluated on the t distribution.
  • Chi square test: used for categorical data and tests of independence; produces a chi square statistic.
  • Analysis of variance (ANOVA): used to compare means across three or more groups; produces an F statistic.
  • Correlation and regression tests: used to assess relationships between variables; often report a p value alongside r or beta coefficients.

In practice, statistical software calculates the p value automatically once the test and data are specified, so researchers rarely compute it by hand.
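To show what "the area beyond the observed value" means without a distribution formula, the sketch below uses an exact permutation test. This is a different technique from the t, chi square, and F tests listed above: it builds the null distribution directly by enumerating every way the pooled observations could be split into two groups. The data and the function name are hypothetical.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p_value(group_a, group_b):
    """Two-sided p value for a difference in means via full enumeration."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Under H0 the group labels are arbitrary, so every split is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        if diff >= observed - 1e-12:  # small tolerance for float comparison
            extreme += 1
    return extreme / count

treatment = [12, 14, 15, 17]  # hypothetical measurements
control = [8, 9, 10, 11]
p = exact_permutation_p_value(treatment, control)
print(f"p = {p:.4f}")  # p = 0.0286
```

Only 2 of the 70 possible splits produce a difference as large as the observed one, so the p value is 2/70. Full enumeration is practical only for small samples; statistical software typically relies on a theoretical distribution or random resampling instead.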

Interpreting P Value Results

Interpreting a p value correctly means understanding what threshold was chosen and what that threshold does and does not imply about the underlying research question.

Common Significance Thresholds

| P Value Range | Common Interpretation |
| --- | --- |
| p less than 0.001 | Very strong evidence against the null hypothesis. |
| p less than 0.01 | Strong evidence against the null hypothesis. |
| p less than 0.05 | Moderate evidence against the null hypothesis; the conventional significance threshold. |
| p between 0.05 and 0.10 | Weak or marginal evidence; sometimes reported as a trend (though this is not advisable). |
| p greater than 0.10 | Little to no evidence against the null hypothesis. |

What a Small P Value Means

A small p value, typically below the chosen alpha level, indicates that the observed data would be unlikely if the null hypothesis were true. This leads researchers to reject the null hypothesis and treat the result as statistically significant.

What Does a P Value Greater Than 0.05 Mean?

A p value above 0.05 means the observed data are reasonably consistent with the null hypothesis, so the researcher fails to reject it. This does not prove there is no effect; it only means the study did not find strong enough evidence of one.

Why Is Effect Size Important Alongside a P Value?

Effect size matters because it shows the actual magnitude of a difference or relationship, while a p value only indicates how surprising that difference would be if the null hypothesis were true. A tiny, unimportant effect can still produce a very small p value in a large enough sample.

Reporting effect size alongside a p value allows readers to judge practical or clinical importance, not just statistical significance. Two studies can share the same p value yet describe effects of very different real-world magnitude.
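The interaction between sample size and p value can be illustrated numerically. The sketch below assumes a two-sample z test with equal group sizes and a known standard deviation of 1, so the observed difference is already a standardized effect size. The effect of 0.02 and the sample sizes are hypothetical.

```python
from math import erfc, sqrt

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p value for a z statistic, equal to 2 * (1 - Phi(|z|))."""
    return erfc(abs(z) / sqrt(2))

def p_for_observed_difference(d: float, n_per_group: int) -> float:
    """p value when two equal-sized groups differ by d standard deviations."""
    z = d / sqrt(2 / n_per_group)  # standard error of the difference is sqrt(2/n)
    return two_sided_p_from_z(z)

# The same tiny effect (d = 0.02) at three hypothetical sample sizes.
for n in (100, 10_000, 100_000):
    print(f"n per group = {n:>7}: p = {p_for_observed_difference(0.02, n):.6f}")
```

The effect size never changes, yet the p value moves from clearly non-significant to far below 0.001 as the groups grow. The p value alone cannot tell a reader that the difference is only one fiftieth of a standard deviation.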

| Effect Size Measure | Typical Use | Rough Guide |
| --- | --- | --- |
| Cohen’s d | Comparing two group means | 0.2 small, 0.5 medium, 0.8 large |
| Pearson’s r | Strength of a linear relationship | 0.1 small, 0.3 medium, 0.5 large |
| Odds ratio | Comparing odds between two groups | Closer to 1 indicates a weaker association |
| Eta squared | Variance explained in ANOVA designs | 0.01 small, 0.06 medium, 0.14 large |
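As an illustration of the first row, Cohen’s d can be computed as the difference in group means divided by the pooled standard deviation. This is a minimal sketch of that common formulation; the two groups are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

group_a = [5, 6, 7, 8, 9]  # hypothetical scores
group_b = [4, 5, 6, 7, 8]
print(f"d = {cohens_d(group_a, group_b):.2f}")  # d = 0.63
```

By the rough guide in the table, a value of about 0.63 would fall between medium and large.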

Type I and Type II Errors

Every hypothesis test carries some risk of reaching the wrong conclusion. Statisticians describe these risks as Type I and Type II errors, and understanding both is essential for designing a sound study.

| Aspect | Type I Error | Type II Error |
| --- | --- | --- |
| Also known as | False positive | False negative |
| What happens | Rejecting a true null hypothesis | Failing to reject a false null hypothesis |
| Controlled by | The alpha level, commonly set at 0.05 | Statistical power, related to sample size and effect size |
| Typical symbol | Alpha | Beta |
| Example | Concluding a drug works when it actually does not | Concluding a drug does not work when it actually does |

Type I Error: The False Positive

A Type I error occurs when a researcher rejects a null hypothesis that is actually true, concluding an effect exists when it does not. The probability of this error is controlled directly by the alpha level chosen before the study begins.
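A short simulation can show how alpha controls the Type I error rate. The sketch below repeatedly draws samples from a population where the null hypothesis is true and counts how often a z test rejects it anyway. The number of simulated studies, the sample size, and the random seed are arbitrary choices, and the known standard deviation of 1 is an assumption.

```python
import random
from math import erfc, sqrt

def simulate_type_i_rate(n_studies=5000, n=30, alpha=0.05, seed=1):
    """Fraction of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # true mean is 0, so H0 holds
        z = (sum(sample) / n) / (1 / sqrt(n))         # z test with known SD of 1
        p_value = erfc(abs(z) / sqrt(2))              # two-sided p value
        if p_value < alpha:
            rejections += 1
    return rejections / n_studies

rate = simulate_type_i_rate()
print(f"False positive rate: {rate:.3f}")
```

The printed rate should land close to 0.05, because about 5 percent of studies conducted under a true null hypothesis cross the threshold by chance.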

Type II Error: The False Negative

A Type II error occurs when a researcher fails to reject a null hypothesis that is actually false, missing a real effect. The probability of this error, called beta, decreases as sample size increases. Statistical power equals 1 minus beta, so a lower beta means higher power.

Why Does Sample Size Affect Statistical Power?

Larger samples produce more precise estimates, which makes it easier to detect a true effect and reduces the chance of a Type II error. Underpowered studies with small samples often miss real effects entirely.
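The relationship can be sketched for a simple case. The function below computes the power of a two-sided one-sample z test for a given standardized effect size, assuming a known standard deviation; the effect size of 0.5 and the sample sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(effect_size_d, n, alpha=0.05):
    """Power of a two-sided one-sample z test for a standardized effect d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size_d * sqrt(n)  # how far the true effect moves the z statistic
    # Probability that the z statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (10, 32, 50):
    power = z_test_power(0.5, n)
    print(f"n = {n:>2}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Power climbs as n increases, and beta, the Type II error probability, falls by the same amount. When the effect size is zero the function returns alpha, since every rejection is then a false positive.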

Inferential Statistics and the Role of the P Value

Inferential statistics is the branch of statistics that uses sample data to make claims about a broader population, in contrast to descriptive statistics, which simply summarizes the data at hand. The p value is one of several inferential tools, alongside confidence intervals, standard errors, and effect size estimates.

Compared against alpha, a p value yields only a binary decision: reject or do not reject the null hypothesis. A confidence interval, by contrast, offers a range of plausible values for the true effect, which often communicates more useful information. Many methodologists now recommend reporting both together rather than relying on the p value in isolation.

  • Descriptive statistics: summarize a dataset using measures such as mean, median, and standard deviation.
  • Inferential statistics: use sample data, probability theory, and tools such as the p value and confidence interval to generalize beyond the sample.
  • Confidence interval: shows the precision of an estimate; a narrow interval indicates a more precise estimate than a wide one.
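A confidence interval for a mean can be sketched as the sample mean plus or minus a critical value times the standard error. The version below uses the normal critical value, which is an approximation best suited to larger samples (small samples usually call for a t critical value instead). The measurements are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z_crit * stdev(sample) / sqrt(len(sample))  # critical value x standard error
    center = mean(sample)
    return center - margin, center + margin

times = [412, 398, 430, 405, 421, 415, 409, 426]  # hypothetical reaction times (ms)
low, high = mean_confidence_interval(times)
print(f"95% CI: {low:.1f} to {high:.1f}")
```

Collecting more observations with similar spread shrinks the standard error, which narrows the interval and signals a more precise estimate.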

Common Misconceptions About P Values

Because the p value is widely reported yet often misunderstood, several persistent misconceptions continue to circulate in research and popular writing.

  • Misconception: a p value tells you the probability the null hypothesis is true. Reality: the p value assumes the null hypothesis is true and describes the data, not the hypothesis.
  • Misconception: a p value tells you the probability your findings are due to chance. Reality: it describes how surprising the data would be under the null hypothesis, which is a related but distinct idea.
  • Misconception: p less than 0.05 means the effect is important. Reality: statistical significance depends on sample size and says nothing about practical importance without effect size.
  • Misconception: p greater than 0.05 proves there is no effect. Reality: it only means the study did not detect strong enough evidence of an effect.
  • Misconception: a smaller p value always means a stronger or more real effect. Reality: p values are heavily influenced by sample size, so a smaller p value does not automatically indicate a larger or more important effect.

Does a Small P Value Prove the Alternative Hypothesis Is True?

No, a small p value does not prove the alternative hypothesis is true. It only indicates that the observed data would be unlikely under the null hypothesis, which supports, but never proves, the alternative.

How to Report P Values in Different Style Guides

Academic and scientific writing follows specific formatting conventions for reporting p values, and the correct format depends on the style guide required by the journal, publisher, or institution.

APA Style

The Publication Manual of the American Psychological Association recommends reporting exact p values to two or three decimal places, without a leading zero, and using italics for the letter p.

  • Format: p = .032, or p < .001 when the value is extremely small.
  • No leading zero: write .032, not 0.032, since p cannot exceed 1.
  • Italicize statistical symbols: p, t, F, and similar symbols are italicized in APA formatted manuscripts.
  • Report alongside test statistics: for example, t(48) = 2.10, p = .041.
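These APA rules are mechanical enough to sketch as a small helper. The function names below are hypothetical, and plain text cannot show the italics that APA requires for p and t, so those would still need to be applied in the manuscript.

```python
def format_p_apa(p: float, decimals: int = 3) -> str:
    """Format a p value APA style: no leading zero, and p < .001 for tiny values."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if p < 0.001:
        return "p < .001"
    text = f"{p:.{decimals}f}"
    if text.startswith("0"):
        text = text[1:]  # drop the leading zero: 0.032 becomes .032
    return f"p = {text}"

def format_t_result(t: float, df: int, p: float) -> str:
    """Combine a t statistic, its degrees of freedom, and the p value."""
    return f"t({df}) = {t:.2f}, {format_p_apa(p)}"

print(format_p_apa(0.032))              # p = .032
print(format_p_apa(0.00004))            # p < .001
print(format_t_result(2.10, 48, 0.041))  # t(48) = 2.10, p = .041
```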

AMA Style

The American Medical Association style, common in medical and health journals, reports exact p values rather than only significance thresholds when possible. Like APA, the AMA Manual of Style omits the leading zero, although individual journals sometimes set their own convention.

  • Format: P = .03, or P < .001 for very small values, following the specific journal’s convention.
  • Capitalization: AMA style commonly capitalizes the italic P, unlike APA style.
  • Precision: report exact values to two or three significant digits rather than only stating significant or not significant.

MLA and Chicago Style

MLA style is rarely used for quantitative research and offers no dedicated statistical reporting convention, so writers using MLA typically borrow APA conventions for numeric results. Chicago style similarly defers to discipline specific norms, most often following APA or AMA formatting for p values within scientific writing.

  • MLA guidance: present statistics in plain, readable sentences and follow APA numeric conventions when precision matters.
  • Chicago guidance: follow the notes and bibliography or author date system for citations, while using APA style numeric formatting for the p value itself.

Reporting P Values in Tables and Figures

Clear, consistent formatting in tables and figures helps readers interpret statistical results quickly and accurately.

  • Report exact values when possible: write p = .023 rather than only p < .05, unless the value is extremely small.
  • Use a consistent number of decimal places: typically two or three digits throughout a manuscript or report.
  • Use asterisks sparingly and define them: for example, one asterisk for p < .05, two for p < .01, three for p < .001, with a note below the table.
  • Never report p = .000: software rounding can produce this, but the correct format is p < .001.
  • Pair p values with effect sizes: include a column for effect size or confidence interval whenever space allows.
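The table conventions above can be sketched as a small row builder. The asterisk cutoffs follow the example in the list; the function names, row labels, effect sizes, and p values are all hypothetical.

```python
def significance_stars(p: float) -> str:
    """Map a p value to asterisks: * p < .05, ** p < .01, *** p < .001."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

def table_row(label: str, effect_size: float, p: float) -> str:
    """One table row pairing an effect size with a consistently formatted p value."""
    # Never print .000: fall back to "< .001" for very small values.
    p_text = "< .001" if p < 0.001 else f"{p:.3f}".lstrip("0")
    return f"{label:<12} d = {effect_size:.2f}   p {'' if p < 0.001 else '= '}{p_text}{significance_stars(p)}"

rows = [("Outcome A", 0.42, 0.023), ("Outcome B", 0.81, 0.0002), ("Outcome C", 0.10, 0.310)]
for label, d, p in rows:
    print(table_row(label, d, p))
print("Note. * p < .05, ** p < .01, *** p < .001")
```

The final line plays the role of the note below the table, so readers never have to guess what the asterisks mean.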

Best Practices for Using P Values in Research

  • Set the alpha level before collecting data: deciding on a threshold in advance avoids the temptation to adjust it after seeing results.
  • Report exact p values, not just significance labels: this allows readers to judge the strength of evidence themselves.
  • Always pair the p value with an effect size and confidence interval: statistical significance alone is an incomplete picture.
  • Avoid p hacking: running many tests and reporting only the significant ones inflates the true Type I error rate.
  • Preregister hypotheses when possible: specifying hypotheses and analysis plans before data collection improves the credibility of a p value.
  • Consider the sample size: interpret p values in light of how much statistical power the study actually had.

What Is P Hacking?

P hacking (also called data dredging or significance chasing) is the practice of manipulating data collection or analysis, whether deliberately or unknowingly, until a result crosses the p < 0.05 threshold. It turns a p value from an honest measure of evidence into a number that has been “fished for,” which makes the finding far less trustworthy than it appears.

How It Happens

P hacking rarely looks like outright fraud. It usually creeps in through small, seemingly reasonable decisions made during analysis:

  • Trying multiple outcome variables and reporting only the one that turned out significant, without disclosing the others.
  • Peeking at the data as it comes in and stopping data collection as soon as a result becomes significant.
  • Testing many subgroups (by age, gender, region, and so on) until one shows a significant effect.
  • Adding or removing outliers based on whether doing so improves the p value, rather than on a predefined rule.
  • Trying different statistical tests or covariates and reporting only the combination that worked.
  • Rounding or reporting p values loosely, for example rounding p = .054 down to .05, or presenting a borderline p = .049 as decisive without full context.

Each of these choices might seem harmless in isolation. The problem is cumulative: every additional test or comparison increases the chance that something will appear significant purely by chance, even if no real effect exists.

Why It Is a Problem

The alpha level of 0.05 assumes a single, prespecified test. Once a researcher runs many tests and reports only the significant ones, the true Type I error rate (the false positive rate) climbs well above 5 percent, sometimes far above it. This means p hacked results are more likely to be flukes that will not replicate in future studies, which is part of why psychology, medicine, and other fields have faced a well known replication crisis.
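The inflation is easy to quantify under a simplifying assumption. If every null hypothesis is true and the tests are independent, the chance of at least one false positive across m tests is 1 minus (1 minus alpha) raised to the power m. Real analyses are often correlated, so treat this sketch as an approximation.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** n_tests

for m in (1, 5, 20):
    print(f"{m:>2} tests: {familywise_error_rate(m):.3f}")
# Output:
#  1 tests: 0.050
#  5 tests: 0.226
# 20 tests: 0.642
```

With 20 independent tests, the chance that at least one comes out "significant" by accident is roughly 64 percent, far above the nominal 5 percent.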

How Researchers Guard Against It

  • Preregistration: specifying hypotheses, outcome measures, and analysis plans before collecting data, so choices cannot be adjusted after seeing results. Registered reports are a strong countermeasure against p hacking.
  • Correcting for multiple comparisons: using methods such as the Bonferroni correction to adjust the significance threshold when many tests are run.
  • Reporting all analyses: disclosing every test performed, not just the significant ones.
  • Separating exploratory from confirmatory analysis: clearly labeling which findings were hypothesized in advance versus discovered by exploring the data.
  • Focusing on effect size and replication: treating a single significant p value as a starting point for further testing, not final proof.
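As an illustration of the Bonferroni correction mentioned above, the sketch below divides alpha by the number of tests and also reports adjusted p values. The five p values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: compare each p value with alpha / number of tests."""
    m = len(p_values)
    threshold = alpha / m
    return [
        {
            "p": p,
            "adjusted_p": min(1.0, p * m),  # equivalent view: inflate p, keep alpha
            "significant": p < threshold,
        }
        for p in p_values
    ]

p_values = [0.004, 0.020, 0.030, 0.045, 0.200]  # hypothetical results of five tests
for row in bonferroni(p_values):
    print(row)
```

All but the last of these p values fall below 0.05 on their own, yet only 0.004 survives the corrected threshold of 0.01. Bonferroni is conservative, so some researchers prefer less strict alternatives when many tests are involved.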

Frequently Asked Questions

What Does a P Value of 0.05 Mean in Statistics?

A p value of 0.05 means there is a 5 percent probability of observing a result at least as extreme as the one found, assuming the null hypothesis is true. It sits at the conventional threshold for statistical significance.

Is a P Value of 0.01 More Significant Than 0.05?

A p value of 0.01 indicates stronger evidence against the null hypothesis than a p value of 0.05, since data that extreme would be even less likely if the null hypothesis were true. However, this concerns statistical significance only; a smaller p value does not mean greater real-world importance. A p value of .03 (i.e., < .05) for a 30% reduction in triglyceride levels describes a more clinically meaningful finding than a p value of .0001 for a 2% reduction in triglyceride levels.

Can a P Value Be Exactly Zero?

In practice, no. A p value can be extremely small, and software rounding sometimes displays it as 0.000. The correct way to report this is p < .001 rather than p = .000, since the true probability is never exactly zero.

What Is Considered a Statistically Significant P Value in Medical Research?

Most medical research uses an alpha level of 0.05, so a p value below 0.05 is generally considered statistically significant, though some fields, such as genomics, require much stricter thresholds due to multiple comparisons.

How Do You Report a P Value in APA 7th Edition Format?

APA 7th edition format reports exact p values without a leading zero, in italics, such as p = .032, or p < .001 for very small values, typically alongside the relevant test statistic and degrees of freedom.

What Is the Difference Between a P Value and a Confidence Interval?

A p value gives a single probability tied to a specific null hypothesis, while a confidence interval provides a full range of plausible values for the true effect, which often communicates more practical information.

Why Do Some Studies Report P Less Than 0.001 Instead of an Exact Value?

Researchers report p < .001 when the calculated value is smaller than that threshold, since listing many additional decimal places adds little useful information and follows standard reporting conventions.

Can You Have a Statistically Significant P Value With a Small Effect Size?

Yes, this happens often in large samples, where even a tiny, practically unimportant difference can produce a p value below 0.05. This is exactly why effect size should always be reported alongside the p value.
