Effect Size in Research: Definition, Calculation, Reporting, Examples

Glossary of Key Terms

The following terms are used throughout this guide. Familiarity with them before reading will aid comprehension.

Term | Definition
Absolute Effect Size | The raw, unstandardized difference between group outcomes (e.g., a mean difference of 5 mg/dL). Useful when the measurement scale has inherent meaning.
Standardized Effect Size | A unitless index that expresses the magnitude of an effect relative to variability, enabling comparisons across studies using different scales.
Cohen’s d | A standardized mean difference: the difference between two group means divided by the pooled standard deviation. Widely used in experimental research.
Hedges’ g | A variant of Cohen’s d corrected for upward bias in small samples (roughly n < 20 per group). Preferred when sample sizes are small or unequal.
Pearson’s r | A correlation coefficient ranging from −1 to +1, expressing the strength and direction of a linear relationship between two continuous variables.
R² (R-squared) | The proportion of variance in the outcome variable explained by the predictor(s) in a regression model; ranges from 0 to 1.
Eta-squared (η²) | Proportion of total variance in an outcome attributable to a factor in ANOVA. Analogous to R².
Partial Eta-squared (ηp²) | Like η², but excludes variance explained by other factors in the model from the denominator; typically larger than η² in multi-factor designs.
Omega-squared (ω²) | A less biased, population-level estimate of variance explained; preferred over η² for smaller samples.
Odds Ratio (OR) | The ratio of the odds of an event occurring in one group versus another. Common in case-control and logistic regression studies.
Relative Risk (RR) | The ratio of the probability of an event in the exposed group to the probability in the unexposed group. Used in cohort studies and RCTs.
Absolute Risk Reduction (ARR) | The arithmetic difference in event rates between control and treatment groups. Directly communicates clinical benefit.
Number Needed to Treat (NNT) | The number of patients that must receive a treatment for one additional patient to benefit; calculated as 1/ARR.
Hazard Ratio (HR) | Ratio of the hazard rate (instantaneous risk of an event) in one group versus another over time; common in survival analysis.
Statistical Power | The probability that a study will correctly detect a true effect when one exists. Conventionally set at ≥ 0.80.
Type I Error (α) | Incorrectly rejecting a true null hypothesis (a false positive). The threshold α is conventionally set at 0.05.
Type II Error (β) | Failing to detect a true effect (a false negative). Power = 1 − β.
Confidence Interval (CI) | A range of plausible values for the true population effect, constructed so that a specified proportion (usually 95%) of such intervals, across repeated samples, would contain the true value.
Meta-analysis | A statistical method that pools effect sizes from multiple independent studies to produce a combined, more precise estimate.
Publication Bias | The tendency for studies with statistically significant results to be published more often, inflating apparent effect sizes in the literature.

Key Takeaways

  • Effect size quantifies how large or meaningful an effect is: it answers “how much?” where a p-value only answers “is there something?”
  • Statistical significance depends heavily on sample size; a tiny, clinically meaningless difference can reach p < 0.001 in a large enough study.
  • Unlike the p-value, the expected magnitude of an effect size does not grow with sample size (only its precision improves), making it the more honest descriptor of a finding’s real-world importance.
  • Always report both effect size and its confidence interval alongside your p-value. Neither metric alone tells the full story.
  • Choose your effect size measure to match your study design: Cohen’s d for comparing two means, OR/RR for binary outcomes, r for correlations, η² or ω² for ANOVA.
  • In biomedical research, absolute effect size measures (ARR, NNT, mean difference in clinical units) are often more clinically interpretable than standardized ones.
  • Cohen’s “small/medium/large” benchmarks (d = 0.2 / 0.5 / 0.8) are rough defaults, not universal standards: what is large in one field may be trivial in another.
  • Estimating an expected effect size before data collection is essential for calculating the sample size needed to achieve adequate statistical power.
  • Effect sizes are the currency of meta-analysis; reporting them enables your findings to be synthesized with future studies.
  • Publication bias inflates effect sizes in the literature; pre-registered studies tend to report substantially smaller, more reliable effects.
  • A statistically non-significant result with a large effect size and wide confidence interval may indicate underpowering, not absence of effect.

What Is Effect Size?

Imagine two clinical trials, each testing a new antihypertensive drug. Trial A reports p = 0.04; Trial B reports p = 0.0001. Which drug is better? The answer is: we cannot tell from p-values alone. Trial B might simply have enrolled ten times as many patients, making even a trivial 0.3 mmHg blood pressure reduction reach statistical significance. What we need is effect size.

Effect size is a quantitative measure of the magnitude of a phenomenon: how much of a difference, how strong a relationship, or how large a proportion of variance is explained. Statisticians Jacob Cohen and Gene Glass, two of the most influential methodologists of the twentieth century, famously argued that effect size, not the p-value, is the primary product of quantitative research. Cohen wrote: “The primary product of a research inquiry is one or more measures of effect size, not p values.”

Absolute vs. Standardized Effect Size

Effect sizes come in two broad forms:

Feature | Absolute Effect Size | Standardized Effect Size
Definition | Raw difference in original measurement units | Unitless index scaled by variability or range
Example | Mean SBP fell 8 mmHg with drug vs. 3 mmHg with placebo: absolute effect = 5 mmHg | Cohen’s d = 0.6 (effect is 0.6 standard deviations)
Best used when | The measurement scale has inherent clinical meaning (blood pressure, body weight, HbA1c) | Comparing across studies with different scales, or when the scale lacks intuitive meaning (Likert scores, psychological tests)
Limitation | Cannot be compared across studies using different scales | Requires knowledge of the population SD; can be misleading if SDs differ across studies

In biomedical research, absolute effect sizes are often preferable for communicating clinical relevance, while standardized effect sizes are indispensable for meta-analyses and cross-study comparisons.

Why the P-Value Is Not Enough

The p-value answers one specific question: given that the null hypothesis is true, how probable is it to observe a result at least as extreme as the one obtained? It does not answer how large, how meaningful, or how clinically important the effect is.

The Sample Size Problem

Statistical significance is mathematically tied to sample size. With a large enough sample, virtually any non-zero difference will be detectable: even one too small to matter in practice.

Example

A study of 50,000 patients compares two statin formulations. One formulation reduces LDL cholesterol by a mean of 0.4 mg/dL more than the other. This minuscule difference might yield p < 0.001, which is highly “significant”, yet it has no meaningful clinical relevance. The p-value cannot distinguish this from a genuinely important finding.

Conversely, a pilot study with n = 15 per group might show a clinically meaningful 20-point improvement in a pain scale but fail to reach p < 0.05, leading a researcher to incorrectly dismiss a potentially important treatment.

Scenario | p-value says | Effect size says | Correct conclusion
Large n (50,000), trivial difference (0.4 mg/dL LDL) | p < 0.001 ✓ significant | d = 0.03 (negligible) | Statistically significant but clinically meaningless
Small n (15/group), large raw difference (20-pt pain scale) | p = 0.12 ✗ not significant | d = 0.58 (medium) | Clinically important; study likely underpowered
Moderate n (200/group), moderate difference | p = 0.003 ✓ significant | d = 0.30 (small-medium) | Statistically and clinically worth noting
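
The dependence of the p-value on sample size can be illustrated with a short calculation. The following is a minimal sketch, not a substitute for a proper t-test: it assumes two equal-sized groups and uses a normal (z) approximation, and the function name `two_sided_p` and the value d = 0.03 are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Approximate two-sided p-value for a two-group mean comparison.

    Assumption: equal group sizes and a normal approximation to the
    independent-samples t-test (reasonable only when n is large).
    """
    z = abs(d) * sqrt(n_per_group / 2)  # test statistic implied by d and n
    return 2 * (1 - NormalDist().cdf(z))


# The same negligible standardized effect, evaluated at growing sample sizes
for n in (100, 1_000, 25_000, 100_000):
    print(f"d = 0.03, n per group = {n:>7}: p = {two_sided_p(0.03, n):.4f}")
```

The effect size passed in never changes; only n does, and the p-value moves from clearly non-significant to well below conventional thresholds.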

Key Differences at a Glance

Property | p-value | Effect Size
What it measures | Probability of data given null hypothesis | Magnitude of the phenomenon
Affected by sample size? | Yes: larger n lowers p even for tiny effects | No: independent of sample size
Communicates clinical importance? | No | Yes (especially absolute measures)
Required for power analysis? | Only partly (sets alpha threshold) | Yes: essential input for sample size calculation
Required for meta-analysis? | No | Yes: the primary unit of synthesis
Sufficient to publish findings? | Increasingly, no: journals now require ES reporting | Increasingly mandatory

Types of Effect Size Measures

Choosing the right effect size measure depends on your study design, the type of outcome variable, and the number of groups being compared. The table below is a quick reference; detailed explanations follow.

Measure | Study Design / Data Type | Range | Interpretation
Cohen’s d | Two-group comparison of continuous means (t-test) | −∞ to +∞ | Number of SDs between group means
Hedges’ g | Same as d but corrected for small/unequal samples | −∞ to +∞ | Preferred when n < 20 per group
Pearson’s r | Correlation between two continuous variables | −1 to +1 | Strength and direction of linear relationship
R² / r² | Simple or multiple linear regression | 0 to 1 | Proportion of variance explained
Eta-squared (η²) | One-way ANOVA | 0 to 1 | Proportion of total variance due to factor
Partial η² (ηp²) | Multi-factor or repeated-measures ANOVA | 0 to 1 | Variance due to factor, excluding other factors
Omega-squared (ω²) | ANOVA (less biased than η²) | 0 to 1 | Population-level variance estimate; preferred for small n
Cohen’s f | ANOVA (overall) | 0 to ∞ | Relates to η²; f = √(η² / (1 − η²))
Odds Ratio (OR) | Binary outcome; case-control, logistic regression | 0 to ∞ | OR > 1: increased odds in exposure; OR < 1: decreased odds
Relative Risk (RR) | Binary outcome; RCTs and cohort studies | 0 to ∞ | RR > 1: increased risk; RR < 1: protective
Absolute Risk Reduction (ARR) | Binary outcome; RCTs | −1 to 1 | Actual difference in event rates (negative if treatment increases risk); most clinically direct
Number Needed to Treat (NNT) | RCTs; binary outcomes | 1 to ∞ | Patients to treat for one additional benefit; lower = better
Hazard Ratio (HR) | Survival / time-to-event outcomes | 0 to ∞ | HR < 1: reduced hazard in treatment group
Cohen’s w | Chi-square / goodness-of-fit tests | 0 to ∞ | Magnitude of association in contingency tables
Cramér’s V | Chi-square; nominal categorical data | 0 to 1 | Strength of association; 0 = none, 1 = perfect

Calculating Common Effect Size Measures

Cohen’s d (and Hedges’ g)

Cohen’s d is the most widely used standardized effect size for comparing two group means. It is calculated as:

d = (M₁ − M₂) / SD_pooled

Where M₁ and M₂ are the group means and SD_pooled is the pooled standard deviation:

SD_pooled = √[((n₁−1)×SD₁² + (n₂−1)×SD₂²) / (n₁+n₂−2)]

Hedges’ g applies a small-sample correction factor J ≈ 1 − 3 / (4df − 1), where df = n₁ + n₂ − 2, so that g = J × d. Use Hedges’ g when group sizes are small (n < 20) or unequal.

Example:

A trial compares a new analgesic to placebo using a 100-mm visual analogue scale (VAS). The drug group mean = 32 mm (SD = 20); placebo mean = 48 mm (SD = 22). Assuming equal group sizes, SD_pooled = √((20² + 22²) / 2) ≈ 21, so Cohen’s d = (48 − 32) / 21 ≈ 0.76, a medium-to-large effect. This tells you the treatment group scored, on average, 0.76 standard deviations better than placebo: a clinically meaningful improvement.
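
The calculation in this example can be sketched in a few lines of Python. The means and SDs come from the example; the group size of 30 per arm is an assumption introduced here (the example does not state it), and the function names are illustrative rather than taken from any library.

```python
from math import sqrt


def pooled_sd(sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1: float, m2: float, sd1: float, n1: int, sd2: float, n2: int) -> float:
    """Standardized mean difference (M1 - M2) / SD_pooled."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def hedges_g(d: float, n1: int, n2: int) -> float:
    """Apply the small-sample correction J = 1 - 3 / (4*df - 1) to d."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))


n = 30  # hypothetical group size; not given in the example
d = cohens_d(48, 32, 22, n, 20, n)  # placebo minus drug, so positive = benefit
g = hedges_g(d, n, n)
print(f"d = {d:.2f}, g = {g:.2f}")
```

With equal group sizes the pooled SD does not depend on n, so d matches the hand calculation; g is slightly smaller, and the gap widens as the groups shrink.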

Pearson’s r and R²

Pearson’s r measures the strength and direction of a linear association between two continuous variables. It is directly interpretable as a standardized effect size. R² (the square of r) represents the proportion of variance shared between two variables and is the most commonly reported effect size in regression analyses.

Example:

r = 0.45 between serum CRP and BMI means 20.25% of the variance in CRP is explained by BMI (R² = 0.45² = 0.2025).

Eta-squared, Partial Eta-squared, and Omega-squared (ANOVA)

When comparing more than two groups using ANOVA, the following variance-explained metrics are used:

Measure | Formula | Notes
Eta-squared (η²) | η² = SS_effect / SS_total | Straightforward but biased upward; overestimates population effect size
Partial Eta-squared (ηp²) | ηp² = SS_effect / (SS_effect + SS_error) | Standard output in SPSS; separates out variance from other factors; always ≥ η²
Omega-squared (ω²) | ω² = (SS_effect − df_effect × MS_error) / (SS_total + MS_error) | Less biased than η²; recommended for smaller samples and for publication

Tip: SPSS reports ηp² by default. Always note which measure you are using when writing up results, and prefer ω² when the sample is small.
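
To make the three formulas concrete, here is a minimal sketch for a one-way design computed directly from raw scores. The data are invented for illustration, and `anova_effect_sizes` is a hypothetical helper, not a function from SPSS or any package. In a one-way design there are no other factors, so ηp² equals η²; the two diverge only in multi-factor designs.

```python
def anova_effect_sizes(groups: list[list[float]]) -> dict[str, float]:
    """Eta-squared, partial eta-squared and omega-squared for a one-way ANOVA."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]

    # Between-groups (effect) and within-groups (error) sums of squares
    ss_effect = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ss_total = ss_effect + ss_error

    df_effect = len(groups) - 1
    df_error = len(values) - len(groups)
    ms_error = ss_error / df_error

    return {
        "eta_sq": ss_effect / ss_total,
        "partial_eta_sq": ss_effect / (ss_effect + ss_error),
        "omega_sq": (ss_effect - df_effect * ms_error) / (ss_total + ms_error),
    }


# Hypothetical scores for three treatment groups
scores = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
es = anova_effect_sizes(scores)
print({name: round(value, 3) for name, value in es.items()})
```

Even in this tiny data set ω² comes out below η², which is the small-sample bias correction at work.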

Effect Sizes for Binary Outcomes: OR, RR, ARR, NNT, and HR

Binary outcome data (event/no event) are ubiquitous in biomedical research. Multiple effect size measures exist, and they convey quite different things:

Measure | Formula | Clinical meaning | Pitfall
ARR | Risk_control − Risk_treatment | Absolute reduction in event probability; e.g., ARR = 5% means 5 fewer events per 100 patients | Ignores baseline risk; same ARR can mean very different things at different baselines
RR | Risk_treatment / Risk_control | Proportional change in risk; intuitive for comparison | Does not communicate absolute magnitude of risk; same RR looks dramatic at high baseline, trivial at low
OR | Odds_treatment / Odds_control | Common in case-control studies and logistic regression; approximates RR when events are rare | Overestimates RR when events are common (>10%); frequently misread as RR by clinicians
NNT | 1 / ARR | Number of patients to treat to prevent one event; directly actionable for clinical decision-making | Context-dependent; NNT = 50 for a cheap safe pill vs. invasive procedure mean very different things
HR | h(t)_treatment / h(t)_control | Relative instantaneous risk of the event in one group versus the other over time | Assumes proportional hazards (a hazard ratio that is constant over time), which must be checked; if hazard functions cross, HR is misleading

Example:

The Physicians’ Health Study of aspirin for prevention of myocardial infarction found a highly significant result (p < 0.00001). However, the absolute risk reduction was only 0.77 percentage points (1.71% placebo vs. 0.94% aspirin). The NNT was approximately 129, meaning that about 129 patients needed to take aspirin for one heart attack to be prevented. The p-value alone would have made the result seem far more impressive than this absolute benefit suggests.
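
The measures in the table above can all be derived from the same four numbers: events and group sizes in each arm. The sketch below uses invented counts (124 events in 1,000 control patients, 81 in 1,000 treated patients), and `binary_effect_sizes` is a hypothetical helper written for illustration.

```python
def binary_effect_sizes(
    events_treat: int, n_treat: int, events_ctrl: int, n_ctrl: int
) -> dict[str, float]:
    """ARR, RR, OR and NNT from event counts in two groups."""
    risk_t = events_treat / n_treat
    risk_c = events_ctrl / n_ctrl

    arr = risk_c - risk_t                      # absolute risk reduction
    rr = risk_t / risk_c                       # relative risk
    odds_ratio = (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c))
    # NNT is undefined when ARR = 0; a negative value would indicate harm
    nnt = 1 / arr if arr != 0 else float("inf")

    return {"ARR": arr, "RR": rr, "OR": odds_ratio, "NNT": nnt}


# Hypothetical trial: 81/1000 events on treatment vs. 124/1000 on control
result = binary_effect_sizes(81, 1000, 124, 1000)
for name, value in result.items():
    print(f"{name} = {value:.3f}")
```

Note that the OR lands further from 1 than the RR for the same data, which is the overestimation the table warns about once events are not rare.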

Interpreting Effect Size: Benchmarks and Context

Cohen’s Conventional Benchmarks

Jacob Cohen (1988) proposed widely adopted benchmarks for interpreting effect size magnitudes. These were explicitly described by Cohen himself as “conventional”: rough defaults in the absence of better field-specific norms, not universal truths.

Measure | Small | Medium | Large
Cohen’s d | 0.2 | 0.5 | 0.8
Pearson’s r | 0.1 | 0.3 | 0.5
R² / η² | 0.01 | 0.06 | 0.14
Cohen’s f (ANOVA) | 0.10 | 0.25 | 0.40
Cohen’s w (chi-square) | 0.10 | 0.30 | 0.50
OR (clinical trials) | ~1.5 | ~2.5 | ~4.3

Why You Should Not Blindly Apply Cohen’s Benchmarks

Cohen’s benchmarks have been criticized, including by Cohen himself, for encouraging mechanical interpretation divorced from scientific context. Several important caveats apply:

  • Field norms vary enormously. Effect sizes that are typical in social psychology (d ≈ 0.4–0.6) differ from those in pharmacology, genetics, or educational research. A d of 0.2 is “small” by Cohen’s standard but may represent a clinically important reduction in cardiovascular mortality.
  • Pre-registration matters. Published, non-pre-registered studies tend to report inflated effect sizes due to publication bias and questionable research practices. One systematic review found that effects from pre-registered studies (median r = 0.16) were roughly half the size of those from non-pre-registered studies (median r = 0.36).
  • Cost and risk context matters. An NNT of 100 for a cheap, safe daily pill (e.g., low-dose aspirin) may be acceptable; an NNT of 100 for an expensive drug with significant side effects would not.
  • Outcome importance matters. A small effect on all-cause mortality is inherently more important than a large effect on a surrogate endpoint.
  • Always contextualize. Compare your effect size to prior literature, clinical meaningfulness thresholds, and known benchmarks in your specific field.

The Minimum Clinically Important Difference (MCID)

In biomedical research, a particularly useful benchmark is the Minimum Clinically Important Difference (MCID): the smallest change in an outcome that patients or clinicians would consider meaningful. MCIDs are established empirically for validated clinical instruments:

  • Pain VAS (0–100 mm): MCID ≈ 10–15 mm
  • SF-36 Physical Function score: MCID ≈ 5–10 points
  • FEV₁ in asthma/COPD: MCID ≈ 100–200 mL or 12% change
  • HbA1c: MCID ≈ 0.3–0.5%

When an absolute effect size exceeds the MCID for your outcome, the finding is clinically significant regardless of its standardized effect size category. Always search the literature for the MCID of your primary outcome before interpreting results.

Effect Size, Statistical Power, and Sample Size

Effect size plays a central role in study planning. Before beginning a study, researchers must estimate the expected effect size to determine how many participants are needed to achieve adequate statistical power.

The Power Relationship

Statistical power (1 − β) is the probability of correctly detecting a true effect. Power depends on four interrelated parameters; fixing any three determines the fourth:

Parameter | Role | Typical convention
Alpha (α): significance threshold | Probability of a false positive (Type I error) | 0.05
Power (1 − β) | Probability of correctly detecting a true effect | 0.80 (sometimes 0.90)
Sample size (n) | Number of participants per group | Calculated from the other three
Effect size (ES) | Expected magnitude of the effect | Estimated from prior literature or pilot data

The key insight: to detect a small effect (d = 0.2), you need a much larger sample than to detect a large effect (d = 0.8). Below are approximate sample sizes required per group for 80% power (α = 0.05, two-tailed independent samples t-test):

Effect size (d) | Label | n per group needed
0.2 | Small | ~394
0.5 | Medium | ~64
0.8 | Large | ~26
1.0 | Very large | ~17
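
The per-group sample sizes in this table can be approximated without specialist software. The sketch below is an approximation under stated assumptions, not the method used by G*Power: it uses the normal-approximation formula n ≈ 2(z₁₋α/₂ + z_power)² / d² plus a small additive correction of z₁₋α/₂² / 4 that is commonly used to bring the result closer to the exact t-test answer. The function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-tailed independent-samples t-test."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = std_normal.inv_cdf(power)          # quantile for desired power

    n_normal = 2 * (z_alpha + z_power) ** 2 / d**2
    # Small-sample correction (an assumption of this sketch); without it the
    # normal approximation comes out about one participant too low.
    return ceil(n_normal + z_alpha**2 / 4)


for d in (0.2, 0.5, 0.8, 1.0):
    print(f"d = {d}: n per group = {n_per_group(d)}")
```

For the four effect sizes listed, this approximation returns 394, 64, 26 and 17 per group. For a real study, an exact calculation based on the noncentral t distribution is still the safer choice.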

Failure to perform a proper power analysis using a realistic effect size estimate is one of the most common methodological failings in preclinical and clinical research. Underpowered studies waste resources and frequently yield unreliable results.

Where to Get the Expected Effect Size for Power Analysis

Estimating the expected effect size before a study is critical but can be challenging. Recommended approaches include:

  • Systematic review or meta-analysis: The best source. Use the pooled effect size from high-quality studies in your area.
  • Previous studies in similar populations: Use with caution: the literature overestimates effect sizes due to publication bias; consider using a conservative (smaller) estimate.
  • Pilot data: Useful but imprecise; pilot effect sizes are themselves estimates with wide confidence intervals.
  • MCID-based estimation: Set the expected effect size equal to the MCID expressed in standard deviation units (i.e., MCID / SD_population) to ensure you are powered to detect a clinically meaningful difference.
  • Field benchmarks: Use Cohen’s “small” (d = 0.2) as a conservative default if no other information is available, to avoid an underpowered study.

How to Report Effect Size

Major journals, reporting guidelines (CONSORT, STROBE, APA), and statistical bodies now require effect sizes to be reported. A complete results report includes the effect size estimate, its confidence interval, and the p-value.

Reporting Checklist

  • State which effect size measure you used and why it was appropriate for your design.
  • Report the point estimate and 95% confidence interval for the effect size.
  • Report effect size in the Abstract, not just the body of the paper.
  • For standardized measures, also report the raw (unstandardized) difference if the scale is clinically interpretable.
  • Interpret the effect size in the context of your field: do not rely solely on Cohen’s benchmarks.
  • If comparing multiple outcomes, report effect sizes for all primary and secondary outcomes, not selectively.

Example Reporting Sentences

The following templates illustrate best-practice reporting:

  • Comparing two means: “Patients in the intervention group showed a significantly lower mean diastolic blood pressure than controls (68.4 ± 8.2 vs. 74.1 ± 9.0 mmHg; mean difference = −5.7 mmHg, 95% CI [−8.1, −3.3]; Cohen’s d = 0.67, p < 0.001).”
  • Binary outcomes: “Treatment reduced the incidence of major cardiovascular events from 12.4% to 8.1% (ARR = 4.3%, 95% CI [2.1%, 6.5%]; RR = 0.65, 95% CI [0.50, 0.85]; NNT = 23; p = 0.002).”
  • ANOVA: “A one-way ANOVA revealed a significant effect of treatment group on inflammatory marker levels (F(2,117) = 8.43, p < 0.001, ω² = 0.11), indicating a medium-to-large effect.”
  • Correlation: “There was a moderate positive correlation between serum ferritin and fatigue scores (r = 0.38, 95% CI [0.24, 0.51], p < 0.001), with ferritin explaining approximately 14% of the variance in fatigue (R² = 0.14).”

Confidence Intervals for Effect Sizes

Just as a point estimate of a mean is reported with a CI, so too should effect sizes. The CI communicates the precision of the effect estimate. A wide CI (e.g., d = 0.6, 95% CI [0.1, 1.1]) indicates high uncertainty, possibly due to small sample size. A narrow CI (e.g., d = 0.6, 95% CI [0.5, 0.7]) indicates high precision.

CIs for common measures can be calculated using dedicated software (R packages: effectsize, MBESS; SPSS; G*Power for power analysis) or online calculators.
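
When dedicated software is not to hand, approximate intervals can be computed directly. The following sketch makes two assumptions worth stating: for Cohen’s d it uses a common large-sample standard error formula rather than the exact noncentral-t interval, and for Pearson’s r it uses the Fisher z-transformation. The function names and the sample sizes in the usage lines are hypothetical.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def ci_cohens_d(d: float, n1: int, n2: int, level: float = 0.95) -> tuple[float, float]:
    """Approximate CI for Cohen's d using a large-sample standard error."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se


def ci_pearson_r(r: float, n: int, level: float = 0.95) -> tuple[float, float]:
    """CI for Pearson's r via the Fisher z-transformation."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = 1 / sqrt(n - 3)
    centre = atanh(r)  # transform to the z scale, where the SE is known
    return tanh(centre - z * se), tanh(centre + z * se)


# Same point estimate, very different precision (hypothetical sample sizes)
small = ci_cohens_d(0.6, 30, 30)
large = ci_cohens_d(0.6, 800, 800)
r_ci = ci_pearson_r(0.38, 155)
print(f"d = 0.6, n = 30/group:  [{small[0]:.2f}, {small[1]:.2f}]")
print(f"d = 0.6, n = 800/group: [{large[0]:.2f}, {large[1]:.2f}]")
print(f"r = 0.38, n = 155:      [{r_ci[0]:.2f}, {r_ci[1]:.2f}]")
```

The two intervals for d show how the same estimate of 0.6 can be either nearly uninformative or quite precise depending on sample size.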

Effect Size in Meta-Analysis

Effect size is the fundamental unit of meta-analysis. When multiple studies have examined the same question, a meta-analysis statistically pools their effect sizes to produce a single, more precise combined estimate. This requires that each contributing study report a standardized effect size (usually Cohen’s d, Hedges’ g, OR, or r) along with a measure of variance (typically the standard error or 95% CI).

Key points about effect sizes in meta-analysis:

  • Studies with larger sample sizes are given more weight in the pooled estimate, reducing the influence of small, noisy studies.
  • Heterogeneity (variability in effect sizes across studies, quantified by I² and τ²) signals that context moderates the effect and that a single pooled estimate may be misleading.
  • Publication bias tends to inflate pooled effect sizes in meta-analyses: funnel plots and tests such as Egger’s test can help detect this.
  • Not reporting effect sizes in primary studies directly harms cumulative science: your study cannot be synthesized if its effect size cannot be extracted.
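
The pooling and heterogeneity ideas above can be sketched with a fixed-effect, inverse-variance model, which is one common weighting scheme (random-effects models, which also estimate τ², are not shown). The three study results below are invented, and `fixed_effect_pool` is a hypothetical helper rather than a function from the meta package or RevMan.

```python
from math import sqrt


def fixed_effect_pool(
    effects: list[float], std_errors: list[float]
) -> tuple[float, float, float, float]:
    """Inverse-variance pooled estimate, its SE, Cochran's Q and I-squared."""
    weights = [1 / se**2 for se in std_errors]  # precise studies weigh more
    total_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total_w
    se_pooled = sqrt(1 / total_w)

    # Cochran's Q: weighted squared deviations from the pooled estimate
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, se_pooled, q, i_squared


# Hypothetical standardized mean differences and standard errors
pooled, se_pooled, q, i2 = fixed_effect_pool([0.30, 0.50, 0.40], [0.10, 0.20, 0.10])
print(f"pooled = {pooled:.3f} (SE {se_pooled:.3f}), Q = {q:.2f}, I² = {i2:.0%}")
```

Because weights are based on each study’s standard error, a primary study that reports neither an effect size nor its variance simply cannot enter this calculation.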

Special Considerations in Biomedical Research

Preclinical (Animal) Research

Effect size reporting in preclinical research is far less common than in clinical research, despite its importance. Studies in rodent models are frequently underpowered, and the lack of power calculations using realistic effect sizes contributes to poor reproducibility of preclinical findings.

  • A systematic review of the rodent fear-conditioning literature found median effect sizes much larger than what could be confirmed in replication, consistent with publication bias and small underpowered studies.
  • Preclinical researchers should calculate and report standardized mean differences (Cohen’s d or Hedges’ g) for all primary outcomes.
  • Expressing preclinical effects as percentage change relative to control (an unstandardized, relative measure) helps translate findings into clinically interpretable language.

Epidemiology and Observational Studies

In epidemiology, OR, RR, HR, and their absolute counterparts (ARR, attributable risk) are the standard effect size measures. Important considerations:

  • Odds Ratio vs. Relative Risk confusion: The OR overestimates the RR when outcome prevalence is above 10%. In studies with common outcomes, RR (or risk difference) is preferable for clinical communication.
  • Hazard Ratio assumption: The Cox proportional hazards model assumes that the ratio of hazards is constant over time (proportional hazards assumption). Violation of this assumption makes the HR difficult to interpret: always check it.
  • Absolute vs. relative framing: Drug companies often report relative risk reductions (“50% reduction in events!”) which can be misleading when baseline risks are low. Always request and report the ARR and NNT.
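
How far the OR drifts from the RR depends on the baseline event rate, and the relationship can be written down directly: if the control-group risk is p₀, then RR = OR / (1 − p₀ + p₀ × OR). The sketch below applies that identity to a fixed OR of 2 at several hypothetical baseline risks; the function name `or_to_rr` is illustrative.

```python
def or_to_rr(odds_ratio: float, baseline_risk: float) -> float:
    """Relative risk implied by an odds ratio at a given control-group risk."""
    return odds_ratio / (1 - baseline_risk + baseline_risk * odds_ratio)


# The same odds ratio read at increasingly common outcomes (hypothetical risks)
for p0 in (0.01, 0.10, 0.40):
    print(f"OR = 2.0, baseline risk = {p0:.0%}: RR = {or_to_rr(2.0, p0):.2f}")
```

When the outcome is rare the two measures nearly coincide, but at a 40% baseline an OR of 2 corresponds to an RR of only about 1.4, so reading the OR as “twice the risk” would overstate the effect.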

Clinical Trials

CONSORT reporting guidelines for randomized controlled trials require reporting of estimated effect sizes with confidence intervals. Specific guidance:

  • For continuous outcomes: report mean difference and 95% CI, plus Cohen’s d if cross-study comparison is intended.
  • For binary outcomes: report at minimum ARR and RR (or OR), both with 95% CIs. Report NNT with its confidence interval.
  • For time-to-event outcomes: report hazard ratio with CI, and verify the proportional hazards assumption.
  • For non-inferiority trials: the effect size and its CI must be interpreted relative to the pre-specified non-inferiority margin, not just the null.

Common Mistakes and Misconceptions

Mistake | Why it’s wrong | What to do instead
“p < 0.05 means the effect is large” | p is a function of both effect size AND sample size. Large n can make trivial effects significant. | Always report effect size separately from p-value.
“p > 0.05 means no effect” | Failure to detect ≠ absence of effect. The study may be underpowered. | Report effect size with CI; a large ES with p > 0.05 suggests underpowering, not zero effect.
“The smaller the p, the larger the effect” | p depends on both effect size and n, so across studies with different sample sizes a smaller p need not mean a larger effect. | Use effect size to describe magnitude; use p to describe evidence against null.
Reporting η² when ω² is more appropriate | η² overestimates the population effect size, especially in small samples. | Use ω² or partial ω² for ANOVA, especially when n < 100.
Conflating OR and RR | OR exaggerates RR when event prevalence is high; readers often misinterpret OR as RR. | Report RR when possible; if OR is used, note the baseline event rate so readers can judge whether it approximates RR.
Using Cohen’s benchmarks uncritically | Small/medium/large are context-free defaults, not clinical standards. | Interpret effect size against MCID, field norms, and clinical relevance.
Reporting effect size without a CI | A point estimate alone conceals the uncertainty of the estimate. | Always report the 95% CI for any effect size.
Selective reporting of effect sizes for significant outcomes only | Inflates the apparent magnitude of effects in the literature. | Report effect sizes for all pre-specified primary and secondary outcomes, regardless of significance.

Software and Tools for Effect Size Calculation

Software / Package | Platform | Key Capabilities
G*Power | Windows / macOS (free) | Power analysis and sample size calculation for all major test types; requires expected effect size as input
effectsize (R package) | R (free) | Comprehensive; calculates d, g, r, η², ω², OR, and more; integrates with common R test output
MBESS (R package) | R (free) | Confidence intervals for effect sizes; particularly strong for ANOVA measures
ESCI (Excel / R / JASP) | Multiple (free) | Estimation-focused; displays effect sizes with CIs graphically; excellent for teaching
JASP | Windows / macOS / Linux (free) | GUI-based; outputs η², ω², d, and Bayesian effect sizes automatically
SPSS | Commercial | Reports partial η² by default for ANOVA; Cohen’s d available via COMPARE MEANS; must manually request most ES measures
Stata | Commercial | esize command; strong for OR, RR, ARR, NNT in clinical trial contexts
Python (pingouin library) | Free | Comprehensive effect size calculations integrated with statistical tests
meta (R) / RevMan (Cochrane) | R (free) / free | Meta-analysis; pools effect sizes across studies; funnel plots; heterogeneity tests

Frequently Asked Questions

1. My result is not statistically significant (p = 0.08) but the effect size looks clinically important (d = 0.55). Should I still report it?

Absolutely, and this scenario is arguably more important to report carefully than a significant result. A non-significant result with a moderate effect size is more plausibly explained by an underpowered study than by the absence of an effect. Reporting the effect size and its confidence interval (e.g., d = 0.55, 95% CI [−0.07, 1.17]) tells future researchers and meta-analysts that there is a signal worth investigating with an adequately powered study. The phrase “no significant effect was found” without an effect size is uninformative and potentially misleading. A better interpretation is: “We did not have sufficient power to detect a statistically significant effect; however, the observed effect size of d = 0.55 suggests the effect may be clinically meaningful and warrants further investigation.”

2. My supervisor insists that reporting a p-value is sufficient. How do I argue for reporting effect sizes?

Point to the requirements of major journals (JAMA, BMJ, NEJM, Nature Medicine, and APA-style journals all require effect sizes) and reporting guidelines (CONSORT for RCTs, STROBE for observational studies). The American Statistical Association’s 2016 statement explicitly warns against using p < 0.05 as the sole basis for conclusions. You can also use a concrete example: a study with n = 100,000 finds that a new diet reduces BMI by 0.05 kg/m², with p = 0.0001. This is statistically significant but has no clinical meaning. Gene Glass called statistical significance “the least interesting thing about the results.”

3. I ran a chi-square test on a 2×3 contingency table. What effect size should I report?

For chi-square tests, Cramér’s V is the appropriate standardized effect size. It ranges from 0 (no association) to 1 (perfect association) and accounts for table dimensions. Cohen’s w equals Cramér’s V whenever the smaller table dimension is 2, which covers both 2×2 and 2×3 tables. The formula is Cramér’s V = √(χ² / (n × (min(r, c) − 1))), where r and c are the number of rows and columns. In R: cramers_v() from the effectsize package. In SPSS: request it under Crosstabs > Statistics > Phi and Cramér’s V.
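
For readers who want to check the arithmetic by hand, the following sketch computes the Pearson chi-square statistic and Cramér’s V for a 2×3 table in plain Python. The counts are invented, and the `cramers_v` function defined here is a hypothetical stand-alone helper, not the R function of the same name.

```python
from math import sqrt


def cramers_v(table: list[list[int]]) -> tuple[float, float]:
    """Return (chi-square, Cramér's V) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0]))  # smaller of rows and columns
    return chi2, sqrt(chi2 / (n * (k - 1)))


# Hypothetical 2x3 table: two groups by three outcome categories
counts = [[20, 30, 50], [30, 30, 40]]
chi2, v = cramers_v(counts)
print(f"chi-square = {chi2:.2f}, Cramér's V = {v:.3f}")
```

No continuity correction is applied in this sketch, so results may differ slightly from software that applies one to 2×2 tables.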

4. I often see papers reporting partial eta-squared (ηp²) from SPSS ANOVA output. Is this the same as eta-squared?

No, and the distinction matters. Eta-squared (η²) is calculated as SS_effect / SS_total, where SS_total includes variance from all sources in the design (other factors, interactions, and error). Partial eta-squared (ηp²) removes the other factors from the denominator: SS_effect / (SS_effect + SS_error). This means ηp² will always be equal to or larger than η², sometimes substantially so in multi-factor designs. SPSS reports ηp² by default and labels it “Partial Eta Squared”, so always check which measure your software outputs. Partial values are not additive: in a design with multiple factors, the reported ηp² values can sum to more than 1.0, so they should not be added together or read as shares of total variance. For most purposes, omega-squared (ω²) or partial omega-squared (ωp²) is preferred as a less-biased estimator.

5. I keep seeing huge effect sizes in my field’s literature (e.g., d > 1.5 routinely). Does that mean my field has exceptionally strong effects?

Almost certainly not: this is likely a symptom of publication bias and small, underpowered studies. Research consistently shows that published effect sizes in many fields are inflated. Small studies that detect very large effects are those most likely to be published (because they reach significance). The many small studies that found small or no effects are suppressed (the “file drawer problem”). Studies with pre-registration show effect sizes roughly half those of non-pre-registered studies in the same field. When planning your own study, it is safer to use a conservative (smaller) effect size estimate from pre-registered work or the lower bound of a meta-analytic confidence interval, rather than the typical published effect size.

6. I read that an NNT of 50 is considered “large” (i.e., a weak treatment). But in cancer prevention, an NNT of 50 is often considered excellent. What’s going on?

NNT is inherently context-dependent and cannot be evaluated in isolation from the severity of the disease, the cost and side-effect profile of the treatment, and the alternatives available. For a cheap, safe, once-daily aspirin that prevents a fatal MI (NNT ≈ 129), that NNT is considered worthwhile. For an expensive immunotherapy with significant toxicity that prevents a recurrence of a deadly cancer (NNT ≈ 8), the benefit-to-risk calculus is completely different. There is no universal threshold for a “good” NNT. Always interpret NNT alongside NNH (Number Needed to Harm), baseline event rates, treatment costs, and patient preferences.

7. Can I convert between effect size measures? For example, can I convert an odds ratio from a paper into Cohen’s d to compare it with another paper that reports d?

Yes, and there are published formulas for common conversions. The most frequently used conversion between OR and d is: d = ln(OR) × (√3 / π) ≈ ln(OR) × 0.5513. This approximation assumes a logistic distribution. For r to d: d = 2r / √(1 − r²). For d to r: r = d / √(d² + 4). These conversions are useful for meta-analyses that need to pool studies using different effect size metrics. Many software packages (e.g., the esc R package, or online converters such as Psychometrica) can perform these conversions automatically. Be aware that conversions assume certain distributional properties and may introduce additional uncertainty.
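
The three conversion formulas above are simple enough to implement directly. This is a minimal sketch with illustrative function names; it inherits the same assumptions as the formulas themselves (a logistic distribution for the OR conversion, and equal group sizes for the d and r conversions).

```python
from math import log, pi, sqrt


def or_to_d(odds_ratio: float) -> float:
    """Convert an odds ratio to Cohen's d: d = ln(OR) * sqrt(3) / pi."""
    return log(odds_ratio) * sqrt(3) / pi


def r_to_d(r: float) -> float:
    """Convert Pearson's r to Cohen's d: d = 2r / sqrt(1 - r^2)."""
    return 2 * r / sqrt(1 - r**2)


def d_to_r(d: float) -> float:
    """Convert Cohen's d to Pearson's r: r = d / sqrt(d^2 + 4)."""
    return d / sqrt(d**2 + 4)


print(f"OR = 2.0 -> d = {or_to_d(2.0):.3f}")
print(f"r = 0.30 -> d = {r_to_d(0.30):.3f}")
print(f"d = 0.50 -> r = {d_to_r(0.50):.3f}")
```

The r and d conversions are exact inverses of each other, which makes a round trip a convenient sanity check before pooling converted values.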

8. My study has multiple outcomes and some have large effect sizes while others are small. Should I pick the largest one to report as my “main” finding?

No: this practice, known as outcome switching or cherry-picking, is a form of p-hacking and inflates the literature. Your primary outcome should be pre-specified in the study protocol (or in a pre-registration), and the effect size for that outcome should be your headline result regardless of its size. Secondary outcomes with large effect sizes can be reported and highlighted, but should be clearly labelled as hypothesis-generating or exploratory. Journals increasingly check submitted papers against trial registries to flag unregistered primary outcomes. If you find unexpectedly large effects on secondary outcomes, the correct approach is to report them transparently, note that they were not pre-specified, and design a new adequately powered confirmatory study to test those outcomes as primaries.

Further Reading

The following key works informed this guide and are recommended for deeper study:

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Sullivan, G. M., & Feinn, R. (2012). Using effect size: or why the P value is not enough. Journal of Graduate Medical Education, 4(3), 279–282.
  • Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863.
  • Cumming, G. (2012). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. Routledge.
  • Ellis, P. D. (2010). The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge University Press.
  • Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137–152.
