Correlation analysis: Types, when and how to conduct it
In this article, you’ll learn
- What is correlation analysis?
- What are correlation coefficients?
- Types of correlations
- When to use correlation analysis in biomedical research
- Assumptions for correlation analysis
- Pearson correlation vs Spearman correlation: When to use each
- What is the difference between correlation and regression?
- Data collection methods for correlational research
- Confounding variables in correlation analysis
- Uses of correlation analysis in biomedical research
- Five key precautions for correlation analysis
- How to report correlation results: Example format
- Frequently Asked Questions
What is correlation analysis?
Correlation analysis is a statistical method that helps biomedical researchers uncover relationships between two or more variables in their data. It determines whether a connection or association exists between different factors, such as the correlation between two biomarkers, the relationship between a treatment and patient outcomes, or the interplay between genetic factors and disease risk.
The primary goal is to assess the strength and direction of the relationship between variables. Importantly, correlation analysis identifies whether variables are related, but it does NOT establish whether one variable causes changes in the other.
What are correlation coefficients?
Researchers use a correlation coefficient to quantify the relationship between variables. The most commonly used are:
- Pearson correlation coefficient (r) – used for continuous, approximately normally distributed data with a linear relationship
- Spearman’s rank correlation coefficient (rho or ρ) – used for ordinal data, or for continuous data that are not normally distributed
Both coefficients range from -1 to +1:
| Correlation Coefficient Value | Meaning |
| +1.0 | Perfect positive correlation: as one variable increases, the other always increases proportionally |
| +0.7 to +0.99 | Strong positive correlation |
| +0.3 to +0.69 | Moderate positive correlation |
| +0.01 to +0.29 | Weak positive correlation |
| 0 | No correlation: no linear (Pearson) or monotonic (Spearman) relationship between the variables |
| -0.29 to -0.01 | Weak negative correlation |
| -0.69 to -0.3 | Moderate negative correlation |
| -0.99 to -0.7 | Strong negative correlation |
| -1.0 | Perfect negative correlation: as one variable increases, the other always decreases proportionally |
Note: These ranges are conventional guidelines and may vary by research field and context.
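As a minimal sketch of how the coefficient and the labels in the table above fit together, the Python snippet below computes Pearson's r from its definition and maps the result to a strength label. The function names and the dose/response numbers are hypothetical, chosen only for illustration, and the cut-offs simply mirror the conventional ranges in the table.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def describe_strength(r):
    """Map a coefficient to the conventional labels used in the table above."""
    size = abs(r)
    if size == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    if size >= 0.7:
        label = "strong"  # this sketch also labels a perfect +/-1 as strong
    elif size >= 0.3:
        label = "moderate"
    else:
        label = "weak"
    return f"{label} {direction}"

# Hypothetical data: drug dose (mg) and a response score
dose = [10, 20, 30, 40, 50]
response = [12, 25, 31, 48, 52]

r = pearson_r(dose, response)
print(f"r = {r:.2f} ({describe_strength(r)})")
```

In practice you would normally rely on a statistics package rather than hand-written code; the point here is only to show what the coefficient measures.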
Types of correlations
Correlations can be classified in three ways:
Positive and negative correlation
| Type | Definition | Example |
| Positive correlation | As one variable increases, the other also increases | Blood pressure and risk of stroke; drug dose and therapeutic effect |
| Negative correlation | As one variable increases, the other decreases | Exercise frequency and body weight; treatment adherence and hospitalization rate |
| No correlation | Changes in one variable are not associated with changes in the other | Eye color and blood type |
Linear and non-linear correlations
| Type | Definition | Example |
| Linear correlation | Constant rate of change in one variable relative to another; appears as a straight line on a scatter plot | Height and weight; age and cholesterol level |
| Non-linear correlation | Inconsistent rate of change; the relationship changes across the range of values | Medication dose and therapeutic benefit (may plateau at higher doses); temperature and enzyme activity |
Simple, multiple, and partial correlations
| Type | Definition | Example |
| Simple correlation | Examines the relationship between only two variables | Smoking status and lung cancer risk |
| Multiple correlation | Examines the relationship between one variable and two or more other variables considered together | How rainfall, fertilizer quality, and sunlight together relate to crop yield |
| Partial correlation | Examines the relationship between two variables while controlling for other variables | Association between age and disease severity while controlling for genetic factors and comorbidities |
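For the simplest case of a single control variable z, the partial correlation between x and y can be written in terms of the three pairwise Pearson coefficients. The formula below is the standard first-order form; controlling for several variables at once (as in the table's example) is usually done with statistical software instead.

```latex
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{\left(1 - r_{xz}^{2}\right)\left(1 - r_{yz}^{2}\right)}}
```

As a worked example with hypothetical values r_xy = 0.60, r_xz = 0.50, and r_yz = 0.40, the numerator is 0.60 − 0.20 = 0.40 and the denominator is √(0.75 × 0.84) ≈ 0.79, giving a partial correlation of about 0.50. Part of the original association is accounted for by z, but not all of it.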
When to use correlation analysis in biomedical research
Use correlation analysis when:
- You want to identify whether an association exists between variables without manipulating them
- Variables cannot be controlled or manipulated for ethical, practical, or feasibility reasons (e.g., studying effects of disease on patients)
- You need to understand natural relationships in real-world settings
- You want to generate hypotheses for future experimental research
- Your research question focuses on “Is there a relationship?” rather than “Does one variable cause the other?”
Do NOT use correlation analysis when:
- Your goal is to establish causation
- You can conduct an experimental study where you manipulate variables
- You need to predict outcomes based on multiple independent variables (consider regression analysis instead)
- Your data violates the assumptions required for your chosen correlation method
Assumptions for correlation analysis
The validity of correlation analysis depends on meeting specific statistical assumptions. The requirements differ based on your data type:
Assumptions for Pearson’s correlation (r)
Use Pearson correlation only when all of these are true:
- Both variables are continuous and measured on an interval or ratio scale
- Both variables are approximately normally distributed (this matters mainly for the validity of significance tests and confidence intervals)
- The relationship between variables is linear (check with scatter plot)
- There are no significant outliers distorting the relationship
- Sample size is adequate (generally n > 30 recommended)
Assumptions for Spearman correlation (ρ)
Use Spearman correlation when:
- Data is ordinal, ranked, or continuous but non-normally distributed
- The relationship is monotonic (consistently increasing or consistently decreasing) but not necessarily linear
- You have smaller sample sizes or significant outliers
- You prefer a non-parametric approach
Pearson correlation vs Spearman correlation: When to use each
| Characteristic | Pearson (r) | Spearman (ρ) |
| Data type required | Continuous, interval or ratio scale | Ordinal, ranked, or continuous |
| Distribution required | Normal distribution | No normality requirement |
| Relationship type | Linear only | Any monotonic relationship (linear or non-linear) |
| Sensitivity to outliers | High (affected by extreme values) | Low (uses ranks, not raw values) |
| Sample size flexibility | Better with larger samples | Works with smaller samples |
| Parametric or non-parametric | Parametric | Non-parametric |
| Biomedical example | Correlation between body weight and blood pressure | Correlation between pain scale ranking and mobility rating |
What is the difference between correlation and regression?
Correlation and regression both examine relationships between variables but serve different purposes:
| Aspect | Correlation Analysis | Regression Analysis |
| Main purpose | Determine if relationship exists between variables | Predict value of dependent variable from independent variable(s) |
| Direction of relationship | Symmetric (no causal direction implied) | Directional (independent variable predicts dependent variable) |
| Type of variables | Both variables treated equally | Distinguishes between predictor and outcome |
| What it measures | Strength and direction of relationship | How much the outcome is expected to change for each unit change in a predictor |
| Prediction ability | Limited prediction capability | Designed for prediction |
| Complexity | Simple; usually examines two variables at a time | Can handle multiple predictor variables in one model |
| Output | Correlation coefficient (r or ρ) | Regression equation; R-squared value |
| When to use | Exploratory analysis, hypothesis generation | When you want to predict outcomes |
Example:
Correlation analysis asks “Are hours studied and exam scores related, and how strongly?” Regression analysis asks “Can we predict exam score from hours studied, and by how much does the score change per additional hour?”
Data collection methods for correlational research
Since variables are not manipulated, data can be collected through multiple methods:
| Method | How It Works | Advantages | Disadvantages | Example |
| Naturalistic observation | Observe and record variables in their natural setting without intervention | Captures real-world behavior; realistic results; no artificial conditions | Cannot control variables; time-consuming; risk of researcher bias | Observing medication adherence patterns in patients at a clinic |
| Surveys and questionnaires | Participants complete surveys about variables of interest | Large sample sizes possible; cost-effective; quick data collection | Response bias; poorly designed questions affect results; unrepresentative sample | Questionnaire correlating stress levels with sleep quality in healthcare workers |
| Archival/secondary data | Analyze existing records, databases, or historical data | Free or low-cost; large datasets; long-term trend data; no participant burden | May be incomplete or unreliable; limited control over what was measured; data may not match your research question exactly | Using hospital records to correlate hospital stay duration with infection rates over 5 years |
Confounding variables in correlation analysis
A confounding variable is a third factor that influences both variables you are studying, creating a false or misleading correlation.
Examples:
- You find a strong positive correlation between ice cream sales and drowning deaths across the months of the year. It might appear that ice cream causes drowning, but the confounding variable is seasonal temperature: warmer weather leads to both higher ice cream sales and more people swimming, which increases drowning risk. Temperature, not ice cream, drives both trends.
- A study finds a correlation between coffee consumption and heart disease risk. However, smoking may be a confounding variable: if people who drink more coffee are also more likely to smoke, then smoking rather than coffee may account for the higher heart disease risk.
Why confounders matter:
When confounding variables exist, the observed correlation does not reflect the true relationship between your two variables of interest. This is one of the main reasons why correlation does not imply causation.
How to handle confounding variables:
- Identify potential confounding variables in your study design
- Measure confounding variables and report them
- Use statistical methods like partial correlation to control for their effects
- Acknowledge limitations in your results
- Recommend further experimental research to establish causation
Uses of correlation analysis in biomedical research
Biomedical researchers employ correlation analysis for:
- Investigating associations between risk factors and disease (smoking and lung cancer, blood pressure and heart disease severity)
- Assessing relationships between biomarkers and disease progression
- Exploring connections between genetic factors and disease risk
- Identifying potential diagnostic or prognostic biomarkers
- Understanding treatment response patterns
- Examining relationships in health behavior studies
- Analyzing relationships in epidemiological data
- Generating hypotheses for further experimental investigation
By understanding these relationships, researchers can identify potential biomarkers, risk factors, or treatment strategies crucial for advancing disease understanding, optimizing patient care, and developing new therapies.
Five key precautions for correlation analysis
1. Direction matters
- Always state whether a correlation is positive or negative, unless you report the correlation coefficient itself (which carries a minus sign if the correlation is negative)
- Example: Say “a negative correlation exists” or “r = -0.52”, but not simply “a correlation exists”
2. Be precise about strength
- Report exact correlation coefficient values in your abstract (e.g., r = 0.68) rather than vague terms like “strong” or “weak”
- In the Methods section, define your classification system: specify what ranges you consider strong, moderate, and weak (e.g., “r values of 0.7-1.0 were considered strong”)
- Document whether your strength classifications follow conventional guidelines or are field-specific
3. Assumptions must be met
- The type of data you have determines which correlation analysis you should run
- For continuous, normally distributed variables with a linear relationship → Pearson’s r
- For ordinal data, or continuous variables that are not normally distributed → Spearman’s rho
- Check and document that your data meets the required assumptions
- Report any violations of assumptions and how you addressed them
4. Presentation accuracy matters
- The “r” in Pearson’s r is always lowercase
- The “ρ” in Spearman’s ρ is the Greek letter rho, NOT the English letter “p”
- Always include the correlation coefficient value
- Report p-values to indicate statistical significance (e.g., r = 0.65, p < 0.001)
- Use correct statistical notation throughout your manuscript
5. Correlation does not imply causation
- Just because two variables correlate does NOT mean one causes the other
- Confounding variables may create the appearance of correlation
- The relationship could be coincidental
- Multiple competing explanations may exist for an observed correlation
- Never draw causal conclusions from correlation analysis alone
- Recommend controlled experimental studies to establish causation
- Acknowledge alternative explanations for the correlation in your discussion
How to report correlation results: Example format
When presenting correlation findings in your biomedical paper:
Abstract example:
“A moderate positive correlation was found between patient age and disease severity (r = 0.54, p < 0.001, 95% CI [0.42, 0.64]).”
Methods section example:
“Pearson correlation coefficients were calculated to assess the relationship between continuous variables. Correlation strength was classified by the absolute value of the coefficient as weak (|r| < 0.3), moderate (0.3 ≤ |r| < 0.7), or strong (|r| ≥ 0.7). Statistical significance was set at p < 0.05. Spearman’s rho was used for variables not meeting normality assumptions as assessed by the Shapiro-Wilk test.”
Results section example:
“Systolic blood pressure showed a strong positive correlation with left ventricular mass (r = 0.72, p < 0.001). Body mass index was moderately correlated with insulin resistance (ρ = 0.58, p = 0.002), and this relationship remained significant when controlling for age as a confounding variable (partial r = 0.51, p = 0.008).”
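The confidence interval in the abstract example is the kind of value that can be obtained with the Fisher z-transformation, a common approximation for Pearson's r. The sketch below is one way to compute and format it; the function names are hypothetical, and the sample size of 160 is an assumption made for illustration because the example does not state one.

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for Pearson's r via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                # transform r to the z scale
    se = 1 / math.sqrt(n - 3)        # standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then transform back to the r scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)

def format_correlation(r, n, confidence=0.95):
    """Format r and its interval in the reporting style used above."""
    lower, upper = correlation_ci(r, n, confidence)
    return f"r = {r:.2f}, {confidence:.0%} CI [{lower:.2f}, {upper:.2f}]"

# r from the abstract example; n = 160 is an assumed sample size
report = format_correlation(0.54, 160)
print(report)
```

The interval is not symmetric around r because the transformation back from the z scale compresses values near ±1. The p-value is not computed here; it would normally come from the same software that produces r.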
Frequently Asked Questions
Q1: If I find a correlation, does that mean one variable causes the other?
A: No. Correlation identifies that a relationship exists between two variables, but it does not establish causation. Several explanations are possible for any observed correlation: (1) Variable A causes Variable B, (2) Variable B causes Variable A, (3) a confounding third variable influences both, or (4) the association arose by chance.
For example, a study might find a correlation between hospital admissions and ice cream sales, but seasonal temperature is the confounding variable driving both. Well-controlled experimental studies are the most direct way to establish causation.
Q2: How large should my sample size be to calculate a valid correlation?
A: A minimum sample size of 30 is generally recommended for Pearson correlation, though larger samples (n > 100) are preferred for more reliable estimates, especially if your data may have outliers. Spearman correlation can work with smaller samples. The required sample size depends on the expected correlation strength: smaller correlations require larger samples to detect. Use statistical power analysis software to calculate the specific sample size needed for your research based on your expected effect size and desired statistical power.
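As a rough sketch of what power analysis software does for a correlation, the function below uses the widely taught Fisher z approximation, n ≈ ((z for α/2 + z for power) / atanh(r))² + 3, for a two-sided test. The function name and default values (α = 0.05, power = 0.80) are assumptions for illustration; dedicated software may give slightly different answers.

```python
import math
from statistics import NormalDist

def sample_size_for_correlation(expected_r, alpha=0.05, power=0.80):
    """Approximate n needed to detect a correlation of expected_r (two-sided test)."""
    if not 0 < abs(expected_r) < 1:
        raise ValueError("expected_r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    effect = math.atanh(abs(expected_r))           # Fisher z of the expected r
    return math.ceil(((z_alpha + z_power) / effect) ** 2 + 3)

# Smaller expected correlations require much larger samples
for expected in (0.5, 0.3, 0.1):
    n_needed = sample_size_for_correlation(expected)
    print(f"expected r = {expected}: n = {n_needed}")
```

Under these assumptions, a rule-of-thumb sample of about 30 is only enough to reliably detect fairly large correlations; weak correlations need samples in the hundreds.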
Q3: What if my data is not normally distributed?
A: If your data violates the normality assumption required for Pearson correlation, use Spearman’s rho (rank correlation) instead. Spearman correlation does not assume normal distribution and is more robust to outliers since it ranks data rather than using raw values. Alternatively, you can transform your data (e.g., log transformation) to approach normality, though this changes interpretation. Always check assumptions and report which method you used and why.
Q4: How do I interpret a correlation coefficient of 0.35?
A: A correlation of r = 0.35 indicates a positive relationship at the lower end of the moderate range. Using the conventional guidelines above, values below 0.3 are classified as weak and values from 0.3 to 0.69 as moderate, so 0.35 is moderate, although some fields use different cut-offs. The interpretation also depends on context: in some biomedical fields, even a weak correlation may be clinically meaningful. Always report the exact value, the p-value (statistical significance), and the confidence interval rather than just stating “weak” or “strong.” Discuss what the correlation means in practical terms for your research.
Q5: Can I use correlation analysis to predict future patient outcomes?
A: Correlation analysis alone is not designed for prediction. While correlation identifies relationships, regression analysis is better suited for making predictions. Regression analysis builds an equation that estimates how changes in predictor variables relate to changes in an outcome variable. If your goal is prediction (e.g., “Can we predict patient recovery time from their initial severity score?”), use linear or logistic regression instead. Correlation analysis is better for exploratory research and hypothesis generation.
References
- The BMJ Statistics at Square One. (n.d.). Correlation and regression. https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression
- Australian Bureau of Statistics. (n.d.). Correlation and causation. https://www.abs.gov.au/statistics/understanding-statistics/statistical-terms-and-concepts/correlation-and-causation
- National Library of Medicine. (n.d.). Methods for correlational studies. In: Handbook of eHealth Evaluation: An Evidence-based Approach. https://www.ncbi.nlm.nih.gov/books/NBK481614/




