2023.04.06
2026.07.20

What Is Regression Analysis? Types, Assumptions, Modeling, and Steps

Contents

Key Takeaways
Glossary of Key Terms
What Is Regression Analysis?
What Assumptions Must Regression Models Satisfy?
Types of Regression Analysis
Regression Analysis in Biomedical Research
Regression Analysis in the Social Sciences
What Do Regression Outputs Mean, and How Should They Be Reported?
Overfitting: What It Is and How to Avoid It
Regression in Machine Learning: A Brief Overview
How Do You Choose the Right Regression Model?
Software for Regression Analysis
Tips for Your First Regression Analysis
Frequently Asked Questions
References

Key Takeaways

Regression analysis is a statistical method that models the relationship between a dependent (outcome) variable and one or more independent (predictor) variables.
The choice of regression type depends on the nature of the outcome variable, the number of predictors, and whether the relationship is assumed to be linear or nonlinear.
Simple linear regression examines one predictor; multiple linear regression examines two or more predictors simultaneously, estimating each predictor's association with the outcome while holding the others constant.
Logistic regression is the method of choice in biomedical research when the outcome is binary (e.g., disease present or absent, survived or died).
Cox proportional hazards regression is the standard approach for survival analysis and time-to-event data in clinical and epidemiological studies.
Poisson regression is used for count outcomes such as the number of hospital admissions, disease episodes, or incidents per unit of time.
Ordinal logistic regression is appropriate when the dependent variable has ordered categories (e.g., severity scales such as mild, moderate, severe).
Polynomial and nonlinear regression extend standard regression to handle curved relationships between variables.
Ridge, Lasso, and Elastic Net regression are regularization methods that reduce overfitting when many predictors are present.
All regression models rest on a set of assumptions (for OLS linear regression: linearity, independence of errors, homoscedasticity, normality of residuals); violation of these assumptions can produce biased or unreliable results.
In social sciences, regression is used to study income inequality, educational outcomes, voting behavior, and many other phenomena involving multiple social determinants.
Regression describes association, not causation; causal inference requires additional design considerations such as randomization or quasi-experimental methods.
Key outputs to report include: regression coefficients (with confidence intervals), p-values, R-squared (or pseudo R-squared for logistic and other generalized linear models), and model fit statistics such as AIC or BIC.
Overfitting occurs when a model fits the training data too closely; it is best detected through cross-validation or holdout testing.
Multicollinearity (high correlation among predictors) inflates standard errors and makes coefficient estimates unstable; Variance Inflation Factor (VIF) is the standard diagnostic.

Glossary of Key Terms

The following terms appear throughout this guide. Familiarity with them before reading the main text will aid comprehension.

Term	Definition
Dependent Variable (Outcome)	The variable being predicted or explained by the model; also called the response variable or Y.
Independent Variable (Predictor)	The variable(s) used to predict or explain the outcome; also called explanatory variables or covariates.
Regression Coefficient (Beta)	A numeric value indicating the estimated change in the outcome variable for a one-unit increase in the predictor, holding all other predictors constant.
Intercept	The predicted value of the outcome when all predictors equal zero; corresponds to the point where the regression line crosses the Y-axis.
Residual	The difference between the observed value and the value predicted by the regression model; the sample estimate of the unobserved error term.
R-squared (R²)	The proportion of variance in the outcome variable explained by the model, ranging from 0 to 1; a higher value means the model accounts for more of the outcome's variance, although R-squared never decreases when predictors are added.
Adjusted R-squared	A modified version of R-squared that penalizes the addition of predictors that do not improve model fit; preferred in multiple regression.
p-value	The probability of observing a result at least as extreme as the one obtained if the null hypothesis were true; conventionally, p < 0.05 is considered statistically significant.
Confidence Interval (CI)	A range of plausible values for the true population parameter, constructed so that a stated proportion (commonly 95%) of such intervals from repeated samples would contain it.
Ordinary Least Squares (OLS)	The most common method for estimating regression coefficients; minimizes the sum of the squared residuals.
Maximum Likelihood Estimation (MLE)	A method used to estimate parameters in logistic and other generalized linear models by finding the parameter values most likely to have produced the observed data.
Odds Ratio (OR)	In logistic regression, the exponentiated regression coefficient; indicates how the odds of the outcome change per unit increase in the predictor.
Hazard Ratio (HR)	In Cox regression, the ratio of the hazard rate in one group compared to another; analogous to the relative risk for time-to-event data.
Multicollinearity	A condition where two or more predictors in a regression model are highly correlated with each other, leading to unstable coefficient estimates.
Variance Inflation Factor (VIF)	A diagnostic statistic for multicollinearity; values above 5 to 10 are generally considered problematic.
Homoscedasticity	The assumption that the variance of residuals is constant across all levels of the predictor variables; the opposite condition is heteroscedasticity.
Overfitting	A condition in which a model fits the training data too closely, capturing noise rather than true signal, resulting in poor generalization to new data.
Regularization	A technique that adds a penalty term to the estimation criterion to prevent overfitting; examples include Ridge (L2) and Lasso (L1) penalties.
AIC / BIC	Akaike Information Criterion and Bayesian Information Criterion: measures of model fit that penalize complexity; lower values indicate a better model.
Heteroscedasticity	Unequal variance of residuals across levels of the predictors; violates the OLS assumption of constant variance.
Categorical Variable	A variable with discrete, unordered categories (e.g., sex, ethnicity); must be converted to dummy variables before entry into most regression models.
Dummy Variable	A binary (0/1) variable created to represent membership in a category; a categorical variable with k categories requires k-1 dummy variables.
Interaction Term	A variable created by multiplying two predictors together; tests whether the effect of one predictor on the outcome depends on the value of another predictor.
Survival Analysis	A set of statistical techniques for analyzing time-to-event data where the event (e.g., death, relapse) may not have occurred for all subjects.
Censoring	In survival analysis, a subject is censored if the event has not occurred by the end of the study period or they withdrew from follow-up.
Pseudo R-squared	Analogues of R-squared for logistic or other generalized linear models; examples include Nagelkerke R-squared and McFadden R-squared.
Cross-validation	A model evaluation technique in which data are repeatedly split into training and test sets to obtain unbiased estimates of model performance.
Structural Equation Modeling (SEM)	An advanced framework that combines factor analysis and regression to model both direct and indirect pathways among variables.

What Is Regression Analysis?

Regression analysis is a family of statistical methods that models the relationship between a dependent variable (the outcome of interest) and one or more independent variables (predictors). The core purpose is twofold: to understand and quantify how changes in predictor values are associated with changes in the expected value of the outcome, and to generate predictions for new observations based on those learned relationships.

The term ‘regression’ was coined by Francis Galton in the nineteenth century to describe his observation that the heights of children of very tall or very short parents tended to ‘regress’ toward the population mean. Today the term encompasses a broad family of models used across virtually every scientific discipline.

In practical research, regression serves three principal functions:

Description: quantifying the direction and magnitude of associations between variables (e.g., the association between years of education and income).
Prediction: generating expected outcome values for new cases or scenarios based on model coefficients.
Causal inference (with caveats): when combined with appropriate study designs, regression can inform causal claims, though association alone does not establish causation.

The Basic Regression Equation

The simplest regression model takes the form:

Y = a + bX + e

Where:

Y is the dependent variable (outcome)
a is the intercept (value of Y when X = 0)
b is the regression coefficient (slope): the expected change in Y for each one-unit increase in X
X is the independent variable (predictor)
e is the error term: the portion of Y not explained by X (in a fitted model it is estimated by the residual)

In multiple regression, additional predictors (X2, X3, …, Xk) are added alongside their respective coefficients (b2, b3, …, bk). Each coefficient reflects the relationship between that predictor and the outcome while holding all other predictors constant.
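To make the equation concrete, the following is a minimal sketch that simulates data from a known model and recovers the intercept and coefficients by ordinary least squares. The variable names, the simulated values, and the use of NumPy are illustrative assumptions, not part of the original guide.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n = 500
x1 = rng.normal(size=n)                 # hypothetical predictor X1
x2 = rng.normal(size=n)                 # hypothetical predictor X2
e = rng.normal(scale=0.5, size=n)       # error term
y = 2.0 + 1.5 * x1 - 0.8 * x2 + e       # true a = 2.0, b1 = 1.5, b2 = -0.8

# Design matrix: a column of ones for the intercept, then the predictors
X = np.column_stack([np.ones(n), x1, x2])

# OLS: the coefficients that minimize the sum of squared residuals
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print("intercept, b1, b2:", np.round(coef, 2))
```

Each estimated coefficient should land close to the value used in the simulation, and the residuals average to zero because the model includes an intercept.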

How Is Regression Different From Correlation?

Correlation measures the strength and direction of a linear association between two variables, producing a single coefficient (r) that ranges from -1 to +1. Regression goes further: it estimates a predictive equation, assigns a specific coefficient to each predictor, and accommodates multiple predictors simultaneously. Correlation is symmetric (it treats the two variables interchangeably) and cannot adjust for confounding variables. Regression requires the analyst to designate one variable as the outcome and can adjust for measured confounders by including them as covariates, although designating an outcome does not by itself establish the direction of influence.
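The two quantities are closely linked in the single-predictor case: the OLS slope equals r multiplied by the ratio of the standard deviations of Y and X. The sketch below checks this identity on simulated data; the data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 3.0 + 0.6 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]                   # Pearson correlation
slope, intercept = np.polyfit(x, y, deg=1)    # simple linear regression by least squares

# Slope implied by the correlation and the two standard deviations
slope_from_r = r * y.std(ddof=1) / x.std(ddof=1)

print(round(slope, 4), round(slope_from_r, 4))  # the two values agree
```

Unlike r, the slope carries the units of Y per unit of X, which is why regression output can be read as a predictive equation.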

Regression and Causation: An Important Distinction

A regression coefficient tells a researcher how much the average value of the outcome changes per unit change in the predictor, within the observed data. It does not, by itself, establish that the predictor caused the change. Causal inference additionally requires: a plausible biological or theoretical mechanism, temporal ordering (the cause precedes the effect), elimination of confounding, and ideally a design feature such as randomization, an instrumental variable, or a natural experiment. Observational regression studies in epidemiology and social science must interpret coefficients as associations unless these additional conditions are met.

What Assumptions Must Regression Models Satisfy?

All regression models rest on a set of statistical assumptions. When these are violated, coefficient estimates may be biased, standard errors may be incorrect, and p-values and confidence intervals will be unreliable. The specific assumptions vary by model type, but the following apply to ordinary least squares (OLS) linear regression and inform many related approaches.

Assumption	Description	Consequence of Violation	Diagnostic Test
Linearity	The relationship between each predictor and the outcome is linear.	Biased and inefficient estimates.	Residual-vs-fitted plots; component-plus-residual plots.
Independence of errors	Residuals are not correlated with each other (especially across time or space).	Underestimated standard errors; inflated Type I error.	Durbin-Watson test; inspection of residual autocorrelation.
Homoscedasticity	The variance of residuals is constant across all predictor values.	Inefficient estimates; unreliable inference.	Breusch-Pagan test; scale-location plot.
Normality of residuals	Residuals are approximately normally distributed.	Unreliable p-values and CIs in small samples.	Q-Q plot; Shapiro-Wilk test (small samples).
No multicollinearity	Predictors are not highly correlated with each other.	Inflated standard errors; unstable coefficients.	Variance Inflation Factor (VIF); values above 5 to 10 signal a problem.
No influential outliers	No single observation exerts undue influence on the regression line.	Distorted coefficients.	Cook’s distance; leverage statistics (hat values).
Correct model specification	The correct predictors are included; no relevant variables are omitted.	Omitted variable bias.	Theory-driven model building; residual inspection.
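As an illustration of one diagnostic from the table, the VIF for a predictor can be computed by regressing it on the remaining predictors and taking 1 / (1 - R-squared). The helper name `vif` and the simulated predictors below are hypothetical; this is a sketch, not a substitute for a statistics package.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on all other predictors plus an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly a copy of a, so a and b are collinear
c = rng.normal(size=300)                  # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

In this setup the first two predictors should show VIFs far above the 5 to 10 range, while the third stays near 1.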

What Should Researchers Do When Assumptions Are Violated?

Several remedies are available depending on which assumption is violated:

Non-linearity: add polynomial terms, use splines, or switch to a nonlinear model.
Heteroscedasticity: use robust standard errors (e.g., Huber-White sandwich estimator) or apply weighted least squares.
Non-normality of residuals: transform the outcome variable (e.g., log transformation) or use a generalized linear model with an appropriate distribution.
Multicollinearity: remove one of the correlated predictors, combine them into a composite score, or use Ridge regression. Stepwise regression is not an appropriate remedy for multicollinearity.
Autocorrelation: use time-series models (ARIMA), add lagged variables, or use clustered standard errors for grouped data.
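The heteroscedasticity remedy in the list above can be sketched directly. The example below computes classical OLS standard errors and the simplest Huber-White sandwich variant (often labelled HC0) on simulated data whose error variance grows with the predictor. The data-generating values are assumptions for illustration, and statistical packages typically offer small-sample refinements of this estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(1, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=x, size=n)   # error sd grows with x

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical standard errors assume one common residual variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich: (X'X)^-1 X' diag(e^2) X (X'X)^-1
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical:", np.round(se_classical, 3))
print("robust:   ", np.round(se_robust, 3))
```

The coefficient estimates are unchanged by this remedy; only the standard errors, and therefore the p-values and confidence intervals, are affected.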

Types of Regression Analysis

The appropriate regression model is determined primarily by the nature of the dependent variable (continuous, binary, count, ordered, censored) and the relationship structure (linear or nonlinear). The following sections detail the major types, with particular attention to biomedical and social science applications.

Simple Linear Regression

Simple linear regression models the relationship between one continuous dependent variable and one continuous or binary independent variable. It fits a straight line through the data by the method of ordinary least squares, which minimizes the sum of squared differences between observed and predicted values.

The model equation is: Y = a + bX + e

The coefficient b (slope) represents the average change in Y for each one-unit increase in X. The intercept a represents the predicted value of Y when X equals zero (which may or may not be a meaningful value, depending on the scale of X).

Feature	Details
Number of predictors	1
Outcome variable type	Continuous (e.g., blood pressure, test scores, income)
Estimation method	Ordinary Least Squares (OLS)
Key output	Slope (b), intercept (a), R-squared, p-value for slope
Biomedical example	Predicting systolic blood pressure from body mass index (BMI)
Social science example	Estimating income as a function of years of education
Limitations	Only one predictor; does not control for confounding

Multiple Linear Regression

Multiple linear regression extends the simple model to include two or more predictors simultaneously. Each coefficient represents the relationship between one predictor and the outcome, holding all other predictors constant. This ‘controlling for’ property is what makes multiple regression the workhorse of observational research: it allows researchers to estimate the independent contribution of each variable while accounting for the influence of others.

The equation becomes: Y = a + b1X1 + b2X2 + … + bkXk + e

Important considerations in multiple regression include:

Variable selection: should be theory-driven rather than data-driven (e.g., avoiding stepwise selection, which capitalizes on chance and inflates Type I error).
Sample size: a widely used rule of thumb requires at least 10 to 20 observations per predictor, though simulation studies suggest this varies by effect size and R-squared.
Adjusted R-squared: should be reported in preference to R-squared in multiple regression because it adjusts for the number of predictors.
Interaction terms: can be added to test whether the effect of one predictor varies depending on the value of another (effect modification).

Feature	Details
Number of predictors	2 or more
Outcome variable type	Continuous
Estimation method	OLS
Key output	Coefficients with CIs, adjusted R-squared, F-statistic, AIC/BIC
Biomedical example	Predicting HbA1c from age, BMI, duration of diabetes, and physical activity level
Social science example	Modeling household income from education, race, region, and parental socioeconomic status
Limitations	Assumes linearity; sensitive to outliers and multicollinearity; requires larger samples with more predictors
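The reason adjusted R-squared is preferred in multiple regression can be shown with a short sketch: adding a predictor that is pure noise cannot lower R-squared, whereas the adjusted version applies a penalty for the extra term. The function name `r2_and_adjusted` and the simulated data are hypothetical.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R-squared, adjusted R-squared) for an OLS fit; X must include the intercept column."""
    n, p = X.shape                                   # p counts the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)       # same as (n - 1) / (n - k - 1) with k predictors
    return r2, adj

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
noise_predictor = rng.normal(size=n)                 # unrelated to the outcome
y = 1.0 + 0.5 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_predictor])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)
print("R2:      ", round(r2_small, 4), "->", round(r2_big, 4))
print("adjusted:", round(adj_small, 4), "->", round(adj_big, 4))
```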

Logistic Regression

Logistic regression is used when the dependent variable is binary (two categories: e.g., disease present/absent, survived/died, voted/did not vote). Rather than predicting the value of Y directly, it predicts the log odds (logit) of Y occurring, which is then converted to a probability via the logistic function. Coefficients are commonly exponentiated to produce odds ratios (ORs), which are more interpretable.

It is one of the most widely used methods in epidemiology and clinical medicine for studying disease risk factors and treatment outcomes.

Key types of logistic regression include:

Binary logistic regression: outcome has two categories (diseased vs. not diseased).
Multinomial logistic regression: outcome has three or more unordered categories (e.g., type of drug chosen, diagnosis category).
Ordinal logistic regression (proportional odds model): outcome has three or more ordered categories (e.g., pain severity: none, mild, moderate, severe).

Feature	Details
Outcome variable type	Binary (or multinomial/ordinal)
Estimation method	Maximum Likelihood Estimation (MLE)
Key output	Odds ratios with 95% CIs, Wald p-values, Nagelkerke/McFadden pseudo R-squared, Hosmer-Lemeshow test
Biomedical example	Estimating odds of myocardial infarction by smoking status, adjusted for age, sex, and cholesterol
Social science example	Predicting probability of voting (yes/no) from age, education, and political affiliation
Limitations	Coefficients are on the log-odds scale and estimate odds ratios, not risks directly; odds ratios lie further from 1 than risk ratios when the outcome is common; generally requires larger samples than linear regression
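The estimation steps described in this section (maximum likelihood on the log-odds scale, then exponentiation to odds ratios) can be sketched with a small Newton-Raphson routine. The function `fit_logistic`, the simulated smoking and age variables, and the true coefficient values are illustrative assumptions; in practice a statistics package would be used.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood logistic regression via Newton-Raphson; X includes the intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))           # predicted probabilities
        score = X.T @ (y - p)                           # gradient of the log-likelihood
        info = X.T @ (X * (p * (1.0 - p))[:, None])     # information matrix
        beta = beta + np.linalg.solve(info, score)
    # Standard errors from the information matrix at the final estimate
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))

rng = np.random.default_rng(6)
n = 5000
smoker = rng.integers(0, 2, size=n).astype(float)       # hypothetical binary exposure
age_z = rng.normal(size=n)                              # hypothetical standardized age
log_odds = -1.0 + 0.7 * smoker + 0.5 * age_z            # true log-odds model
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(float)

X = np.column_stack([np.ones(n), smoker, age_z])
beta, se = fit_logistic(X, y)

odds_ratios = np.exp(beta)                              # exponentiated coefficients
ci_low = np.exp(beta - 1.96 * se)                       # Wald 95% confidence limits
ci_high = np.exp(beta + 1.96 * se)
print("OR for smoking:", round(odds_ratios[1], 2), (round(ci_low[1], 2), round(ci_high[1], 2)))
```

The confidence interval is built on the log-odds scale and then exponentiated, which is why intervals for odds ratios are asymmetric around the point estimate.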

Poisson Regression

Poisson regression is appropriate when the outcome is a count variable representing the number of times an event occurs in a fixed period of time or space (e.g., number of emergency department visits per year, number of new cancer cases in a district). It assumes that the outcome follows a Poisson distribution and uses a log link function. Coefficients are exponentiated to produce incidence rate ratios (IRRs).

When count data are overdispersed (variance substantially exceeds the mean, which is common with health data), negative binomial regression is often preferred. Zero-inflated Poisson or zero-inflated negative binomial models are used when there are more zero counts than the Poisson distribution predicts.

Feature	Details
Outcome variable type	Count (non-negative integers)
Estimation method	MLE with log link
Key output	Incidence rate ratios (IRRs) with 95% CIs
Biomedical example	Modeling the number of asthma attacks per year as a function of air quality index, age, and medication adherence
Social science example	Counting the number of crimes per neighborhood as a function of poverty rate, police density, and unemployment
Common variant	Negative binomial regression (overdispersed counts); zero-inflated models (excess zeros)
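A minimal sketch of Poisson regression with a log link follows, together with the Pearson dispersion statistic that is commonly used to screen for overdispersion (values well above 1 suggest a negative binomial model may fit better). The function `fit_poisson` and the simulated air-quality predictor are hypothetical.

```python
import numpy as np

def fit_poisson(X, y, n_iter=25):
    """Poisson regression with a log link via Newton-Raphson; X includes the intercept column."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                  # start from the intercept-only fit
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                   # expected counts
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]), X.T @ (y - mu))
    return beta

rng = np.random.default_rng(7)
n = 3000
aqi_z = rng.normal(size=n)                      # hypothetical standardized air quality index
y = rng.poisson(np.exp(0.5 + 0.3 * aqi_z)).astype(float)

X = np.column_stack([np.ones(n), aqi_z])
beta = fit_poisson(X, y)
irr = np.exp(beta[1])                           # incidence rate ratio per one-unit increase

# Pearson dispersion: close to 1 when the Poisson variance assumption holds
mu_hat = np.exp(X @ beta)
dispersion = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])
print("IRR:", round(irr, 3), "dispersion:", round(dispersion, 3))
```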

Cox Proportional Hazards Regression

Cox proportional hazards regression (also called Cox regression or survival regression) is the standard method for analyzing time-to-event data in clinical and epidemiological research. It models the hazard function: the instantaneous risk of experiencing the event (e.g., death, disease relapse, treatment failure) at any given time, given that the event has not yet occurred. It handles censored data, where some subjects do not experience the event by the end of the study.

The proportional hazards assumption states that the ratio of hazard rates between any two groups remains constant over time. This must be verified, typically using Schoenfeld residual plots or formal tests.

Feature	Details
Outcome variable type	Time to event with censoring
Estimation method	Partial likelihood (semi-parametric; no distributional assumption for baseline hazard)
Key output	Hazard ratios (HRs) with 95% CIs and p-values
Biomedical example	Estimating the effect of chemotherapy regimen on time to cancer recurrence, adjusting for stage, age, and comorbidities
Social science example	Modeling time until re-arrest following prison release as a function of education, employment, and social support
Key assumption	Proportional hazards: the HR is constant over time
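To show what partial likelihood estimation involves, the sketch below fits a Cox model by Newton-Raphson on simulated data with right censoring. It assumes no tied event times (true for continuous simulated times) and omits the refinements found in survival packages; `fit_cox` and all simulated values are hypothetical.

```python
import numpy as np

def fit_cox(time, event, X, n_iter=20):
    """Cox partial likelihood via Newton-Raphson, assuming no tied event times."""
    order = np.argsort(time)                      # sort so each risk set is "this row and all later rows"
    event, X = event[order], X[order]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        # Reverse cumulative sums give totals over each risk set
        s0 = np.cumsum(w[::-1])[::-1]
        s1 = np.cumsum((w[:, None] * X)[::-1], axis=0)[::-1]
        s2 = np.cumsum((w[:, None, None] * X[:, :, None] * X[:, None, :])[::-1], axis=0)[::-1]
        xbar = s1 / s0[:, None]                   # risk-set weighted mean of covariates
        grad = np.sum(event[:, None] * (X - xbar), axis=0)
        info = np.sum(event[:, None, None]
                      * (s2 / s0[:, None, None] - xbar[:, :, None] * xbar[:, None, :]), axis=0)
        beta = beta + np.linalg.solve(info, grad)
    # info is from the last iteration, which is effectively at the estimate once converged
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))

rng = np.random.default_rng(8)
n = 2000
treated = rng.integers(0, 2, size=n).astype(float)
true_log_hr = -0.5                                # treatment lowers the hazard
event_time = rng.exponential(scale=1.0 / (0.1 * np.exp(true_log_hr * treated)))
censor_time = rng.uniform(5, 30, size=n)          # independent censoring times
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(float) # 1 = event observed, 0 = censored

beta, se = fit_cox(time, event, treated[:, None])
hr = np.exp(beta[0])
print("hazard ratio:", round(hr, 2))
```

Only the ordering of event times enters the calculation, which is why no baseline hazard has to be specified.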

Polynomial and Nonlinear Regression

When the relationship between predictor and outcome is curved rather than linear, polynomial regression adds higher-order terms (X-squared, X-cubed) of the predictor to capture the nonlinear shape while remaining linear in its parameters (and thus still estimable by OLS). Nonlinear regression, by contrast, uses models that are intrinsically nonlinear in their parameters (e.g., exponential growth or dose-response curves) and requires iterative numerical estimation.

In pharmacology and toxicology, the dose-response relationship is often sigmoidal; in epidemiology, the relationship between body weight and mortality risk is often U-shaped. Both scenarios benefit from nonlinear or polynomial approaches.

Type	When to Use	Example
Polynomial	Curved relationship; can still use OLS	Modeling the U-shaped relationship between alcohol consumption and cardiovascular risk
Nonlinear	Intrinsically nonlinear model required; uses iterative estimation	Fitting a dose-response curve for a drug (e.g., Hill equation)
Spline regression	Flexible nonlinear curves without specifying function shape in advance	Modeling the effect of age on disease prevalence without assuming a specific functional form
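Because polynomial regression is linear in its parameters, a quadratic fit needs nothing beyond least squares. The sketch below fits a U-shaped curve to simulated data and recovers its turning point; the data and the location of the minimum are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 6, size=300)
y = 4.0 + (x - 3.0) ** 2 + rng.normal(scale=0.5, size=300)   # U-shape with its minimum at x = 3

# Fit y = c2*x^2 + c1*x + c0 by least squares (coefficients returned highest power first)
c2, c1, c0 = np.polyfit(x, y, deg=2)

# A positive c2 indicates a U-shape; the turning point is where the derivative is zero
turning_point = -c1 / (2.0 * c2)
print("quadratic term:", round(c2, 2), "turning point:", round(turning_point, 2))
```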

Ridge, Lasso, and Elastic Net Regression

These regularized regression methods add a penalty term to the estimation criterion to shrink coefficient estimates toward zero, reducing overfitting and improving prediction in models with many predictors (high-dimensional data). They are especially relevant in genomics, proteomics, and large-scale social surveys.

Ridge regression (L2 penalty): shrinks all coefficients but keeps all predictors in the model. Useful when many predictors each contribute a small amount.
Lasso regression (L1 penalty): can shrink some coefficients exactly to zero, effectively performing variable selection. Useful when only a subset of predictors are expected to be relevant.
Elastic Net: combines L1 and L2 penalties, offering a middle ground. Useful when predictors are correlated with each other.

The penalty strength is controlled by a tuning parameter (lambda) that is typically selected by cross-validation.
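The following sketch illustrates ridge regression only (Lasso and Elastic Net need an iterative solver): it uses the closed-form ridge solution and chooses lambda from a small grid by 5-fold cross-validation. The function names, the grid, and the simulated data are hypothetical, and for brevity the predictors are standardized once on the full data rather than within each fold.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients for standardized X and centered y (no intercept)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Mean squared prediction error of ridge regression under k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for test_idx in np.array_split(idx, n_folds):
        train_idx = np.setdiff1d(idx, test_idx)
        beta = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(10)
n, k = 60, 40                                   # many predictors relative to observations
X = rng.normal(size=(n, k))
true_beta = np.zeros(k)
true_beta[:5] = 1.0                             # only the first five predictors matter
y = X @ true_beta + rng.normal(size=n)

X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize so the penalty treats predictors equally
y = y - y.mean()

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lambda = min(scores, key=scores.get)       # lambda with the lowest cross-validated error
print("best lambda:", best_lambda)
```

Larger values of lambda pull the whole coefficient vector closer to zero, trading a little bias for lower variance.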

Other Regression Methods Used in Research

Method	Dependent Variable	Primary Use Case
Quantile regression	Continuous; models conditional quantiles (e.g., median)	When the effect of predictors varies across the distribution of outcomes; robust to outliers
Tobit regression	Censored continuous variable (observed only above/below a threshold)	Modeling expenditure, wages, or health utilization data with floor or ceiling effects
Negative binomial regression	Overdispersed count data	Modeling count outcomes with variance exceeding the mean (very common in health data)
Hierarchical / multilevel regression	Continuous or categorical outcome in nested data	Students within schools; patients within hospitals; observations within individuals over time
Structural equation modeling (SEM)	Latent or observed outcome with mediation/indirect paths	Testing path models in psychology, education, and sociology; measuring unobserved constructs
Instrumental variable regression	Continuous outcome; endogenous predictor	Addressing unmeasured confounding using an instrument; popular in health economics and epidemiology
Panel data regression (fixed/random effects)	Outcome measured repeatedly on same subjects	Longitudinal studies in economics, sociology, and clinical research

Regression Analysis in Biomedical Research

Biomedical research relies heavily on regression because clinical phenomena involve multiple co-occurring risk factors, confounders, and effect modifiers. Regression allows researchers to isolate the independent contribution of each factor to an outcome while controlling for the others.

How Is Regression Used in Epidemiology?

In epidemiology, regression is used to estimate the association between exposures and disease outcomes while controlling for confounders. The appropriate model depends on the study design and outcome type:

Study Type	Typical Outcome	Preferred Regression Model
Cross-sectional	Disease prevalence (binary)	Logistic regression; Poisson regression with robust variance for prevalence ratios
Case-control	Disease status (binary)	Conditional or unconditional logistic regression
Cohort (fixed follow-up)	Disease incidence (binary)	Log-binomial regression (for risk ratios) or logistic regression (for ORs)
Cohort (variable follow-up)	Time to event	Cox proportional hazards regression
Ecological	Population-level rates or means	Poisson regression; linear regression with aggregated data

A critical issue in epidemiological regression is confounding: a third variable that is associated with both the exposure and the outcome and that may distort the apparent association. Regression addresses confounding by including confounders as covariates in the model, but only measured confounders can be controlled for in this way. Unmeasured confounding remains a fundamental threat to validity in observational studies.
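A small simulation makes the mechanics of confounding visible. In the sketch below the exposure has no true effect on the outcome, yet the crude model shows a clear association that disappears once the confounder is added as a covariate. A continuous outcome and linear regression are used for simplicity; all variables are simulated assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
confounder = rng.normal(size=n)
exposure = 0.8 * confounder + rng.normal(size=n)   # confounder influences the exposure
outcome = 1.0 * confounder + rng.normal(size=n)    # confounder influences the outcome; exposure does not

ones = np.ones(n)

# Crude model: outcome regressed on the exposure alone
crude = np.linalg.lstsq(np.column_stack([ones, exposure]), outcome, rcond=None)[0][1]

# Adjusted model: the confounder is included as a covariate
adjusted = np.linalg.lstsq(np.column_stack([ones, exposure, confounder]), outcome, rcond=None)[0][1]

print("crude:", round(crude, 2), "adjusted:", round(adjusted, 2))
```

The adjustment works here only because the confounder was measured; had it been omitted from the dataset, the crude estimate would be all that is available.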

Logistic Regression in Clinical Medicine

Logistic regression is ubiquitous in clinical research for identifying risk factors for disease, predicting patient outcomes, and developing clinical prediction scores. Examples include:

Cardiovascular risk models: predicting 10-year risk of cardiovascular events from age, sex, blood pressure, cholesterol, smoking status, and diabetes (e.g., Framingham Risk Score derivation).
Diagnostic accuracy: estimating the probability of a diagnosis based on clinical signs, symptoms, and test results.
Treatment response prediction: identifying patient characteristics predictive of response to a specific therapy.
Surgical risk stratification: estimating perioperative mortality risk from patient comorbidities and procedure type.

When reporting logistic regression in clinical journals, researchers should report: crude and adjusted odds ratios, 95% confidence intervals, the number of events per variable, goodness-of-fit statistics (e.g., Hosmer-Lemeshow test), and the area under the ROC curve (c-statistic) as a measure of model discrimination.
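The c-statistic has a direct interpretation that a short sketch can show: it is the proportion of (event, non-event) pairs in which the patient with the event received the higher predicted probability, with ties counted as one half. The function name `c_statistic` and the four-patient example are hypothetical.

```python
import numpy as np

def c_statistic(y_true, y_score):
    """Area under the ROC curve computed from all event / non-event pairs."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]                  # predicted probabilities for patients with the event
    neg = y_score[y_true == 0]                  # predicted probabilities for patients without it
    diff = pos[:, None] - neg[None, :]          # every event / non-event pair
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Four hypothetical patients: two without the event, two with it
auc = c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four pairs are ranked correctly
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation.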

Survival Analysis in Oncology and Chronic Disease Research

Cox proportional hazards regression is the predominant tool in clinical trials and observational studies involving time-to-event endpoints such as overall survival, disease-free survival, or time to hospitalization. Its advantages include:

It does not require specification of the baseline hazard function (semi-parametric), which increases robustness.
It can accommodate censored observations, which are universal in longitudinal studies.
It can include time-varying covariates for exposures or conditions that change during follow-up.

Accelerated failure time (AFT) models are a parametric alternative to Cox regression that directly model the log of survival time; they are often considered when the proportional hazards assumption is violated.

Regression in Pharmacology and Drug Development

In pharmacological research, regression models serve several specialized purposes:

Dose-response modeling: nonlinear regression (often using the four-parameter logistic model) fits sigmoidal dose-response curves to determine the EC50 (the concentration producing 50% of the maximal response) and the Hill coefficient.
Pharmacokinetic modeling: nonlinear mixed-effects models (fitted with software such as NONMEM) characterize drug absorption, distribution, metabolism, and elimination across individuals.
Drug interaction studies: regression with interaction terms tests whether the combined effect of two drugs differs from what would be expected from their individual effects.

Genome-Wide Association Studies and High-Dimensional Data

Modern genomic research involves testing the association between each of millions of genetic variants (single nucleotide polymorphisms, SNPs) and a phenotype. Linear regression (for continuous traits) or logistic regression (for binary traits) is used for each variant separately. Challenges unique to this setting include:

The need for multiple testing correction (e.g., Bonferroni correction or false discovery rate control) to limit false positives.
Population stratification (genetic ancestry differences between cases and controls) as a confounder; addressed by including principal components of genetic data as covariates.
High dimensionality: Lasso and Ridge regression are used for polygenic score construction across many variants simultaneously.
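The two multiple testing corrections named above differ in how strict they are, which the sketch below shows on ten made-up p-values. The function name `benjamini_hochberg` and the p-values are hypothetical; real genome-wide studies involve millions of tests and typically rely on dedicated software.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected while controlling the false discovery rate at q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m      # rank-specific thresholds q * i / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])     # largest rank that meets its threshold
        reject[order[: cutoff + 1]] = True        # reject that hypothesis and all with smaller p
    return reject

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216])

bonferroni = pvals <= 0.05 / pvals.size           # family-wise error control
fdr = benjamini_hochberg(pvals, q=0.05)           # false discovery rate control

print("Bonferroni rejections:", int(bonferroni.sum()))  # 1
print("BH rejections:", int(fdr.sum()))                 # 2
```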

Regression Analysis in the Social Sciences

The social sciences rely on regression to study complex phenomena where many social, economic, cultural, and structural forces operate simultaneously. Because randomized experiments are usually impractical or unethical in social contexts, observational regression with careful control for confounders is often the primary analytic tool.

What Makes Regression Central to Social Science Research?

Social science outcomes (income, educational attainment, health behavior, voting, crime, and wellbeing) are almost always the product of multiple interacting factors. Regression allows researchers to decompose the contribution of each factor independently. Key applications include:

Economics: earnings functions estimate the wage returns to education, experience, and demographic characteristics; regression underpins national income accounting, labor market analysis, and policy evaluation.
Sociology: regression models examine how social stratification, race, gender, and class shape life outcomes from educational achievement to health and mortality.
Political science: regression analyzes the predictors of voting behavior, policy preferences, political participation, and electoral outcomes.
Psychology: multiple regression is foundational to testing mediating and moderating relationships between psychological constructs; structural equation modeling extends this to latent variable frameworks.
Public policy research: regression-based program evaluation estimates the causal impact of interventions (e.g., job training programs, educational subsidies, housing policies) on target outcomes.

Hierarchical (Multilevel) Regression in Social Research

Social data are inherently nested: students within classrooms within schools; individuals within households within neighborhoods; employees within firms. Standard regression treats all observations as independent, an assumption that nested data violate, which leads to underestimated standard errors. Multilevel regression (also called hierarchical linear modeling, HLM, or mixed-effects modeling) accounts for this clustering by modeling both within-group and between-group variation.

Common applications in social science include:

School effects research: separating student-level and school-level predictors of academic achievement.
Neighborhood effects: estimating the independent effect of neighborhood poverty on individual health outcomes after controlling for individual socioeconomic status.
Cross-national comparative studies: comparing the effects of social policies across countries while accounting for country-level clustering.
Longitudinal panel data: modeling change over time within individuals, controlling for stable unmeasured individual characteristics.

Mediation and Moderation Analysis

In social and behavioral research, understanding the pathway through which a predictor affects an outcome (mediation) or the conditions under which an effect varies (moderation) is often as important as the main effect itself.

Mediation: a mediating variable lies on the causal pathway between the predictor and the outcome (e.g., socioeconomic status affects health through access to nutritious food and healthcare). The Baron and Kenny approach uses a series of regression equations; bootstrapped indirect effects using the PROCESS macro (Hayes) have become the modern standard.
Moderation: a moderating variable changes the strength or direction of the predictor-outcome relationship (e.g., the effect of education on income may differ by sex or race). Moderation is tested by including an interaction term (predictor × moderator) in the regression model.
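
The bootstrapped indirect effect mentioned above can be sketched in a few lines. This is a minimal illustration, not the PROCESS macro itself: it assumes NumPy is available, uses simulated data with made-up path values (a = 0.5, b = 0.8), and the helper names ols and indirect_effect are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for y ~ X (X must already contain an intercept column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b of the effect of x on y through m."""
    ones = np.ones_like(x)
    a = ols(np.column_stack([ones, x]), m)[1]       # path a: predictor -> mediator
    b = ols(np.column_stack([ones, x, m]), y)[2]    # path b: mediator -> outcome, adjusting for x
    return a * b

# Simulated data (hypothetical values): true indirect effect = 0.5 * 0.8 = 0.4
rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.8 * m + 0.3 * x + rng.normal(size=n)

estimate = indirect_effect(x, m, y)

# Percentile bootstrap: resample rows with replacement and re-estimate a*b each time
boot = np.empty(2000)
for i in range(boot.size):
    idx = rng.integers(0, n, size=n)
    boot[i] = indirect_effect(x[idx], m[idx], y[idx])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"indirect effect = {estimate:.3f}, 95% bootstrap CI = ({ci_low:.3f}, {ci_high:.3f})")
```

A bootstrap interval that excludes zero is usually read as evidence of mediation, provided the causal ordering of predictor, mediator, and outcome is defensible on substantive grounds.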

Quasi-Experimental Methods in Social Science

When randomization is not possible, researchers use quasi-experimental designs that leverage regression to approximate causal inference:

Regression discontinuity design (RDD): exploits sharp cutoffs in program eligibility (e.g., test score thresholds for scholarships) to estimate treatment effects by comparing outcomes just above and below the cutoff.
Difference-in-differences (DiD): compares the change in outcomes over time between a group exposed to a policy change and a comparison group not exposed; requires the parallel trends assumption.
Instrumental variable (IV) regression: uses a third variable (the instrument) that affects the predictor of interest but influences the outcome only through that predictor, to isolate exogenous variation and estimate causal effects.
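
As a small worked example of the difference-in-differences logic, the sketch below computes the estimate from group means. The outcome values are invented for illustration only.

```python
from statistics import mean

# Hypothetical outcomes before and after a policy change
treated_before = [20.0, 22.0, 21.0, 23.0]
treated_after = [27.0, 29.0, 28.0, 30.0]
control_before = [18.0, 19.0, 20.0, 19.0]
control_after = [21.0, 22.0, 23.0, 22.0]

# Change over time within each group
change_treated = mean(treated_after) - mean(treated_before)
change_control = mean(control_after) - mean(control_before)

# DiD: the extra change in the treated group beyond the comparison group's change
did_estimate = change_treated - change_control

print(change_treated, change_control, did_estimate)
```

The same number is the coefficient on the treated-by-period interaction in a regression of the outcome on a treatment-group indicator, a post-period indicator, and their product. The regression form is what allows covariates and standard errors to be added.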

Common Pitfalls in Social Science Regression

Pitfall	Description	Remedy
Omitted variable bias	A relevant confounder is left out of the model, biasing the coefficient of interest.	Include all theoretically relevant confounders; use fixed effects or IV methods for unmeasured confounding.
Reverse causality	The direction of causation is unclear (the outcome may cause the predictor, not vice versa).	Use temporal ordering; longitudinal data with lagged predictors; IV methods.
Ecological fallacy	Inferences from group-level regression are incorrectly applied to individuals.	Use individual-level data where possible; acknowledge level of analysis in interpretation.
Overfitting due to exploratory model building	Testing many models and reporting only those with significant results (p-hacking).	Pre-register hypotheses; report all models tested; apply corrections for multiple testing.
Misinterpretation of standardized vs. unstandardized coefficients	Comparing standardized coefficients across groups or datasets without accounting for different variances.	Report both; interpret unstandardized coefficients in context of variable units.

What Do Regression Outputs Mean, and How Should They Be Reported?

Regression outputs should be interpreted carefully and reported completely. The key statistics and their meanings are described below.

Interpreting Regression Coefficients

The regression coefficient (b) for a predictor represents the expected change in the outcome for a one-unit increase in that predictor, holding all other predictors in the model constant. Several important points:

For continuous predictors, a one-unit increase means an increase of whatever unit the predictor is measured in (e.g., 1 year of age, 1 mmHg of blood pressure, or 1 standard deviation of income if income has been standardized).
For binary predictors (e.g., male vs. female, treated vs. untreated), the coefficient is the estimated difference in the mean outcome between the two groups.
For categorical predictors with more than two categories, dummy variables are created and each coefficient represents the difference from the reference category.
Standardized coefficients (beta weights) allow comparison of the relative importance of predictors measured on different scales, but should not be used to make cross-group or cross-study comparisons.
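
Dummy coding against a reference category can be illustrated as follows. This is a minimal sketch with invented data; dummy_code and group_mean are hypothetical helpers, and statistical packages normally do this step automatically once a variable is declared categorical.

```python
from statistics import mean

def dummy_code(values, reference):
    """Create one 0/1 indicator column per category, omitting the reference category."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical data: a three-level predictor and a continuous outcome
severity = ["mild", "severe", "moderate", "mild", "severe"]
outcome = [2.0, 9.0, 5.0, 4.0, 7.0]

dummies = dummy_code(severity, reference="mild")

def group_mean(level):
    return mean([y for s, y in zip(severity, outcome) if s == level])

# In a model containing only these dummies, each OLS coefficient equals
# the category mean minus the reference-category mean.
diff_moderate = group_mean("moderate") - group_mean("mild")
diff_severe = group_mean("severe") - group_mean("mild")

print(dummies)
print(diff_moderate, diff_severe)
```

Changing the reference category changes every coefficient's meaning but not the model's fit, so the reference should be stated whenever results are reported.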

R-Squared and Model Fit

R-squared (R²) represents the proportion of variance in the outcome explained by all predictors together. It ranges from 0 (no explanatory power) to 1 (perfect fit). Important caveats:

R-squared never decreases when a predictor is added, even if that predictor is not truly related to the outcome. Adjusted R-squared corrects for this by penalizing additional predictors.
A high R-squared does not mean the model is correctly specified or that the coefficients are unbiased; it simply reflects how well the model fits the data in hand.
In biomedical and social science research, R-squared values of 0.10 to 0.30 are often considered meaningful given the complexity and noise in human data.
For logistic and other generalized linear models, pseudo R-squared measures (Nagelkerke, McFadden, Cox and Snell) approximate R-squared but are not directly comparable to OLS R-squared.
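
The penalty applied by adjusted R-squared can be seen with the usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p the number of predictors. The numbers below are hypothetical.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for a linear model with n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R-squared and sample size, different numbers of predictors
few_predictors = adjusted_r_squared(r2=0.30, n=50, p=2)
many_predictors = adjusted_r_squared(r2=0.30, n=50, p=10)

print(round(few_predictors, 3), round(many_predictors, 3))
```

With the same raw R-squared of 0.30, the ten-predictor model is credited with far less explained variance than the two-predictor model.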

P-Values and Confidence Intervals in Regression

Each regression coefficient is tested against the null hypothesis that the true coefficient equals zero (no association). The t-statistic (or z-statistic for logistic regression) and associated p-value test this. Important considerations for researchers:

A statistically significant p-value (typically < 0.05) means that data at least as extreme as those observed would be unlikely if the null hypothesis of no association were true; it does not indicate practical or clinical importance.
Confidence intervals are more informative than p-values alone: they convey both the magnitude of the association and the precision of the estimate.
Multiple testing: when many predictors are tested simultaneously, the chance of at least one false positive increases substantially. Correction methods (Bonferroni, Benjamini-Hochberg FDR) should be considered when testing large numbers of predictors exploratorily.
Interpretation is not mechanical: a qualitative study of public health researchers (Richards et al., 2024) reports that regression coefficients are not treated as mechanically objective values; context, study design, data quality, and prior evidence all inform interpretation.
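
The two correction methods named above can be compared on a small set of p-values. This is a sketch with invented p-values; benjamini_hochberg is a hypothetical helper implementing the standard step-up procedure.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value satisfies p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below that k
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected

# Hypothetical p-values for eight predictors tested exploratorily
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]

bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
bh = benjamini_hochberg(p_values)

print(sum(bonferroni), sum(bh))
```

Bonferroni controls the chance of any false positive and is therefore stricter; Benjamini-Hochberg controls the expected share of false positives among the rejected hypotheses and typically retains more findings.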

Recommended Reporting Checklist for Regression Studies

Element	Details to Report
Sample description	Sample size, number of events (for binary outcomes), missing data handling
Model specification	All predictors included; rationale for model building strategy; interaction terms if any
Coefficients	Unstandardized b (and standardized beta if applicable) with 95% confidence intervals for each predictor
Statistical significance	p-value for each coefficient; overall F-test or chi-squared test for model significance
Model fit	R-squared and adjusted R-squared (linear); pseudo R-squared and c-statistic (logistic); AIC and BIC
Assumption checks	Results of key diagnostics (VIF for multicollinearity, residual plots, goodness-of-fit tests)
Sensitivity analyses	Results under alternative model specifications or handling of missing data
Limitations	Key threats to validity: unmeasured confounding, selection bias, measurement error

Overfitting: What It Is and How to Avoid It

Overfitting occurs when a regression model fits the sample data too closely, capturing random noise as if it were a true signal. The result is a model that appears to perform very well on the data used to build it but generalizes poorly to new observations. Overfitting is most common when many predictors are included relative to the number of observations (low events-per-variable ratio).

Indicators of overfitting include:

Large discrepancy between R-squared and adjusted R-squared.
Model performs substantially worse on a holdout dataset than on the training data.
Coefficients are implausibly large or have unexpected signs inconsistent with prior knowledge.
Confidence intervals are very wide, reflecting high uncertainty in estimates.

Strategies to prevent overfitting in research:

Limit the number of predictors: use a priori theory and prior literature to select predictors rather than testing all available variables.
Cross-validation: divide the data into training and validation sets (or use k-fold cross-validation) so that performance is estimated on observations not used for fitting, which gives a less optimistic estimate than in-sample fit.
Regularization: use Ridge, Lasso, or Elastic Net regression when the predictor-to-sample ratio is high.
Internal validation with bootstrap: in clinical prediction model development, bootstrapping with optimism correction provides internally validated performance estimates.
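
A minimal k-fold cross-validation sketch, assuming NumPy is available, shows why in-sample fit is a poor guide. The data are simulated from a straight line, and a simple model (degree 1) is compared with a needlessly flexible one (degree 8). The function names and settings are illustrative choices, not a prescribed procedure.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean squared prediction error of a polynomial fit under k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the training folds only
        pred = np.polyval(coefs, x[held_out])            # predict the held-out fold
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return float(np.mean(errors))

def training_mse(x, y, degree):
    """In-sample mean squared error of a polynomial fit to all the data."""
    coefs = np.polyfit(x, y, degree)
    return float(np.mean((y - np.polyval(coefs, x)) ** 2))

# Simulated data (hypothetical): the true relationship is linear
rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 30)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

for degree in (1, 8):
    print(degree, round(training_mse(x, y, degree), 3), round(kfold_mse(x, y, degree), 3))
```

The flexible model can only match or improve the training error, so training error alone cannot reveal overfitting; the cross-validated error is the figure to compare across models.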

Regression in Machine Learning: A Brief Overview

Machine learning uses many of the same regression techniques described above but emphasizes prediction accuracy over statistical inference. The primary differences in orientation are summarized in the table below.

Dimension	Classical Statistical Regression	Machine Learning Regression
Primary goal	Inference: understand and quantify relationships; test hypotheses	Prediction: maximize accuracy on new data
Model selection	Theory-driven; parsimony valued	Performance-driven; large models tolerated with regularization
Key metrics	Coefficient estimates, p-values, CIs, R-squared, AIC/BIC	RMSE, MAE, cross-validated R-squared, prediction error
Interpretability	High: individual coefficients have clear meaning	Often lower: ensemble methods (Random Forest, gradient boosting) are ‘black boxes’
Overfitting control	Adjusted R-squared, model parsimony	Regularization (Ridge, Lasso), cross-validation, early stopping
Common algorithms	OLS, logistic, Cox, Poisson regression	Ridge, Lasso, Elastic Net, Random Forest regression, gradient boosting, neural networks

In biomedical and social research, classical regression remains dominant for hypothesis testing and causal inference. Machine learning regression methods are increasingly used for prediction tasks such as clinical risk scoring, image-based diagnosis, and natural language processing of health records, but require careful validation and consideration of fairness and interpretability.

How Do You Choose the Right Regression Model?

The right regression model is determined primarily by the nature of the dependent variable, the number of predictors, and the structure of the data. The following decision guide provides a starting framework. Consultation with a statistician is recommended for complex designs.

Dependent Variable Type	Data Structure	Recommended Regression Type
Continuous (normal distribution)	Single predictor; independent observations	Simple linear regression
Continuous (normal distribution)	Multiple predictors; independent observations	Multiple linear regression
Continuous (normal distribution)	Nested/clustered data (e.g., patients in hospitals)	Multilevel/mixed-effects linear regression
Continuous; curved relationship	Any	Polynomial regression or spline regression
Continuous; many predictors relative to sample size	Any	Ridge or Lasso regression
Binary (yes/no, event/no event)	Independent observations	Binary logistic regression
Binary; many predictors	Independent observations	Penalized logistic regression (Lasso/Ridge)
Ordered categories (mild/moderate/severe)	Independent observations	Ordinal logistic regression
Unordered categories (3+ groups)	Independent observations	Multinomial logistic regression
Count data (0, 1, 2, 3…)	Equal exposure time	Poisson regression
Count data with overdispersion	Any	Negative binomial regression
Time to event with censoring	Any	Cox proportional hazards regression
Censored continuous variable	Any	Tobit regression
Latent constructs; indirect effects	Any; requires validated scales	Structural equation modeling (SEM)

Software for Regression Analysis

Multiple software packages support regression analysis in research. The choice depends on the researcher’s programming experience, available institutional licenses, and the complexity of the analysis required.

Software	Primary Strengths	Common Use Cases
R (free, open-source)	Extremely comprehensive; lm(), glm(), and coxph() functions; survival, lme4, and brms packages; excellent visualization; active community	All regression types; academic research; reproducible analysis with R Markdown
Stata	User-friendly syntax; excellent panel data, survival, and survey regression commands; widely used in economics and epidemiology	Longitudinal panel regression, survival analysis, survey-weighted regression, IV regression
SAS	Gold standard in pharmaceutical and regulatory contexts; PROC REG, LOGISTIC, PHREG, MIXED; strong for GxE interaction models	Clinical trials, FDA submissions, pharmacoepidemiological research
SPSS	Point-and-click interface; well-suited for those with limited programming experience	Social science surveys, educational research, descriptive regression
Python (free)	scikit-learn (sklearn), statsmodels, lifelines, pingouin libraries; excellent for machine learning regression and large datasets	Predictive modeling, genomic data, machine learning integration
JASP / jamovi (free)	GUI-based; built on R; accessible for teaching and non-programmers	Teaching; exploratory regression; meta-analysis

Tips for Your First Regression Analysis

Before you even open your software

Write down your research question in one sentence. If you can’t do that, your analysis isn’t ready to run yet.
Decide on your outcome variable and your predictors before looking at the data. Post-hoc variable selection is one of the most common sources of bias in student projects.
Check whether your outcome is continuous, binary, a count, or time-to-event. That single decision determines which model you need.

Getting your data ready

Look at your data first: run frequencies, histograms, and a correlation matrix before touching a regression function. Surprises at this stage are far easier to fix than surprises after modeling.
Check for missing data and decide how you’ll handle it. Deleting incomplete rows is the default in most software, but it’s rarely the best choice.
If you have categorical predictors (e.g., ethnicity, treatment group), make sure your software is treating them as categories, not as numbers. A coding error here silently ruins your results.

Running the model

Start simple: run a crude (unadjusted) model with just your main predictor first, then add covariates one step at a time. This helps you understand what each variable is doing.
Don’t include every variable available to you. A rough rule: you need at least 10 observations (or 10 events, for logistic/Cox) per predictor you include.
Save your code or syntax. If your supervisor asks you to re-run with one variable changed, you’ll be glad you did.

Checking your results

Always look at your residual plots before reporting anything. A regression output with no assumption checks is incomplete work.
A significant p-value is not the finish line. Ask: is the effect size meaningful? Is the direction of the effect consistent with theory?
Report confidence intervals alongside p-values. A wide CI tells you the estimate is imprecise, even if p < 0.05.
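
To see how a coefficient, its standard error, the p-value, and the confidence interval fit together, here is a short sketch. It uses a large-sample normal approximation, which is an assumption: software uses the t-distribution for linear regression in small samples. The coefficient and standard error are invented, and wald_summary is a hypothetical helper.

```python
from statistics import NormalDist

def wald_summary(coef, se, level=0.95):
    """Return the z statistic, two-sided p-value, and confidence interval for a coefficient."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)     # about 1.96 for a 95% interval
    z = coef / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, (coef - z_crit * se, coef + z_crit * se)

# Hypothetical estimate: b = 0.50 with standard error 0.24
z, p, (ci_low, ci_high) = wald_summary(0.50, 0.24)

print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Here p is below 0.05, yet the interval runs from close to zero to nearly double the point estimate, which is exactly the imprecision a p-value alone would hide.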

The most common mistakes to avoid

Confusing a mediator for a confounder and controlling for the wrong variable.
Running many models and only reporting the one that gave significant results.
Interpreting a regression coefficient as proof of causation.
Forgetting to check for multicollinearity (run VIF on every multiple regression model).
Reporting R-squared without adjusted R-squared in multiple regression.
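
For the multicollinearity check, the variance inflation factor (VIF) of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below assumes NumPy is available and uses simulated predictors; vif is a hypothetical helper, and most statistical packages provide an equivalent.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no intercept column)."""
    n, p = X.shape
    factors = []
    for j in range(p):
        # Regress predictor j on an intercept plus all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1 - resid.var() / X[:, j].var()
        factors.append(1 / (1 - r2))
    return factors

# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

The two near-duplicate predictors receive very large VIFs, while the unrelated predictor stays close to 1, the value that indicates no inflation.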

One reassuring thing: a well-specified model with three predictors and honest interpretation is always better science than a sprawling model with twenty predictors and p-value fishing. Simplicity, transparency, and a clear research question will take you further than complexity.

Frequently Asked Questions

The following questions cover topics that researchers commonly raise in academic forums and statistical communities and that are not fully addressed in the main text.

1. When should I use odds ratios versus risk ratios, and can regression give me both?

Logistic regression produces odds ratios (ORs) by default, not risk ratios (RRs). For rare outcomes (prevalence below approximately 10%), the OR closely approximates the RR and both are acceptable. For common outcomes, however, the OR lies substantially further from 1 than the RR, exaggerating the apparent strength of the association, and can be misleading if read as a risk ratio. Researchers who need risk ratios for common binary outcomes in cohort or cross-sectional studies should consider log-binomial regression (using a log link with a binomial distribution), which directly produces RRs. If log-binomial models fail to converge, modified Poisson regression with robust variance is a reliable alternative and is increasingly recommended in epidemiological literature.
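
The divergence between the two measures is easy to verify by hand. The sketch below uses two invented cohorts with the same risk ratio, one with a rare outcome and one with a common outcome.

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the outcome risk in the exposed group to the risk in the unexposed group."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)

def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of the outcome odds in the exposed group to the odds in the unexposed group."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed

# Hypothetical cohorts of 1,000 exposed and 1,000 unexposed people
rare = (20, 1000, 10, 1000)        # risks of 2% vs 1%
common = (600, 1000, 300, 1000)    # risks of 60% vs 30%

rr_rare, or_rare = risk_ratio(*rare), odds_ratio(*rare)
rr_common, or_common = risk_ratio(*common), odds_ratio(*common)

print(round(rr_rare, 2), round(or_rare, 2))      # the two measures nearly coincide
print(round(rr_common, 2), round(or_common, 2))  # the OR is much larger than the RR
```

Both cohorts have a risk ratio of 2, but the odds ratio is about 2.02 for the rare outcome and 3.5 for the common one.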

2. Does a higher R-squared always mean a better model?

No. R-squared measures goodness of fit to the sample but can be artificially inflated by adding more predictors, even irrelevant ones. Adjusted R-squared, which penalizes for the number of predictors, is a better indicator of model quality. Beyond R-squared, model quality also depends on correct specification, absence of assumption violations, and out-of-sample performance. In social and biomedical research, a low R-squared does not necessarily indicate a worthless model; the coefficients of specific predictors can still be meaningful and important even when overall explanatory power is modest.

3. How do I handle missing data in a regression analysis?

Missing data is pervasive in research and, if ignored, can introduce bias depending on why data are missing. The three missing data mechanisms are: missing completely at random (MCAR), missing at random (MAR, where missingness depends on observed variables), and missing not at random (MNAR, where missingness depends on the unobserved missing value itself). Listwise deletion (excluding any case with a missing value) is the default in most software but is only unbiased under MCAR and reduces statistical power. Multiple imputation by chained equations (MICE) is the recommended approach under MAR and is supported by R (mice package), Stata, SAS, and SPSS. MNAR requires sensitivity analyses with specific assumptions. Researchers should always describe their missing data handling in publications.

4. Is it appropriate to include mediators as covariates in a regression model?

No, in most cases it is not appropriate. If a variable lies on the causal pathway between the exposure and the outcome (i.e., it is a mediator), including it as a covariate will over-adjust the estimate of the exposure’s effect and may introduce collider bias. For example, if studying the effect of smoking on cardiovascular disease, lung function should not be adjusted for if it is a mechanism through which smoking causes disease. Mediators should instead be analyzed formally using mediation analysis (e.g., the Baron-Kenny method or the more modern potential outcomes framework with bootstrapped confidence intervals). The distinction between a confounder (should be adjusted for) and a mediator (should not be adjusted for) requires substantive knowledge of the causal structure of the question.

5. When is it better to use a Poisson model versus a negative binomial model for count outcomes?

The Poisson distribution assumes that the mean and variance of the count outcome are equal. In practice, health and social data almost always show overdispersion: the variance exceeds the mean (e.g., the number of hospital admissions per person is often dominated by a small number of very high users). Negative binomial regression adds an extra parameter (the dispersion parameter) to accommodate this extra variance. A likelihood ratio test comparing the Poisson and negative binomial models can formally test for overdispersion. If overdispersion is present, Poisson standard errors are too small, and negative binomial regression is preferred because its standard errors reflect the extra variance. For zero-inflated data (more zeros than the Poisson or negative binomial distribution predicts), zero-inflated versions of both models are available and should be considered.
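
Before fitting either model, a rough first look at overdispersion is to compare the sample mean and variance of the counts. This is only a crude, unconditional check on invented data; it ignores predictors and does not replace the formal likelihood ratio test described above.

```python
from statistics import mean, variance

# Hypothetical hospital admissions per person: mostly low, with one very high user
admissions = [0, 0, 0, 1, 1, 2, 0, 3, 0, 15]

sample_mean = mean(admissions)
sample_variance = variance(admissions)          # sample variance (n - 1 denominator)
dispersion_ratio = sample_variance / sample_mean

print(sample_mean, round(sample_variance, 2), round(dispersion_ratio, 2))
```

A ratio near 1 is compatible with the Poisson assumption; a ratio well above 1, as here, suggests that a negative binomial model deserves consideration.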

6. How many predictors can I include in a regression model?

For linear regression, there is no strict cutoff, but a common practical guideline is approximately 10 to 20 observations per predictor to avoid overfitting. For logistic regression and Cox regression, the relevant metric is the number of events per variable (EPV): at least 10 events per predictor is a widely cited minimum, because simulations suggest that lower EPVs lead to overfitted and biased models. With fewer events than predictors (as in genomics or proteomics), regularized regression (Lasso, Ridge) or dimensionality reduction before regression is necessary. For any model with many predictors, internal validation (bootstrapping with optimism correction) or external validation on an independent dataset should be performed before the model is applied or reported.
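
The EPV arithmetic can be sketched as follows. The counts are hypothetical, and the code assumes the common convention of counting events in the less frequent outcome category for a binary outcome.

```python
def events_per_variable(n_events, n_nonevents, n_predictors):
    """EPV based on the less frequent outcome category (an assumed convention)."""
    return min(n_events, n_nonevents) / n_predictors

def max_predictors(n_events, n_nonevents, min_epv=10):
    """Largest number of predictors that keeps EPV at or above min_epv."""
    return min(n_events, n_nonevents) // min_epv

# Hypothetical study: 1,000 participants, 80 of whom experienced the event
epv = events_per_variable(n_events=80, n_nonevents=920, n_predictors=12)
limit = max_predictors(n_events=80, n_nonevents=920)

print(round(epv, 2), limit)
```

Although the sample has 1,000 participants, the 80 events support only about 8 predictors under the 10-EPV guideline, so a 12-predictor model would fall short.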

7. What is the difference between fixed effects and random effects in multilevel regression, and which should I use?

Fixed effects models control for all time-invariant characteristics of each unit (person, school, country) by using within-unit variation only; they are ideal when the goal is to estimate the effect of a predictor that varies over time within units, and when unmeasured unit-level confounding is a concern. Random effects models assume that unit-specific deviations are randomly drawn from a distribution and are uncorrelated with the predictors; they are more efficient and can include unit-level (time-invariant) predictors, but are biased if the random effects are correlated with the predictors. The Hausman test can help researchers choose between the two in panel data settings. In social science, fixed effects regression is the preferred approach when causal identification is the priority.
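
The way fixed effects remove stable unit-level confounding can be shown with the within transformation: subtract each unit's own mean from its predictor and outcome values, then regress. The panel below is invented and noise-free so the arithmetic is easy to follow; slope and demean_within are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean

def slope(x, y):
    """Simple linear regression slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def demean_within(units, values):
    """Subtract each unit's own mean from its values (the within transformation)."""
    groups = defaultdict(list)
    for u, v in zip(units, values):
        groups[u].append(v)
    unit_means = {u: mean(vs) for u, vs in groups.items()}
    return [v - unit_means[u] for u, v in zip(units, values)]

# Hypothetical panel: the true within-unit effect of x is 2, but units with higher x
# also have a higher stable baseline (0, 10, 20) that is not measured.
units = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
baseline = {"A": 0.0, "B": 10.0, "C": 20.0}
y = [2.0 * xi + baseline[u] for u, xi in zip(units, x)]

pooled_slope = slope(x, y)                                              # confounded by baselines
within_slope = slope(demean_within(units, x), demean_within(units, y))  # fixed effects estimate

print(pooled_slope, within_slope)
```

The pooled regression attributes the baseline differences to x and returns a slope of 5, while the within estimator recovers the true value of 2 because each unit is compared only with itself.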

8. Can I use regression to compare two separate regression lines (e.g., between males and females)?

Yes, this is done using interaction terms. Rather than running two separate regression models (one for each group), the recommended approach is to run a single model that includes the group variable, all predictors, and the interaction between the group variable and the predictor(s) of interest. A statistically significant interaction term indicates that the slope of the predictor on the outcome differs between groups. Comparing models run separately in subgroups is not statistically appropriate because it does not formally test whether the difference in coefficients is significant and produces incorrect standard errors for that comparison. Some disciplines use the Chow test (an F-test of the null hypothesis that all regression coefficients are equal across groups) as a formal test of structural equivalence between regression lines.
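
The single-model approach can be sketched as follows, assuming NumPy is available. The data are simulated with different slopes in two groups (2 and 5, both made-up values), and the standard errors use the classical OLS formula, which assumes independent errors with constant variance.

```python
import numpy as np

# Simulated data (hypothetical): slope 2 in group 0, slope 5 in group 1
rng = np.random.default_rng(7)
n = 200
x = rng.uniform(0, 10, size=2 * n)
group = np.repeat([0.0, 1.0], n)        # 0 = reference group, 1 = comparison group
y = 1.0 + 2.0 * x + 2.0 * group + 3.0 * x * group + rng.normal(size=2 * n)

# One model: intercept, predictor, group indicator, and predictor-by-group interaction
X = np.column_stack([np.ones(2 * n), x, group, x * group])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors from the residual variance and (X'X)^-1
resid = y - X @ coef
sigma2 = resid @ resid / (len(y) - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_interaction = coef[3] / se[3]

print("slope in reference group:", round(coef[1], 2))
print("difference in slopes (interaction):", round(coef[3], 2))
print("t statistic for the interaction:", round(t_interaction, 1))
```

The coefficient on x is the slope in the reference group, and the interaction coefficient is the difference in slopes between groups, so its test statistic answers the question of whether the two lines differ.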

References

The following sources were consulted in the preparation of this guide:

Statistics How To. Regression Analysis: Step by Step Articles, Videos, Simple Definitions. statisticshowto.com. Accessed June 2026.
Frost J. Regression Analysis. Statistics By Jim Glossary. statisticsbyjim.com. Accessed June 2026.
Appier. 5 Types of Regression Analysis and When to Use Them. appier.com. Accessed June 2026.
ACCA Global. Regression and Correlation. accaglobal.com. Accessed June 2026.
Xi W-F, Jiang Q-W, Yang A-M. Using Stepwise Regression to Address Multicollinearity Is Not Appropriate. International Journal of Surgery. 2024.
Barton SJ, Melton PE, et al. In Epigenomic Studies, Including Cell-Type Adjustments in Regression Models Can Introduce Multicollinearity. Frontiers in Genetics. 2019;10:816.
Rijnhart JJM, et al. Regression and Causality. arXiv preprint arXiv:2006.11754. 2020.
VanderWeele TJ, Mathur MB. Outcome-Wide Longitudinal Designs for Causal Inference: A New Template for Empirical Studies. Statistical Science. 2020.
Richards JB, Patel CJ, et al. How to Draw the Line in Biomedical Research. eLife. 2013.
Dwyer DB, et al. Educating the Future Generation of Researchers: A Cross-Disciplinary Survey of Trends in Analysis Methods. PLOS Biology. 2021.
Richards A, et al. The Eye of the Beholder: How Do Public Health Researchers Interpret Regression Coefficients? A Qualitative Study. BMC Medical Research Methodology. 2024.
