Analyzing clustered data: What biomedical researchers need to know

Clustered data occur when data points are not independent but are instead grouped into clusters or units, such as patients within hospitals, families within communities, or repeated measurements on the same individuals over time. Biomedical researchers often encounter clustered data, especially in epidemiology, public health, and clinical trials. Analyzing clustered data properly is crucial: the analysis must account for the non-independence of observations to give accurate and valid results. This blog post explains some key considerations, along with common approaches and their advantages and disadvantages.

Intraclass Correlation (ICC)

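The ICC measures the proportion of total variance that is attributable to between-cluster variation. It helps researchers assess the degree of clustering in their data: a value near 0 means observations in the same cluster are barely more alike than observations from different clusters, while a value near 1 means almost all of the variation lies between clusters.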

Advantages:
• Provides a quantitative measure of the extent of clustering.
• Useful for designing efficient cluster-randomized trials.

Disadvantages:
• Does not provide information on the direction or specific sources of clustering.
• Is a descriptive summary only: on its own, it does not test hypotheses about predictors or fit a model of the outcome.
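
To make the definition concrete, here is a minimal sketch of one common way to estimate the ICC, the one-way ANOVA estimator, followed by the design effect often used when planning cluster-randomized trials. The hospital names and outcome values are made up for illustration, and the function assumes every cluster has the same number of observations.

```python
from statistics import mean

# Hypothetical toy data: an outcome measured on 3 patients in each of 3 hospitals.
clusters = {
    "hospital_A": [10, 12, 11],
    "hospital_B": [20, 19, 21],
    "hospital_C": [15, 14, 16],
}


def anova_icc(groups):
    """One-way ANOVA estimate of the ICC, assuming equal-sized clusters."""
    values = list(groups.values())
    k = len(values)     # number of clusters
    m = len(values[0])  # observations per cluster
    if any(len(v) != m for v in values):
        raise ValueError("this sketch assumes equal cluster sizes")
    grand_mean = mean(x for v in values for x in v)
    # Mean square between clusters and mean square within clusters.
    ms_between = m * sum((mean(v) - grand_mean) ** 2 for v in values) / (k - 1)
    ms_within = sum((x - mean(v)) ** 2 for v in values for x in v) / (k * (m - 1))
    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)


icc = anova_icc(clusters)

# Design effect: how much clustering inflates the variance of an estimate
# compared with a simple random sample of the same size.
cluster_size = 3
design_effect = 1 + (cluster_size - 1) * icc

print(round(icc, 3), round(design_effect, 3))  # 0.952 2.905
```

In this toy example the hospitals differ a lot and patients within a hospital are very similar, so the ICC is high. Real biomedical data often have far lower ICCs, but even a small ICC can produce a sizeable design effect when clusters are large.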

Generalized Estimating Equations (GEE)

GEE is a method for analyzing correlated or grouped data, such as people within families or patients within hospitals. Rather than modeling each cluster explicitly, it estimates the average effect across the whole population (a “population-averaged” effect). You specify a model for the mean response and a “working” correlation structure that describes how observations within the same cluster are related.

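Advantages:
• Applicable to a wide range of response variables, including binary, continuous, and count data.
• Coefficient estimates remain consistent even if the working correlation structure is misspecified, provided the mean model is correct and robust (sandwich) standard errors are used.
• No need to specify the full likelihood; only the mean model and a working correlation structure are required.

Disadvantages:
• Loses efficiency when the working correlation structure is far from the true one, which may not be known in advance.
• Robust standard errors can be too small when the number of clusters is small.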

Linear Mixed Effects Models (LMM) and Generalized Linear Mixed Models (GLMM)

LMM and GLMM are parametric models that incorporate fixed effects and random effects to account for clustering. LMMs are used for continuous outcomes, while GLMMs extend the same idea to other outcome types, such as binary or count data. Let’s say you’re studying whether an intervention reduces inpatient fall rates in hospitals. LMM and GLMM combine two components:

• A “fixed” part: This is where you examine the general effect of the intervention.
• A “random” part: This part considers the differences between hospitals. It accounts for the fact that patients in the same hospital might have similar fall rates, different from those of patients in other hospitals, for example because of the fall-prevention policies or protocols a hospital follows or because a hospital does not admit a particular type of patient.

LMM or GLMM allows you to figure out if the intervention works, taking into account that fall rates might vary from one hospital to another.

In simple terms, both LMM and GLMM help researchers understand data that involves groups or clusters by looking at both the general trends (fixed effects) and the differences between the groups (random effects). They’re like tools that help uncover the bigger picture while considering the unique characteristics of each group or cluster.

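Advantages:
• Allow for the modeling of both within-cluster and between-cluster variability.
• Flexible in accommodating different correlation structures.

Disadvantages:
• Require you to specify the random-effects structure (for example, random intercepts only, or random slopes as well) and rely on distributional assumptions about the random effects, which can be hard to check.
• May become computationally intensive for large datasets.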
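
The fall-rate example can be sketched as a random-intercept LMM in statsmodels. This is only an illustration under stated assumptions: the data are simulated, the outcome is treated as a continuous ward-level fall rate, the intervention is assigned to wards within hospitals, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated (hypothetical) data: 30 hospitals with 10 wards each.
rng = np.random.default_rng(7)
n_hospitals, n_wards = 30, 10

hospital = np.repeat(np.arange(n_hospitals), n_wards)
# Within every hospital, half of the wards receive the intervention.
treated = np.tile(np.arange(n_wards) % 2, n_hospitals)
# Random part: each hospital has its own baseline fall rate.
hospital_intercept = np.repeat(rng.normal(0.0, 2.0, n_hospitals), n_wards)
noise = rng.normal(0.0, 1.0, n_hospitals * n_wards)
# Fixed part: the intervention lowers the fall rate by 1.5 on average.
fall_rate = 8.0 - 1.5 * treated + hospital_intercept + noise

df = pd.DataFrame(
    {"hospital": hospital, "treated": treated, "fall_rate": fall_rate}
)

# Random-intercept model: fixed effect of treatment, random hospital intercept.
lmm = smf.mixedlm("fall_rate ~ treated", data=df, groups=df["hospital"])
lmm_result = lmm.fit()

effect = float(lmm_result.fe_params["treated"])              # fixed effect
between_var = float(np.asarray(lmm_result.cov_re)[0, 0])     # between hospitals
within_var = float(lmm_result.scale)                         # within hospitals
model_icc = between_var / (between_var + within_var)

print(lmm_result.summary())
```

The two variance components also give a model-based ICC, which ties this approach back to the measure described earlier. A random slope for the intervention could be added if its effect is expected to differ between hospitals.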

Hierarchical Models

Hierarchical models, like Bayesian hierarchical models, explicitly model data at multiple levels of hierarchy, making them suitable for analyzing clustered data.

Advantages:
• Allow for estimation of cluster-specific and population-level effects.
• Flexibility in incorporating prior information.

Disadvantages:
• May require advanced statistical expertise, especially for Bayesian modeling.
• Computationally demanding, particularly for complex models.
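
One way to see how a hierarchical model links cluster-specific and population-level estimates is “partial pooling”: each cluster’s estimate is pulled toward the overall mean, and more strongly when the cluster contributes little data. The sketch below shows this for a simple normal model. It is a deliberate simplification: the population mean and both variances are treated as known, whereas a full Bayesian analysis would place priors on them and estimate everything jointly. All numbers are hypothetical.

```python
def shrunken_mean(cluster_mean, n, pop_mean, between_var, within_var):
    """Posterior mean of a cluster's true mean in a normal-normal model,
    with the population mean and both variances treated as known."""
    # Weight on the cluster's own data: grows with n and with between_var.
    weight = between_var / (between_var + within_var / n)
    return weight * cluster_mean + (1 - weight) * pop_mean


# Two hypothetical hospitals with the same observed mean (8.0) but very
# different sample sizes; the population mean is 5.0.
small_hospital = shrunken_mean(8.0, n=4, pop_mean=5.0,
                               between_var=1.0, within_var=4.0)
large_hospital = shrunken_mean(8.0, n=100, pop_mean=5.0,
                               between_var=1.0, within_var=4.0)

print(round(small_hospital, 3), round(large_hospital, 3))  # 6.5 7.885
```

The small hospital’s estimate moves halfway toward the population mean, while the large hospital’s estimate barely changes, because its own data are far more informative.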

Cluster-Robust Standard Errors

Cluster-robust standard errors are used with standard regression models to adjust for clustering at the inference stage.

Advantages:
• Simple and computationally efficient.

Disadvantages:
• Adjust only the standard errors: the point estimates are unchanged and the clustering itself is not modeled, so this approach addresses inference, not modeling.
• Can underestimate the true standard errors when the number of clusters is small.
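
To show what the adjustment does, here is a minimal sketch for the simplest possible “regression”, estimating a single overall mean. The naive standard error treats all observations as independent, while the cluster-robust version first sums the residuals within each cluster. The data are made up, and the small-sample factor G / (G - 1) used here is one common convention rather than the only one.

```python
from math import sqrt

# Hypothetical outcome values for 4 clusters of 3 observations each.
clusters = [[4, 5, 6], [9, 10, 11], [1, 2, 3], [7, 8, 9]]

values = [x for cluster in clusters for x in cluster]
n = len(values)           # total observations
g = len(clusters)         # number of clusters
estimate = sum(values) / n

# Naive standard error: assumes all n observations are independent.
naive_var = sum((x - estimate) ** 2 for x in values) / (n - 1) / n
naive_se = sqrt(naive_var)

# Cluster-robust standard error: residuals are summed within each cluster,
# so correlated errors no longer count as independent information.
cluster_scores = [sum(x - estimate for x in cluster) for cluster in clusters]
robust_var = (g / (g - 1)) * sum(s ** 2 for s in cluster_scores) / n ** 2
robust_se = sqrt(robust_var)

print(round(naive_se, 3), round(robust_se, 3))  # 0.946 1.75
```

The estimate itself is the same either way; only its standard error changes, and here it nearly doubles. In practice you would rarely code this by hand: in statsmodels, for instance, a fitted regression accepts `cov_type="cluster"` together with `cov_kwds={"groups": ...}`.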

Conclusion

Ignoring clustering usually makes standard errors too small, which leads to overconfident p-values and confidence intervals, so it’s essential to account for clustering properly to draw valid conclusions from clustered data. The choice of approach depends on the nature of the data, the research question, and the available computational resources. You’ll need to carefully consider the advantages and disadvantages of each method and select the one that best addresses your specific research needs.

Do you need help selecting the best approach to deal with clustered data in your research project? Talk to an experienced biostatistician under Editage’s Statistical Analysis & Review Services.

Marisha Fonseca

An editor at heart and perfectionist by disposition, providing solutions for journals, publishers, and universities in areas like alt-text writing and publication consultancy.
