Before running any kind of statistical analysis, it’s important to check the characteristics of your data, especially its distribution. Understanding the distribution of your data can help you make informed decisions in subsequent statistical analyses. For instance, using parametric tests on non-normally distributed data is a frequent error in biomedical research. Today, we chat with Dr. Jacob Wickham about data distributions and how to deal with them, both during your analysis and while writing your research paper.
Dr. Jacob Wickham, Managing Editor of the journal Integrative Zoology, is an Assistant Professor at the Institute of Zoology in the Chinese Academy of Sciences and Adjunct Professor in the Department of Entomology at Rutgers University. An award-winning and celebrated zoologist, Dr. Wickham has over 15 years of experience in academic publishing and has published several papers in leading journals. Dr. Wickham has gained a lot of valuable experience in research and journal publishing over the years and is passionate about sharing his knowledge with researchers to help them in their publication journey.
In this interview, Dr. Wickham shares the importance of checking your data before commencing any kind of statistical analysis, especially for normality. He also outlines a practical set of steps you can follow to make sure that you’re choosing tests suitable to your data, be it normally distributed or not. Dr. Wickham shares tips on how to best summarize your data using measures of central tendency as well as on dealing with outliers.
Do you want to gain a deeper understanding of different data distributions and how to use them? Sign up for this webinar with Dr. Jacob Wickham.
- How important is it to include information about the type and distribution of data in the manuscript? What are the most effective ways of putting across this information?
It’s important to show the variability of your data, especially in your figures and graphs. Reviewers of grant proposals or journal articles will want to see this. Means should have error bars or confidence intervals. Linear regressions should have residuals. With regard to types of data, it depends on whether they are categorical, interval, or ratio data.
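To make the point about error bars concrete, here is a minimal sketch (standard library only; the data and the hardcoded t value are illustrative, not from the interview) of computing a 95% confidence interval for a mean, which could then be drawn as error bars on a figure:

```python
# Sketch: a 95% confidence interval for a sample mean (stdlib only; data illustrative).
import math
import statistics

data = [10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4, 9.7, 10.1]

mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(len(data))  # standard error of the mean
t_crit = 2.262  # two-tailed 95% t critical value for df = 9, hardcoded for simplicity

ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean = {mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

For larger samples the t critical value approaches 1.96; a statistics package would look it up from the t distribution rather than hardcoding it.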
- How does one identify a normal vs non-normal distribution, and what’s the best way to check for normality?
There are simple checks you can do before you analyze your data with parametric tests (used on normally distributed data) and non-parametric tests (used on non-normal data). One is Levene’s test, which checks the homogeneity of variances, or in other words, tests the equality of variance between two or more sample populations. If Levene’s test is statistically significant (p < 0.05), then we reject its null hypothesis of equal population variances and use a non-parametric test (such as the Kruskal-Wallis ANOVA).
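The check described above can be sketched in a few lines, assuming SciPy is available (the sample values below are made up for illustration):

```python
# Levene's test for equality of variances, falling back to Kruskal-Wallis
# if the variances differ (assumes SciPy is installed; data are illustrative).
from scipy import stats

group_a = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1]
group_b = [5.0, 2.9, 6.1, 3.2, 5.8, 2.5, 5.5]

# Null hypothesis: the groups have equal variances.
stat, p = stats.levene(group_a, group_b, center="median")

if p < 0.05:
    # Variances differ; a non-parametric test such as Kruskal-Wallis is safer.
    h, p_kw = stats.kruskal(group_a, group_b)
    print(f"Unequal variances (p={p:.3f}); Kruskal-Wallis p={p_kw:.3f}")
else:
    print(f"No evidence of unequal variances (p={p:.3f})")
```

`center="median"` is the robust (Brown-Forsythe) variant of Levene's test, which is less sensitive to non-normality than centering on the mean.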
You can also check for normality using any one of these tests: 1. Shapiro-Wilk test (for n < 50), 2. Shapiro-Francia test (for n > 50), 3. Anderson-Darling test, 4. Jarque-Bera test, 5. Cramér-von Mises test, 6. D’Agostino-Pearson test. Then, plot a histogram of your data with a normal distribution overlay.
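As a quick illustration of the first test on the list, here is a Shapiro-Wilk check sketched with SciPy (assumed installed; the data are illustrative):

```python
# Shapiro-Wilk normality check (assumes SciPy is installed; data illustrative).
from scipy import stats

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9,
        5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.3, 4.7, 5.0, 5.1]

# Null hypothesis: the data come from a normal distribution.
stat, p = stats.shapiro(data)
print(f"Shapiro-Wilk W={stat:.3f}, p={p:.3f}")

if p < 0.05:
    print("Reject normality -> consider a non-parametric test or a transformation")
else:
    print("No evidence against normality")
```

A histogram of the same data with a fitted normal curve overlaid (e.g., via matplotlib) makes a useful visual companion to the formal test.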
- Could you explain measures of central tendency and variance in a dataset? What should researchers keep in mind while summarizing data?
While there is only one variance (a measure of the average variability, or how dispersed your data are around the average, and the basis of powerful analyses like ANOVA), there are multiple measures of central tendency, namely the mean and the median. In some cases, the median, or middle value (the point where half your data are higher and half are lower), may be a better description if your data are skewed or have an outlier. Then, there’s the mode, the value that occurs most often. Researchers should be open to whatever best summarizes their data, and it may help to plot a frequency distribution histogram when first exploring your data.
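A small example (standard library only; the data are made up, with a deliberate high outlier) shows how these measures can disagree on skewed data:

```python
# Mean, median, and mode for a right-skewed sample (stdlib only; data illustrative).
import statistics

data = [2, 3, 3, 3, 4, 4, 5, 6, 7, 25]  # the 25 is a high outlier

mean = statistics.mean(data)      # pulled upward by the outlier
median = statistics.median(data)  # robust to the outlier
mode = statistics.mode(data)      # the most frequent value

print(f"mean={mean}, median={median}, mode={mode}")  # mean=6.2, median=4.0, mode=3
```

Here the single outlier drags the mean well above the median, which is exactly the situation where the median is the better summary.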
- What about outliers? Is it essential to check for and treat them? Does an author have to specify this in a research paper?
Outliers can greatly affect the mean and variance, and also the slope of a linear regression. There are outlier analyses you can run before excluding a data point, but you must tell the reviewers or readers that you did this. An outlier may be an unreliable data point that is better excluded. But be careful: knowingly omitting or excluding data could be construed as data falsification, which is a form of research misconduct. However, if you perform an outlier analysis and explain yourself, then it should be acceptable.
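One widely used screening rule (a sketch, not the specific analysis Dr. Wickham refers to) flags points lying more than 1.5 times the interquartile range beyond the quartiles; the data and the 1.5 multiplier below are conventional illustrations:

```python
# IQR-based outlier screening (stdlib only; data and 1.5 multiplier illustrative).
import statistics

data = [12, 14, 14, 15, 16, 16, 17, 18, 19, 48]

q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles (default 'exclusive' method)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [x for x in data if x < low or x > high]
print(f"flagged outliers: {outliers}")
```

Whatever rule you use, report it: state in the paper which points were excluded, by what criterion, and why.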
- Finally, what should a researcher keep in mind when choosing a statistical test?
Keep in mind your data distribution. Is it normal or non-normal? Do you need to transform your data before you run your statistical tests? Are there categorical (or dummy) variables to classify groups of data? Remember the three assumptions of ANOVA (1. independent samples, 2. normal distribution, and 3. homogeneity of variances). Familiarize yourself with different statistical programs, whether it’s Minitab, SPSS, Statistica, SAS, or R (just to name a few); even MS Excel has a lot of great options. Explore your data with histograms and descriptive statistics before choosing your test. When you do choose a test, prefer a more robust approach like ANOVA followed by post hoc pairwise comparisons over multiple t-tests, which inflate your Type I error rate (the chance of a false positive). For post hoc tests, note that a Tukey or Bonferroni test for pairwise mean comparisons is better than the less conservative Fisher’s Least Significant Difference. There are also multiple ways to analyze your data, so check the literature on how similar data sets were analyzed and be open to trying multiple tests.
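The ANOVA-then-post-hoc workflow described above can be sketched as follows, assuming SciPy is available (the groups are illustrative, and the Bonferroni correction is used here as one of the post hoc options mentioned):

```python
# One-way ANOVA followed by Bonferroni-corrected pairwise t-tests
# (assumes SciPy is installed; groups are illustrative).
from itertools import combinations
from scipy import stats

groups = {
    "control": [5.1, 4.9, 5.0, 5.2, 4.8, 5.0],
    "dose_1":  [5.4, 5.6, 5.3, 5.5, 5.7, 5.4],
    "dose_2":  [6.1, 6.3, 6.0, 6.2, 6.4, 6.1],
}

# Test all groups at once first, rather than running many unprotected t-tests.
f_stat, p = stats.f_oneway(*groups.values())
print(f"ANOVA: F={f_stat:.2f}, p={p:.4f}")

if p < 0.05:
    # Post hoc pairwise comparisons with a Bonferroni correction:
    # divide alpha by the number of comparisons to control family-wise error.
    pairs = list(combinations(groups, 2))
    alpha = 0.05 / len(pairs)
    for a, b in pairs:
        t, p_pair = stats.ttest_ind(groups[a], groups[b])
        flag = "significant" if p_pair < alpha else "n.s."
        print(f"{a} vs {b}: p={p_pair:.4f} ({flag})")
```

Dedicated routines for Tukey's HSD exist in several packages (e.g., statsmodels); the Bonferroni version shown here is simply the easiest to write out explicitly.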
If you would like expert assistance in choosing the right statistical tests and performing your statistical analysis, check out Editage’s Statistical Analysis & Review Services.