
In biology and medicine, many authors report the use of tests of normality prior to, and often to justify, parametric methods. Yet, these tests are more booby-traps than actual help, especially in light of their context of use (By Romain-Daniel Gosselin, 8-9 min read).
(This article is a more mature version of one of my recent LinkedIn posts. I thought it could be useful for the community to elaborate a bit further)
“Magic mirror on the wall, who is the fairest of them all?” Just like the Evil Queen is obsessed with her beauty and grace in Snow White, life scientists often reduce data quality control prior to statistical analysis to asking tests of normality whether the data at hand are beautifully Gaussian (the other terminology for normality). This approach turns out to be largely biased and ineffective, not only because it often eclipses other key assumptions, but also because these tests are usually misunderstood and misused.
Today, I expose why tests of normality are deceptive. Note that, for the sake of conciseness, the surrogate methods for checking normality will not be discussed in detail (one thing after the other!).
Express cheat sheet on the normality assumption
Whether normality is a vital premise that needs to be systematically and obsessively verified, or a distraction that attracts too much attention, is a little beyond the scope of this article. In any case, I believe it is worth spending a minute on an overview of the overall logic of this assumption.
All statistical tests make assumptions, whose validity enables the use of the sampling distribution of the test statistic to mathematically derive reliable p-values and control type I error, under the premise that the null hypothesis is true. In so-called exact tests, with no excess fuss, these probabilities are calculated directly from the data. But these tests are not very common and have their own drawbacks.
Conversely, asymptotic tests are constructed with the logic that the theoretical distribution of the test statistic for an infinite sample size would be sufficiently well approximated by a mathematically defined probability law (called a probability density or mass function, PDF or PMF, for continuous and discrete data respectively). By assuming that this mathematically manageable probability distribution is a faithful depiction of how the test statistic behaves across samples, calculus methods of integration (for a PDF) or summation (for a PMF) are then used to calculate p-values, which are areas under the curve beyond the observed empirical value.
The crux of this logic is the trustworthiness of the approximation.
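To make the tail-area idea concrete, here is a minimal sketch in Python (the article's own simulations are in R; this is just an illustrative analogue with a hypothetical observed statistic of 1.96): numerically integrating the standard normal density beyond the observed value reproduces what SciPy's survival function returns.

```python
# Illustration: a one-sided p-value is the area under the assumed PDF
# beyond the observed test statistic. We integrate the N(0, 1) density
# numerically and compare it with SciPy's survival function (1 - CDF).
import numpy as np
from scipy import stats
from scipy.integrate import quad

z_observed = 1.96  # hypothetical observed test statistic

# Area under the standard normal density from z_observed to infinity
tail_area, _ = quad(stats.norm.pdf, z_observed, np.inf)

# Same quantity computed directly via the survival function
p_value = stats.norm.sf(z_observed)

print(round(tail_area, 4), round(p_value, 4))  # both ≈ 0.025
```

Both routes give the same number, which is the whole point: the p-value is only as trustworthy as the PDF it is integrated from.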
For one family of asymptotic tests called parametric tests, one assumption among others is distributional: a conjecture is made that a specific population quantity follows a precise distribution, which is often the normal distribution. If the distributional assumption does not hold, chances are that the PDF used to compute the p-values does not reflect the actual, invisible structure of the population variable. The p-value we arrive at would therefore be deceptive.
So, to sum up, the assumption of normality is important for many parametric tests, and it relates to some unobserved properties of the population the data was sampled from, such as the overall structure of the variable itself (e.g. Student’s t test) or the errors between individual values and the model (e.g. regression).
Of course, assumptions and PDFs are ideal constructs. Real-life variables are virtually never normally distributed (far from it), and assumptions are leaps of faith. In addition, some parametric tests are often fairly (nothing to do with the Evil Queen!) resistant to violations of the normality assumption, provided that other assumptions are strong enough and sample sizes are not too different across compared groups. But we will set that aside for a later blog post.
For now, let's just assume that normality matters.
Why tests of normality are problematic
There are dozens of tests of normality, each with its own mathematical make-up and statistical power. Yet, they are all built around the same null hypothesis: that the tested data is compatible with sampling from a normal distribution. A statistically significant p-value means that, if this null hypothesis is indeed true, only a small proportion (corresponding to the p-value) of an infinite number of theoretical similar samples would display a deviation from normality at least as large as the one observed in the data. Fine.
The typical context of normality testing in biomedical methods is a (very) small sample size, typically n = 5 to 10 units, with the Kolmogorov-Smirnov, Shapiro-Wilk, and Anderson-Darling tests being the most often seen. When the test returns a statistically non-significant result, the normality check is declared complete by the authors.
However, this reasoning is flawed. Why? Because in small samples, tests of normality are globally underpowered. This means that the tests will rarely return statistically significant results, even when the data is indeed sampled from a non-normal population. The figure below shows simulations (implemented in R) of 1000 repeated tests with n = 15, drawn from uniform, Laplace and normal distributions (these parent distributions are highlighted with a grey rectangle). For each distribution, four tests of normality are applied: Shapiro-Wilk, Anderson-Darling, Jarque-Bera, and Kolmogorov-Smirnov. Naively, we would expect a very strong right skew for the eight upper p-value distributions, where the tests were applied to data from uniform and Laplace variables, indicating that a majority of tests returned small p-values. In parallel, we would expect flat distributions for the bottom ones, where the normal variable was sampled.
That is certainly not what we get. The resulting p-value distributions spread remarkably between zero and one, even though the generating distributions are clearly non-normal in two of the three cases!

We observe some limited right skew at times, reflecting a marginally higher propensity, particularly for the Shapiro-Wilk and Anderson-Darling tests, to return p-values below the threshold set at 0.05 (red vertical lines). This effect exists, but it is weak and far from sufficient to provide a reliable decision rule in the long run. The Kolmogorov-Smirnov and Jarque-Bera tests perform particularly poorly; their power failure is blatant.
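The logic of the simulation above can be sketched in a few lines; this is a minimal Python analogue (using SciPy, not the original R code), drawing small samples from a clearly non-normal uniform population and recording how often Shapiro-Wilk rejects:

```python
# Minimal analogue of the simulation described above: repeatedly draw
# small samples (n = 15) from a uniform (hence non-normal) population
# and record how often the Shapiro-Wilk test rejects at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sim, alpha = 15, 1000, 0.05

p_values = np.array([
    stats.shapiro(rng.uniform(0, 1, size=n)).pvalue
    for _ in range(n_sim)
])

power = np.mean(p_values < alpha)  # empirical power against uniformity
print(f"Rejection rate (power) at n={n}: {power:.2f}")
```

With samples this small, the rejection rate stays far below the 80% power conventionally deemed acceptable, which is exactly the failure the figure illustrates.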
This over-dominance of non-significant p-values has two dire consequences. On the one hand, in their isolated experiment, researchers will usually be fooled into not rejecting the null hypothesis even in the case of genuinely strong deviations from normality. On the other hand, the insufficient understanding of null hypothesis significance testing in the life science community encourages the incorrect interpretation of this lack of statistical significance as evidence for normality. The absence of evidence against the null merely indicates that the test is unable to detect deviations at the given sample size, certainly not that there is evidence of compatibility with normality.
Hence my main discontent with, and distrust of, these tests.
The limitations of tests of normality extend even further.
It is worth emphasising the scale of the above-described issue. Each of the thousands of simulated samples above contained 15 observations. In many areas of experimental biology, such designs already qualify as relatively large samples. To illustrate that point (with a bit of self-promotion), the figure below shows the sample size distribution in the specific field of biochemical research that employs immunoblotting; it comes from a scoping review I published in 2022 (Gosselin 2022).

At down-to-earth sample sizes of 3 to 5 units instead of 15, although I did not simulate it here, it is more than likely that even radical deviations from population normality will almost never be spotted in the data by any test beyond random chance. The visible consequence in publications is that, when data are tested for normality, the authors virtually always validate the assumption and proceed to parametric testing. Rightly or wrongly so? We will never know. But the whole procedure was a pure waste of time.
Finally, the problems with tests of normality are multiple and extend beyond low power. For example, in very large samples, they tend to reject the null hypothesis for infinitesimal and biologically inconsequential deviations from normality in the parent population. Why? Simply because p-values are strongly dependent on sample size (they shrink as samples increase in size).
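This flip side is easy to demonstrate in a sketch (again in Python rather than R, with an arbitrarily chosen distribution): a Student t population with 20 degrees of freedom differs from a normal one only by a modest excess kurtosis of about 0.375, yet with a huge sample the Jarque-Bera test rejects normality emphatically.

```python
# Large-sample flip side: with one million observations from a Student t
# distribution (20 df, excess kurtosis ~0.375, visually near-normal),
# the Jarque-Bera test rejects the normality null with overwhelming force.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
big_sample = rng.standard_t(df=20, size=1_000_000)

stat, p = stats.jarque_bera(big_sample)
print(f"JB statistic = {stat:.0f}, p-value = {p:.3g}")  # p effectively 0
```

Whether such a tiny departure matters for a subsequent t-test is a scientific question the p-value cannot answer.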
Getting back to basics before testing
It all starts with two fundamental questions. What message is normality testing supposed to convey in the study? Is normality testing really needed? Perhaps you think the answers are obvious, given the introductory paragraphs above. Well, not so fast!
Cutting off any suspense here: virtually no natural variables in life sciences are normally distributed, meaning that splitting hairs to find a perfect match between the data and normality is illusory. The notion of sufficient compatibility is important. It means that the variables and data must have properties that permit the appropriate use of subsequent tests with minimal bias and control of type I error. No less, but no more either.
In terms of message, the process amounts to interrogating the data to determine whether it is compatible enough with a scenario of sampling from a normal population to proceed with parametric methodologies. From another angle: the distribution of the variable in the population may not be perfectly Gaussian, but isn’t it close enough for parametric testing of the data?
With that in mind, ad hoc testing is not necessarily required, since prior knowledge of the variables, logical judgement, or qualitative inspection may already do the job quite decently. In that respect, quantile-quantile (Q-Q) plots are often considered the gold standard for evaluating normality (Pleil 2016). Yet, they can be difficult to interpret, with a fairly steep learning curve, since the deviation from the theoretical distribution that indicates lack of fit is sometimes unclear. Other quantile-based methods exist, although they are less well known, such as the Aldor-Noiman procedure (Aldor-Noiman et al. 2013).
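For readers unfamiliar with Q-Q plots, the underlying computation is simple, as this Python sketch shows (using SciPy's `probplot`; the correlation index is my illustrative shortcut, not a formal test): ordered sample values are paired with theoretical normal quantiles, and the straighter the resulting line, the more plausible the normality scenario.

```python
# A Q-Q "plot" without the plot: scipy.stats.probplot pairs the ordered
# sample values with theoretical normal quantiles, and also returns the
# correlation r of those pairs, a crude index of straightness
# (closer to 1 = more compatible with normality).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(size=200)
uniform_data = rng.uniform(size=200)

(_, _), (_, _, r_normal) = stats.probplot(normal_data, dist="norm")
(_, _), (_, _, r_uniform) = stats.probplot(uniform_data, dist="norm")

print(f"r (normal data)  = {r_normal:.3f}")
print(f"r (uniform data) = {r_uniform:.3f}")
```

The uniform data bends away from the reference line at both ends (a lower r), which is the kind of qualitative signal a trained eye picks up from the plot itself.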
Careful with the temptation of the central limit theorem!
The central limit theorem (CLT) is important in statistics, and it may be expressed in different ways. One relatively intuitive formulation says that if the sample size is large enough, the sampling distribution of the sample mean will closely approximate a normal distribution. This normal sampling distribution is centered on the population mean and has a standard deviation (the standard error of the mean) equal to the population standard deviation divided by the square root of the sample size. The important point here is that the CLT holds for any original parent distribution.
In the figure below, I simulated (in R) the sampling distribution of the mean for a uniformly distributed population (upper panel). The three lower panels show the distributions of means for samples with n=1 (single observation), n=30, and n=100. We see that, although noise remains, the distribution converges to normality as samples increase in size, even though the population has a uniform structure.
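The same convergence can be sketched in a few lines of Python (an analogue of the R simulation, not the original code): means of uniform samples pile up around the population mean, with a spread matching the sigma-over-root-n prediction.

```python
# CLT at work: means of samples drawn from a Uniform(0, 1) population
# cluster around the population mean (0.5), with a standard error close
# to sigma / sqrt(n), where sigma = 1/sqrt(12) for Uniform(0, 1).
import numpy as np

rng = np.random.default_rng(7)
n, n_means = 100, 20_000
sigma = 1 / np.sqrt(12)  # population SD of Uniform(0, 1)

# 20,000 sample means, each computed from n = 100 uniform draws
sample_means = rng.uniform(0, 1, size=(n_means, n)).mean(axis=1)

print(f"Mean of means: {sample_means.mean():.4f} (population mean 0.5)")
print(f"SD of means  : {sample_means.std():.4f} "
      f"(sigma/sqrt(n) = {sigma / np.sqrt(n):.4f})")
```

Note that the simulated standard deviation of the means matches the theoretical standard error, even though no individual observation is remotely Gaussian.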

Although the CLT helps solve some issues linked to parametric testing (I will not go into too much detail here), great caution is recommended when invoking it.
First of all, it is bound by strict conditions: the observations must be independent and identically distributed (i.e. mutually independent and drawn from the same probability distribution), random, and with finite variance. Some distributions do not have a finite variance, such as the Cauchy or Pareto distributions. If they describe the population structure, then the CLT is invalid. Secondly, the notion of a large enough sample is not precisely defined and depends strongly on the original probability distribution. The magic number n=30 often found in textbooks (or ebooks nowadays) makes no real sense, since it ignores the actual population structure. Ten units may be sufficient to call in the CLT if sampled from a light-tailed symmetrical population, whereas hundreds of observations might not suffice for some variables. The magic number also decontextualises the need for more or less strict adherence to normality across projects, which depends on the subsequent inferential methods to be used.
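The Cauchy case mentioned above can be checked directly in a small Python sketch: the mean of n standard Cauchy observations is itself standard Cauchy, whatever n, so its spread never shrinks as samples grow.

```python
# Counter-example: the Cauchy distribution has no finite variance, so
# the CLT does not apply. The sample mean of Cauchy data is itself
# standard Cauchy for any n, so its interquartile range (IQR) stays
# near 2 (the IQR of a standard Cauchy) instead of shrinking with n.
import numpy as np

rng = np.random.default_rng(3)

def iqr_of_means(n, n_sim=10_000):
    """IQR of n_sim sample means, each computed from n Cauchy draws."""
    means = rng.standard_cauchy(size=(n_sim, n)).mean(axis=1)
    q75, q25 = np.percentile(means, [75, 25])
    return q75 - q25

print(f"IQR of means, n=10   : {iqr_of_means(10):.2f}")
print(f"IQR of means, n=1000 : {iqr_of_means(1000):.2f}")
```

Compare with the uniform example earlier, where the spread of the means shrank by a factor of ten when n grew a hundredfold; here it does not budge.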
Furthermore, the CLT solves only part of the normality problem. For example, when using a Student’s t-test, population normality is required so that the test statistic has a t-distribution, which requires both that the standardized mean difference has a standard normal distribution (with mean 0 and standard deviation 1, often noted N(0,1)) and that the ratio of the sample variance to the population variance is chi-squared-distributed. The CLT brings a solution for the former requirement, but the latter is only met if the population is strictly normal.
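In symbols (a standard textbook decomposition, not specific to this article), the t statistic is a standard normal variable divided by the square root of an independent chi-squared variable over its degrees of freedom:

```latex
t \;=\; \frac{\bar{X} - \mu}{S/\sqrt{n}}
  \;=\; \frac{Z}{\sqrt{V/\nu}},
\qquad
Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1),
\qquad
V = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{\nu},
\quad \nu = n - 1.
```

The CLT can make Z approximately N(0,1) for large n under mild conditions, but V is exactly chi-squared distributed (and independent of Z) only when the population itself is normal.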
Finally, the CLT does not work equally well for all estimates. For example, the sampling distribution of Pearson’s correlation coefficient is skewed, possibly leading to biased inference. See a good LinkedIn post from Joachim Schork on this matter (linked in the references).
Conclusion: when are tests of normality acceptable?
With all of the above in mind: yes, tests of normality may be used as a complement to qualitative methods, but never as a standalone check. I also recommend the following precepts, based on the logic that the process must aim at challenging the normality assumption, not validating it:
- Use any statistically significant result to conclude that data with at least the observed deviation from normality would be rare if the normality assumption were true; but if the test returns a non-significant p-value, consider it inconclusive. This is particularly true with small samples.
- Use the most powerful tests available, such as Shapiro-Wilk or Anderson-Darling, and avoid the Kolmogorov-Smirnov test as much as possible.

Let’s close this chapter with a quote attributed to the famous statistician George Box: “All models are wrong, but some are useful”. The normality assumption is no exception.
References
The original LinkedIn post: https://www.linkedin.com/posts/rdgbioethics_science-statistics-research-activity-7417460611682476032-S2XC?utm_source=share&utm_medium=member_desktop&rcm=ACoAAACub0MBIuemKA9IhZSNxdlwKx79FTkway4
Gosselin RD (2022): A snapshot of statistical methods used in experimental immunoblotting: a scoping review https://doi.org/10.1051/fopen/2022009
Pleil JD (2016): QQ-plots for assessing distributions of biomarker measurements and generating defensible summary statistics https://doi.org/10.1088/1752-7155/10/3/035001
Aldor-Noiman S et al. (2013): The Power to See: A New Graphical Test of Normality https://doi.org/10.1080/00031305.2013.847865
LinkedIn post on the CLT in correlation: https://www.linkedin.com/posts/joachim-schork_statistical-dataviz-businessanalyst-activity-7429932315528732672--Lbh?utm_source=share&utm_medium=member_desktop&rcm=ACoAAACub0MBIuemKA9IhZSNxdlwKx79FTkway4
Box GEP (1979): Robustness in the Strategy of Scientific Model Building https://doi.org/10.1016/B978-0-12-438150-6.50018-2
Banner and final images (and the poem!) created by Google Gemini Banana Pro, text 100% written by RDG.
