## Insidious implicit statistical assumptions

Recently, I was rereading Hampel et al.'s *Robust Statistics: The Approach Based on Influence Functions* (1986), and I found this quote about the difference between robust and nonparametric statistics (page 9):

> Robust statistics considers the effects of only approximate fulfillment of assumptions, while nonparametric statistics makes rather weak but nevertheless strict assumptions (such as continuity of distribution or independence).

This statement may sound obvious. Unfortunately, facts that seem obvious in general are not always obvious in the moment. When a researcher works with specific types of distributions for a long time, the properties of these distributions can turn into implicit assumptions. This implicitness can be quite dangerous. If an assumption is explicitly declared, it can become a starting point for a discussion on how to handle its violations. Implicit assumptions are hidden and therefore conceal potential issues in cases when the collected data do not meet our expectations.

A switch from parametric to nonparametric methods is sometimes perceived as a rejection of all assumptions. Such a perception can be hazardous. While the original parametric assumption is explicitly discarded, many researchers continue to act as if the implicit consequences of this assumption were still valid.

Since normality is the most popular parametric assumption, I would like to briefly discuss the associated implicit assumptions, which are often perceived not as unvalidated hypotheses, but as essential properties of the collected data.

## Four main books on robust statistics

Robust statistics is a practical and pragmatic branch of statistics. If you want to design reliable and trustworthy statistical procedures, the knowledge of robust statistics is essential. Unfortunately, it’s a challenging topic to learn.

In this post, I share my favorite books on robust statistics. I cannot pick my favorite one: each book is good in its own way, and all of them complement each other. I am returning to these books periodically to reinforce and expand my understanding of the topic.

## Multimodal distributions and effect size

When we want to express the difference between two samples or distributions, a popular family of measures is the effect sizes based on differences between means (the difference family). When the normality assumption is satisfied, this approach works well thanks to classic measures of effect size like Cohen’s d, Glass’ Δ, or Hedges’ g. With slight deviations from normality, robust alternatives may be considered. To build such a measure, it’s enough to upgrade a classic measure by replacing the sample mean with a robust measure of central tendency and the standard deviation with a robust measure of dispersion. However, this might not be enough in the case of large deviations from normality. In this post, I briefly discuss the problem of effect size evaluation in the context of multimodal distributions.
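As a toy illustration of the upgrade described above, one could replace the means with medians and the pooled standard deviation with a pooled median absolute deviation (MAD). The following Python sketch is a hypothetical implementation under these assumptions, not a definitive robust effect size:

```python
import statistics

def mad(xs, scale=1.4826):
    """Median absolute deviation; the 1.4826 factor makes it consistent
    with the standard deviation under normality."""
    m = statistics.median(xs)
    return scale * statistics.median(abs(x - m) for x in xs)

def robust_effect_size(xs, ys):
    """A Cohen's-d-style effect size with the median in place of the mean
    and a pooled MAD in place of the pooled standard deviation."""
    pooled = ((mad(xs) ** 2 * (len(xs) - 1) + mad(ys) ** 2 * (len(ys) - 1))
              / (len(xs) + len(ys) - 2)) ** 0.5
    return (statistics.median(ys) - statistics.median(xs)) / pooled
```

Such a plug-in upgrade remains meaningful only while the distributions stay unimodal; for truly multimodal data, no single location shift describes the difference adequately.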

## Unobvious limitations of the R `*signrank` Wilcoxon signed-rank functions

In R, we have functions to calculate the density, distribution function, and quantile function
of the Wilcoxon signed-rank statistic distribution: `dsignrank`, `psignrank`, and `qsignrank`.
All these functions use exact calculations of the target functions
(the R 4.3.1 implementation can be found here).
The exact approach works excellently for small sample sizes.
Unfortunately, for large sample sizes, it fails to provide the expected function values.
Out of the box, there are no alternative approximations that could give us reasonable results.
In this post, we investigate the limitations of these functions and
provide the sample size thresholds beyond which we might get invalid results.
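For intuition, the exact distribution behind these functions can be reproduced with a small dynamic program over subset sums of $\{1, \ldots, n\}$. The Python sketch below is an illustration, not the R implementation; it uses exact rational arithmetic, which sidesteps floating-point trouble at large sample sizes at the cost of speed:

```python
from fractions import Fraction

def signrank_cdf(q, n):
    """Exact CDF of the Wilcoxon signed-rank statistic for sample size n:
    the fraction of the 2**n sign assignments whose positive-rank sum
    is <= q.  Counts subset sums of {1, ..., n} via a 0/1-knapsack DP."""
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    q = min(q, max_sum)
    return Fraction(sum(counts[: q + 1]), 2 ** n)
```

For example, `signrank_cdf(3, 3)` gives $5/8$, matching `psignrank(3, 3)` in R.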

## Weighted Mann-Whitney U test, Part 1

Previously, I discussed how to build weighted versions of various statistical methods; in particular, I covered weighted versions of various quantile estimators and of the Hodges-Lehmann location estimator. Such methods can be useful in tasks like supporting weighted mixture distributions or exponential smoothing. In this post, I suggest a way to build a weighted version of the Mann-Whitney U test.

## Joining modes of multimodal distributions

Multimodality of distributions is a severe issue in statistical analysis. Comparing two multimodal distributions is a tricky challenge, and the degree of this challenge depends on the number of existing modes. Switching from unimodal models to multimodal ones can be a controversial decision, potentially causing more problems than it solves. Hence, if we dare to increase the complexity of the models under consideration, we should be sure that this is an essential necessity. Even when we confidently detect a truly multimodal distribution, a unimodal model could be an acceptable approximation if it is sufficiently close to the true distribution. The simplicity of a unimodal model may make it preferable, even if it is less accurate. Of course, the research goals should always be taken into account when the particular model choice is being made.

## Understanding the pitfalls of preferring the median over the mean

A common task in mathematical statistics is to aggregate a set of numbers $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$
to a single “average” value.
Such a value is usually called *central tendency*.
There are multiple measures of central tendency.
The most popular one is the *arithmetic average*, or the *mean*:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The mean is so popular not only thanks to its simplicity but also because it provides the best way to estimate the center of a perfectly normal distribution. Unfortunately, the mean is not a robust measure: a single extreme value $x_i$ may distort the mean estimation and lead to a non-reproducible value that has nothing in common with the “expected” central tendency. Real-life distributions are never exactly normal. They can be pretty close to the normal distribution, but only to a certain extent. Even small deviations from normality may produce occasional extreme outliers, which makes the mean an unreliable measure in the general case.

When people discover the danger of the mean, they start looking for a more robust measure of central tendency, and the first obvious alternative is the sample median $\tilde{\mathbf{x}}$. The classic sample median is easy to calculate: first, sort the sample. If the sample size $n$ is odd, the median is the middle element of the sorted sample; if $n$ is even, the median is the arithmetic average of the two middle elements of the sorted sample. The median is extremely robust: it provides a reasonable estimate even if almost half of the sample elements are corrupted.
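The two-case rule above translates directly into code; a minimal Python sketch:

```python
def sample_median(xs):
    """Classic sample median: sort, then take the middle element (odd n)
    or the average of the two middle elements (even n)."""
    s = sorted(xs)
    n, mid = len(s), len(s) // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2
```

Note how a single corrupted value barely moves the result: `sample_median([1, 2, 3, 1000])` is 2.5, while the mean of the same sample is 251.5.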

For symmetric distributions (including the normal one), the true values of the mean and the median are the same. Once we discover the high robustness of the median, it may be tempting to always use the median instead of the mean. The median is often perceived as “something like the mean but with high resistance to outliers.” Indeed, what is the point of using the unreliable mean, if the median always provides a safer choice? Should we make the median our default option for the central tendency?

The answer is no. You should beware of any default options in mathematical statistics. All the measures are just tools, and each tool has its limitations and areas of applicability. A mindless transition from the mean to the median, regardless of the underlying distribution, is not a smart move. When we are picking a measure of central tendency to use, the first step should be reviewing the research goals: why do we need a measure of central tendency, and what are we going to do with the result? It’s impossible to make a rational decision on the statistical methods used without a clear understanding of the goals. Next, we should match the goals to the properties of available measures.

There are multiple practical issues with the median,
but the most noticeable one in practice concerns its *statistical efficiency*.
Understanding this problem reveals the price of the median’s advanced robustness.
In this post, we discuss the concept of statistical efficiency,
estimate the statistical efficiency of the mean and the median under different distributions,
and consider the Hodges-Lehmann estimator as a measure of central tendency
that provides a better trade-off between robustness and efficiency.
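To preview the efficiency discussion: the gap between the two estimators is easy to observe in a simulation. The Python sketch below (function and parameter names are illustrative) estimates the efficiency of the sample median relative to the sample mean under the standard normal distribution as a ratio of mean-squared errors; the classic asymptotic value is $2/\pi \approx 0.64$:

```python
import random
import statistics

def relative_efficiency(n=100, iterations=10_000, seed=42):
    """Monte Carlo estimate of the efficiency of the sample median
    relative to the sample mean for N(0, 1) samples of size n,
    computed as the ratio of the estimators' mean-squared errors."""
    rng = random.Random(seed)
    mse_mean = mse_median = 0.0
    for _ in range(iterations):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        mse_mean += statistics.fmean(xs) ** 2    # the true center is 0
        mse_median += statistics.median(xs) ** 2
    return mse_mean / mse_median  # close to 2/pi for large n
```

In other words, under normality the median needs roughly 1.5 times as many observations as the mean to achieve the same precision.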

## Introducing defensive statistics

> Normal or approximately normal subjects are less useful objects of research than their pathological counterparts.
>
> — Sigmund Freud, “The Psychopathology of Everyday Life”

In the realm of software development, reliability is crucial. This is especially true when creating systems that automatically analyze performance measurements to maintain optimal application performance. To achieve the desired level of reliability, we need a set of statistical approaches that provide accurate and trustworthy results. These approaches must work even when faced with varying input data sets and multiple violated assumptions, including malformed and corrupted values. In this blog post, I introduce “Defensive Statistics” as an appropriate methodology for tackling this challenge.

## Edgeworth expansion for the Mann-Whitney U test, Part 2: increased accuracy

In the previous post, we showed how the Edgeworth expansion can improve the accuracy of the p-values obtained in the Mann-Whitney U test. However, we considered only the Edgeworth expansion to terms of order $1/m$. In this post, we explore how to improve the accuracy of this approach using the Edgeworth expansion to terms of order $1/m^2$.

## Edgeworth expansion for the Mann-Whitney U test

In previous posts, I have shown a severe drawback of the classic normal approximation for the Mann-Whitney U test: under certain conditions, it can lead to quite substantial p-value errors, distorting the significance level of the test.

In this post, we will explore the potential of the Edgeworth expansion as a more accurate alternative for approximating the distribution of the Mann-Whitney U statistic.
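For context, the classic normal approximation in question uses the exact moments of the U statistic, $\mu = nm/2$ and $\sigma^2 = nm(n + m + 1)/12$. A minimal Python sketch of the two-sided approximation (no tie or continuity correction):

```python
import math

def mann_whitney_normal_pvalue(u, n, m):
    """Two-sided p-value for the Mann-Whitney U statistic via the
    classic normal approximation (the one whose accuracy the
    Edgeworth expansion is meant to improve)."""
    mu = n * m / 2
    sigma = math.sqrt(n * m * (n + m + 1) / 12)
    z = abs(u - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))  # 2 * (1 - Phi(z))
```

The Edgeworth expansion refines the standard normal CDF here with correction terms based on higher cumulants of the U distribution.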
