Confusing tie correction in the classic Mann-Whitney U test implementation


In this post, we discuss the classic implementation of the Mann-Whitney U test for cases in which the compared samples contain tied values. This approach is used in essentially the same way across all the popular statistical packages.

Unfortunately, in some situations, this approach produces confusing p-values, which may surprise researchers who do not have a deep understanding of tie correction. Moreover, some statistical textbooks argue against the validity of the default tie correction. The controversial and counterintuitive nature of this approach can become a severe issue leading to incorrect experiment design and flawed interpretation of results. To prevent such problems, it is essential to clearly understand the actual impact of tied observations on the true p-value and the impact of the tie correction on the approximated p-value. In this post, we discuss the tie correction for the Mann-Whitney U test and review examples that illustrate potential problems. We also provide examples of Mann-Whitney U test implementations from popular statistical packages: wilcox.test from stats (R), mannwhitneyu from SciPy (Python), and MannWhitneyUTest from HypothesisTests (Julia). At the end of the post, we discuss how to avoid possible problems related to the tie correction.
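
As a quick illustration (the samples below are made up for demonstration purposes), R's wilcox.test switches from the exact computation to the normal approximation with the tie correction as soon as the samples contain ties:

x <- c(1, 2, 2, 3, 3, 3)
y <- c(2, 2, 3, 4, 4, 5)
wilcox.test(x, y) # warns "cannot compute exact p-value with ties" and applies the tie correction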

Read more


Efficiency of the central tendency measures under the uniform distribution


Statistical efficiency is one of the primary ways to compare various estimators. Since the normality assumption is often used, Gaussian efficiency (efficiency under the normal distribution) is typically considered. For example, the asymptotic Gaussian efficiency values of the median and the Hodges-Lehmann location estimator (the pseudo-median) are $\approx 64\%$ and $\approx 96\%$ respectively (with the mean as the baseline).

But what if the underlying distribution is not normal, but uniform? What would happen to the relative statistical efficiency values in this case? Let’s find out! In this post, we calculate the relative efficiency of the median, the Hodges-Lehmann location estimator, and the midrange to the mean under the uniform distribution (or under uniformity).
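
As a teaser, here is a minimal Monte Carlo sketch of such a comparison (the sample size, the number of iterations, and the seed are arbitrary illustration choices):

set.seed(42)
n <- 100
iterations <- 1000
estimates <- replicate(iterations, {
  x <- runif(n) # a sample from the standard uniform distribution
  c(mean     = mean(x),
    median   = median(x),
    hl       = unname(wilcox.test(x, conf.int = TRUE)$estimate), # Hodges-Lehmann estimate
    midrange = (min(x) + max(x)) / 2)
})
# Relative efficiency to the mean: var(mean) / var(estimator)
round(var(estimates["mean", ]) / apply(estimates, 1, var), 2)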

Read more


Unobvious problems of using R's implementation of the Hodges-Lehmann estimator


The Hodges-Lehmann location estimator (also known as pseudo-median) is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right). $$

Essentially, it’s the median of the Walsh (pairwise) averages.

For two samples $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$ and $\mathbf{y} = \{ y_1, y_2, \ldots, y_m \}$, we can also consider the Hodges-Lehmann location shift estimator:

$$ \operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right). $$

In R, both estimators are available via the wilcox.test function. Here is a usage example:

set.seed(1729)
x <- rnorm(2000, 5) # A sample of size 2000 from the normal distribution N(5, 1)
y <- rnorm(2000, 2) # A sample of size 2000 from the normal distribution N(2, 1)
wilcox.test(x, conf.int = TRUE)$estimate
# (pseudo)median
#       5.000984
wilcox.test(y, conf.int = TRUE)$estimate
# (pseudo)median
#       1.969096
wilcox.test(x, y, conf.int = TRUE)$estimate
# difference in location
#               3.031782

In most cases, this function works fine. However, there is an unobvious corner case in which it returns incorrect values. In this post, we discuss the underlying problem and provide a correct implementation of the Hodges-Lehmann estimators.

Read more


When Python's Mann-Whitney U test returns extremely distorted p-values


In the previous post, I discussed a huge discrepancy between the exact and asymptotic p-values produced by the R implementation of the Mann-Whitney U test. This issue is not unique to R; it is relevant for statistical packages in other languages as well. In this post, we review the same problem in the Python package SciPy.

Read more


When R's Mann-Whitney U test returns extremely distorted p-values


The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap, which is an extreme case for the Mann–Whitney U test: it yields the minimum possible p-value. In one of the previous posts, I presented the exact equation for this p-value. If we compare two samples of sizes $n$ and $m$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $1/C_{n+m}^n$. For example, if $n=50$ and $m=5$, we get $1/C_{55}^5 \approx 0.0000002874587$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028
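
For reference, the theoretical minimum from the formula above can be reproduced directly via R's choose function:

> 1 / choose(55, 5)
[1] 2.874587e-07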

The obtained p-value is $\approx 0.0001337028$, which is $\approx 465$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we still trust this function? Let’s find out!

Read more


Preprint announcement: 'Weighted quantile estimators'


I have just published a preprint of a paper ‘Weighted quantile estimators’. It’s based on a series of my research notes that I have been writing since September 2020.

The paper preprint is available on arXiv: arXiv:2304.07265 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-wqe. You can cite it as follows:

Abstract:

In this paper, we consider a generic scheme that allows building weighted versions of various quantile estimators, such as traditional quantile estimators based on linear interpolation of two order statistics, the Harrell-Davis quantile estimator and its trimmed modification. The obtained weighted quantile estimators are especially useful in the problem of estimating a distribution at the tail of a time series using quantile exponential smoothing. The presented approach can also be applied to other problems, such as quantile estimation of weighted mixture distributions.

Read more


Rethinking Type I/II error rates with power curves


When it comes to the analysis of a statistical significance test design, many people tend to focus purely on the Type I error rate. Those who are aware of the importance of power analysis often stop at expressing the Type II error rate as a single number. This is better than nothing, but such an approach always confuses me.

Let us say that the declared Type II error rate is 20% (or the declared statistical power is 80%). What does this actually mean? If the sample size and the significance level (or any other significance criterion) are given, the Type II error rate is a function of the effect size. When we express the Type II error rate as a single number, we always (implicitly or explicitly) assume a target effect size. In most cases, it is an arbitrary number chosen to somehow reflect our expectations of a “reasonable” effect size. However, the actual Type II error rate and the corresponding statistical power depend on the actual effect size, which we do not know.

Some researchers estimate the Type II error rate / statistical power using the measured effect size, but this does not make much sense, since it provides no new information beyond the measured effect size or the p-value. In reality, we have high statistical power (low Type II error rate) for large effect sizes and low statistical power (high Type II error rate) for small effect sizes. Without knowledge of the actual effect size (which we do not have), the Type II error rate expressed as a single number mostly describes the arbitrarily chosen expected effect size rather than the actual properties of our statistical test.
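
To make this concrete, here is a minimal sketch of a power curve (the two-sample t-test, the sample size, and the significance level are arbitrary illustration choices):

# Power as a function of the effect size for a two-sample t-test
# with n = 30 observations per group and a 5% significance level
effect_sizes <- seq(0.1, 1.0, by = 0.1)
powers <- sapply(effect_sizes, function(d)
  power.t.test(n = 30, delta = d, sd = 1, sig.level = 0.05)$power)
round(data.frame(effect_size = effect_sizes, power = powers), 3)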

Read more


Adaptation of continuous scale measures to discrete distributions


In statistics, it is often important to have a reliable measure of scale, since it is required for estimating many types of effect sizes and for statistical tests. If we work with continuous distributions, there are plenty of available scale measures with various levels of statistical efficiency and robustness. However, when the distribution becomes discrete (e.g., because of the limited resolution of the measurement tools), classic measures of scale can collapse to zero due to tied values in the collected samples. This can be a severe problem in the analysis, since scale measures are often used as denominators in various equations. To make the calculations more reliable, it is important to handle such situations and ensure that the target scale measure never becomes zero. In this post, I discuss a simple approach to work around this problem and adapt any given measure of scale to the discrete case.
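
As a minimal illustration of the problem (the sample below is made up), even a robust scale measure such as the median absolute deviation collapses to zero once more than half of the values are tied:

x <- c(5, 5, 5, 5, 5, 5, 7, 9) # a discretized sample with many tied values
mad(x) # returns 0, so any equation with mad(x) in the denominator breaks down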

Read more


Weighted modification of the Hodges-Lehmann location estimator


The classic Hodges-Lehmann location estimator is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i < j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right). $$
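
For reference, a direct (non-weighted) implementation of this definition might look as follows; this is only a naive sketch, not the weighted version introduced in this post:

hl <- function(x) {
  pairs <- combn(x, 2)                  # all pairs (x_i, x_j) with i < j
  median((pairs[1, ] + pairs[2, ]) / 2) # median of the pairwise (Walsh) averages
}
set.seed(1729)
hl(rnorm(100, mean = 5)) # close to 5 for a normal sample centered at 5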

This estimator works great for non-weighted samples (its asymptotic Gaussian efficiency is $\approx 96\%$, and its asymptotic breakdown point is $\approx 29\%$). However, in real-world applications, data points may have varying importance or relevance. For example, in finance, different stocks may have different market capitalizations, which can impact the overall performance of an index. In social science research, survey responses may be weighted based on demographic representation to ensure that the final results are more generalizable. In software performance measurements, the observations may be collected from different source code revisions, some of which may be obsolete. In these cases, the classic $\operatorname{HL}$-measure is not suitable, as it treats each data point equally.

We can overcome this problem using weighted samples to obtain more accurate and meaningful central tendency estimates. Unfortunately, there is no well-established definition of the weighted Hodges-Lehmann location estimator. In this blog post, we introduce such a definition so that we can apply this estimator to weighted samples while keeping it compatible with the original version.

Read more