## When Python's Mann-Whitney U test returns extremely distorted p-values

In the previous post, I have discussed a huge difference between p-values evaluated via the R implementation of the Mann-Whitney U test between the exact and asymptotic implementations. This issue is not unique only to R, it is relevant for other statistical packages in other languages as well. In this post, we review this problem in the Python package SciPy.

## When R's Mann-Whitney U test returns extremely distorted p-values

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes $$n$$ and $$m$$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $$1/C_{n+m}^n$$. For example, if $$n=50$$ and $$m=5$$, we get $$1/C_{55}^5 \approx 0.0000002874587$$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")\$p.value
 0.0001337028


The obtained p-value is $$\approx 0.0001337028$$, which is $$\approx 465$$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we now trust this function? Let’s find out!

## Preprint announcement: 'Weighted quantile estimators'

I have just published a preprint of a paper ‘Weighted quantile estimators’. It’s based on a series of my research notes that I have been writing since September 2020.

The paper preprint is available on arXiv: arXiv:2304.07265 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-wqe. You can cite it as follows:

Abstract:

In this paper, we consider a generic scheme that allows building weighted versions of various quantile estimators, such as traditional quantile estimators based on linear interpolation of two order statistics, the Harrell-Davis quantile estimator and its trimmed modification. The obtained weighted quantile estimators are especially useful in the problem of estimating a distribution at the tail of a time series using quantile exponential smoothing. The presented approach can also be applied to other problems, such as quantile estimation of weighted mixture distributions.

## Rethinking Type I/II error rates with power curves

When it comes to the analysis of a statistical significance test design, many people tend to overfocus purely on the Type I error rate. Those who are aware of the importance of power analysis often stop at expressing the Type II error rate as a single number. It is better than nothing, but such an approach always confuses me.

Let us say that the declared Type II error rate is 20% (or the declared statistical power is 80%). What does it actually mean? If the sample size and the significance level (or any other significance criteria) are given, the Type II error rate is a function of the effect size. When we express the Type II error rate as a single number, we always (implicitly or explicitly) assume the target effect size. In most cases, it is an arbitrary number that is somehow chosen to reflect our expectations of the “reasonable” effect size. However, the actual Type II error rate and the corresponding statistical power depend on the actual effect size that we do not know. Some researchers estimate the Type II error rate / statistical power using the measured effect size, but it does not make a lot of sense since it does not provide new information in addition to the measured effect size or p-value. In reality, we have high statistical power (low Type II error rate) for large effect sizes and low statistical power (high Type II error rate) for small effect sizes. Without the knowledge of the actual effect size (which we do not have), the Type II error rate expressed as a single number mostly describes this arbitrarily chosen expected effect size, rather than the actual properties of our statistical test.

## Adaptation of continuous scale measures to discrete distributions

In statistics, it is often important to have a reliable measure of scale since it is required for estimating many types of the effect size and for statistical tests. If we work with continuous distributions, there are plenty of available scale measures with various levels of statistical efficiency and robustness. However, when distribution becomes discrete (e.g. because of the limited resolution of the measure tools), classic measures of scale can collapse to zero due to tied values in collected samples. This can be a severe problem in the analysis since the scale measures are often used as denominators in various equations. To make the calculations more reliable, it is important to handle such situations somehow and ensure that the target scale measure never becomes zero. In this post, I discuss a simple approach to work around this problem and adapt any given measure of scale to the discrete case.

## Weighted modification of the Hodges-Lehmann location estimator

The classic Hodges-Lehmann location estimator is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $$\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$$, it is defined as follows:

$\operatorname{HL}(\mathbf{x}) = \underset{1 \leq i < j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right).$

This estimator works great for non-weighted samples (its asymptotic Gaussian efficiency is $$\approx 96\%$$, and its asymptotic breakdown point is $$\approx 29\%$$). However, in real-world applications, data points may have varying importance or relevance. For example, in finance, different stocks may have different market capitalizations, which can impact the overall performance of an index. In social science research, survey responses may be weighted based on demographic representation to ensure that the final results are more generalizable. In software performance measurements, the observations may be collected from different source code revisions, some of which may be obsolete. In these cases, the classic $$\operatorname{HL}$$-measure is not suitable, as it treats each data point equally.

We can overcome this problem using weighted samples to obtain more accurate and meaningful central tendency estimates. Unfortunately, there is no well-established definition of the weighted Hodges-Lehmann location estimator. In this blog post, we introduce such a definition so that we can apply this estimator to weighted samples keeping it compatible with the original version.

## Performance stability of GitHub Actions

Nowadays, GitHub Actions is one of the most popular free CI systems. It’s quite convenient to use it to run unit and integration tests. However, some developers try to use it to run benchmarks and performance tests. Unfortunately, default GitHub Actions build agents do not provide a consistent execution environment from the performance point of view. Therefore, performance measurements from different builds can not be compared. This makes it almost impossible to set up reliable performance tests based on the default GitHub Actions build agent pool.

So, it’s expected that the execution environments are not absolutely identical. But how bad is the situation? What’s the maximum difference between performance measurements from different builds? Is there a chance that we can play with thresholds and utilize GitHub Actions to detect at least major performance degradations? Let’s find out!

## p-value distribution of the Brunner–Munzel test in the finite case

In our of the previous post, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.

## Comparing statistical power of the Mann-Whitney U test and the Brunner-Munzel test

In this post, we perform a short numerical simulation to compare the statistical power of the Mann-Whitney U test and the Brunner-Munzel test under normality for various sample sizes and significance levels.