When R's Mann-Whitney U test returns extremely distorted p-values

by Andrey Akinshin · 2023-04-25

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes $n$ and $m$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $1/C_{n+m}^n$. For example, if $n=50$ and $m=5$, we get $1/C_{55}^5 \approx 0.0000002874587$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028
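
For comparison, the expected minimum p-value from the formula above can be obtained directly via R's built-in choose function; the result should be approximately the following:

> 1 / choose(55, 5)
[1] 2.874587e-07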

The obtained p-value is $\approx 0.0001337028$, which is $\approx 465$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we still trust this function? Let’s find out!

In fact, wilcox.test has two implementations of the Mann–Whitney U test: an exact one and an approximated one. The choice between them is controlled by the exact parameter of this function. Here is an example from the official documentation:

> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
[1] 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value
[1] 0.1223118

In this example, the approximation looks accurate, with a difference of only about $5\%$. When the exact parameter is not specified, it is determined as follows:

if(is.null(exact))
    exact <- (n.x < 50) && (n.y < 50)
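
In the opening example, n.y equals $50$, so this condition fails and wilcox.test silently falls back to the approximation; passing exact = FALSE explicitly should reproduce the same p-value as the default call:

> wilcox.test(101:105, 1:50, alternative = "greater", exact = FALSE)$p.value
[1] 0.0001337028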

Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test.

The first step of the Mann–Whitney U test is calculating the U statistic:

$$ U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad S(a,b) = \begin{cases} 1, & \text{if } a > b, \\ 0.5, & \text{if } a = b, \\ 0, & \text{if } a < b. \end{cases} $$

It is a simple rank-based statistic that ranges from $0$ to $nm$. Once we have the value of $U$, we should match it against the distribution of the $U$ statistic under the null hypothesis. This distribution can be obtained in two different ways:

  • The exact implementation
    The straightforward implementation uses dynamic programming to obtain the exact probability of observing a given $U$ statistic for the given $n$ and $m$. The only drawback of this approach is its computational complexity, $\mathcal{O}(n^2 m^2)$ in both time and memory, which makes it impractical for large samples.
  • The approximated implementation
    Fortunately, there is a large-sample approximation (see [Hollander1973], pages 68–75); a worked sketch is given after this list. Long story short, we approximate the distribution of the $U$ statistic with the following normal distribution (in the case of tied values, a small correction is needed):
$$ \mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right) $$

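To make the approximation concrete, here is a rough sketch (for illustration only, not the actual wilcox.test internals) that recomputes the approximated p-value of the opening example ($n = 5$, $m = 50$) from the $U$ statistic defined above. Note that wilcox.test also applies a continuity correction by default (correct = TRUE), which appears below as the $0.5$ term:

n <- 5; m <- 50
x <- 101:105; y <- 1:50
U <- sum(outer(x, y, ">") + 0.5 * outer(x, y, "=="))  # U = n * m = 250: the samples do not overlap
mu <- n * m / 2                                       # mean of the approximating normal distribution
sigma <- sqrt(n * m * (n + m + 1) / 12)               # its standard deviation (no ties, no tie correction)
pnorm((U - mu - 0.5) / sigma, lower.tail = FALSE)     # should be close to 0.0001337 from the opening example
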
For the middle values of $U$ (around $nm/2$), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problems, we perform a simple numerical study:

  • Enumerate $n$ from $1$ to $50$ and set $m$ to $50$ (since $50$ is the threshold at which wilcox.test switches to the approximated algorithm);
  • For each pair of $n,\, m$, consider two non-overlapping samples $x = \{ 101, 102, \ldots, 100 + n \}$ and $y = \{ 1, 2, \ldots, m \}$;
  • For each pair of samples $x$ and $y$, estimate the exact and approximated p-values using wilcox.test (R 4.3.0), and evaluate the ratio between the two obtained p-values.

This simulation can be performed using the following simple script:

options(scipen = 999)  # print p-values in fixed (non-scientific) notation
df <- expand.grid(n = 1:50, m = 50)
# For each pair (n, m), compare the non-overlapping samples x = 101:(100 + n) and y = 1:m
df$approx <- sapply(1:nrow(df), function(i)
  wilcox.test(101:(100 + df$n[i]), 1:df$m[i], alternative = "greater", exact = FALSE)$p.value)
df$exact <- sapply(1:nrow(df), function(i)
  wilcox.test(101:(100 + df$n[i]), 1:df$m[i], alternative = "greater", exact = TRUE)$p.value)
df$ratio <- df$approx / df$exact  # how many times the approximation overestimates the p-value
df

Here are the results:

 n  m                     approx                                  exact ratio
 1 50 0.048011543131958198116216 0.019607843137254901688670827297755750 2.448589
 2 50 0.009252304327263300917639 0.000754147812971342414341269222433084 12.268556
 3 50 0.002068602594726797437585 0.000042687612054981640980627632941946 48.459084
 4 50 0.000507425712710512176983 0.000003162045337406047683961830482846 160.473889
 5 50 0.000133702777437926781292 0.000000287458667036913421002024661768 465.120008
 6 50 0.000037419721898732403792 0.000000030799142896812156031614014003 1214.959846
 7 50 0.000011044391063200054742 0.000000003782350882064650319878251473 2919.980564
 8 50 0.000003420097224807733340 0.000000000521703569939951803913405917 6555.633164
 9 50 0.000001106806028706227259 0.000000000079581900499314674285699250 13907.760706
10 50 0.000000373114142492364927 0.000000000013263650083219112919478920 28130.577944
11 50 0.000000130667516485322913 0.000000000002391805752711643505365161 54631.324612
12 50 0.000000047426982847987003 0.000000000000462930145686124477766408 102449.545120
13 50 0.000000017803734950873793 0.000000000000095525268157454267935196 186377.230803
14 50 0.000000006899453656698780 0.000000000000020896152409443119138672 330178.184075
15 50 0.000000002755517795088092 0.000000000000004822189017563796906354 571424.675609
16 50 0.000000001132420441204299 0.000000000000001169015519409405310631 968695.814899
17 50 0.000000000478204516517026 0.000000000000000296615878059102926844 1612201.341500
18 50 0.000000000207230566590806 0.000000000000000078515967721527219964 2639342.959203
19 50 0.000000000092045404447794 0.000000000000000021620338937811845233 4257352.519429
20 50 0.000000000041857100940344 0.000000000000000006177239696517670397 6776020.196196
21 50 0.000000000019466963240158 0.000000000000000001827070896153113595 10654738.839717
22 50 0.000000000009250432374139 0.000000000000000000558271662713451387 16569768.791735
23 50 0.000000000004487040794828 0.000000000000000000175893811539854518 25509941.228442
24 50 0.000000000002219810231291 0.000000000000000000057046641580493394 38912198.330882
25 50 0.000000000001119116799357 0.000000000000000000019015547193497798 58852726.559442
26 50 0.000000000000574519790304 0.000000000000000000006505318776722924 88315393.914190
27 50 0.000000000000300116336043 0.000000000000000000002281085804824923 131567315.621484
28 50 0.000000000000159415937793 0.000000000000000000000818851314552536 194682398.330711
29 50 0.000000000000086049778707 0.000000000000000000000300590988886374 286268657.038994
30 50 0.000000000000047171214487 0.000000000000000000000112721620832390 418475303.483840
31 50 0.000000000000026245863540 0.000000000000000000000043140373404989 608382855.044904
32 50 0.000000000000014813624414 0.000000000000000000000016835267670240 879916179.758513
33 50 0.000000000000008477165239 0.000000000000000000000006693540158047 1266469616.742985
34 50 0.000000000000004916004777 0.000000000000000000000002709290063971 1814499245.458048
35 50 0.000000000000002887621675 0.000000000000000000000001115590026341 2588425503.078257
36 50 0.000000000000001717269860 0.000000000000000000000000466991173817 3677306886.886681
37 50 0.000000000000001033523240 0.000000000000000000000000198605441738 5203901920.395980
38 50 0.000000000000000629226919 0.000000000000000000000000085761440751 7336944364.786281
39 50 0.000000000000000387373398 0.000000000000000000000000037580856059 10307732138.774328
40 50 0.000000000000000241060228 0.000000000000000000000000016702602693 14432494906.850683
41 50 0.000000000000000151579231 0.000000000000000000000000007525348466 20142486658.620667
42 50 0.000000000000000096277356 0.000000000000000000000000003435485169 28024384121.614346
43 50 0.000000000000000061750057 0.000000000000000000000000001588450132 38874406735.166344
44 50 0.000000000000000039980098 0.000000000000000000000000000743529849 53770670472.618752
45 50 0.000000000000000026122455 0.000000000000000000000000000352198350 74169725522.991974
46 50 0.000000000000000017219619 0.000000000000000000000000000168761709 102035109626.813110
47 50 0.000000000000000011448630 0.000000000000000000000000000081771137 140008207685.065460
48 50 0.000000000000000007675202 0.000000000000000000000000000040051169 191634915718.995544
49 50 0.000000000000000005187081 0.000000000000000000000000000019823306 261665784427.152771
50 50 0.000000000000000003533036 0.000000000000000000000000000009911653 356452748856.307617

Thus, for $n=m=50$, the ratio between the approximated p-value (the default one) and the exact p-value (the correct one) is about $3.5 \cdot 10^{11}$. Of course, both p-values are pretty small in this case. If you are using the notorious $\alpha = 0.05$, you will most likely avoid such a problem (however, for $n=1,\, m = 50$, the true exact p-value is $0.0196$, while the approximated one is $0.0480$, which is alarmingly close to $0.05$).
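
As a cross-check, the exact column should coincide with the closed-form minimum p-value $1/C_{n+m}^n$ from the beginning of the post. Assuming the df data frame produced by the script above, this can be verified as follows:

df$theoretical <- 1 / choose(df$n + df$m, df$n)  # closed-form minimum p-value for non-overlapping samples
all.equal(df$exact, df$theoretical)              # expected to be TRUE (up to floating-point tolerance)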

If you prefer using extremely low statistical significance levels and there are non-zero chances of getting extreme values of the $U$ statistic, you may consider making exact = TRUE your default way to call wilcox.test. Such a simple trick may save you from misleading research results.
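
One way to adopt this convention is a thin wrapper that always requests the exact computation (the name wilcox_test_exact is just an illustration, not an existing function):

# Hypothetical helper: force the exact algorithm regardless of the sample sizes
wilcox_test_exact <- function(...) wilcox.test(..., exact = TRUE)

wilcox_test_exact(101:105, 1:50, alternative = "greater")$p.value
# should return about 2.874587e-07, i.e. the exact minimum 1 / choose(55, 5)

Keep in mind that the exact computation can become expensive for large samples (which is exactly why wilcox.test switches to the approximation by default), and that in the presence of tied values wilcox.test will still fall back to the approximation and emit a warning.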

References

  • [Hollander1973]
    Myles Hollander and Douglas A. Wolfe (1973) Nonparametric Statistical Methods. New York: John Wiley & Sons. DOI:10.1002/9781119196037