When R's Mann-Whitney U test returns extremely distorted p-values


The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes \(n\) and \(m\), the minimum p-value we can observe with the one-tailed Mann–Whitney U test is \(1/C_{n+m}^n\). For example, if \(n=50\) and \(m=5\), we get \(1/C_{55}^5 \approx 0.0000002874587\). Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028

The obtained p-value is \(\approx 0.0001337028\), which is \(\approx 465\) times larger than expected! Have we discovered a critical bug in wilcox.test? Can we still trust this function? Let’s find out!
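As a sanity check, the combinatorial minimum \(1/C_{55}^5\) can be computed independently of R; here is a quick sketch using only Python's standard library (illustrative only, not part of the original R workflow):

```python
from math import comb

# Minimum one-tailed Mann-Whitney p-value for samples of sizes n and m:
# under H0 all C(n+m, n) rank arrangements are equally likely, and only
# one of them yields the maximal U statistic.
n, m = 50, 5
min_p = 1 / comb(n + m, n)
print(min_p)  # ~2.874587e-07
```

This matches the expected value, so the discrepancy above comes from wilcox.test itself, not from the formula.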

In fact, wilcox.test contains two implementations of the Mann–Whitney U test: an exact one and an approximated one. The choice between them is controlled by the exact parameter of the function. Here is an example from the official documentation:

> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
[1] 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value
[1] 0.1223118

In this example, the approximation looks accurate, with a difference of only about \(4\%\). When the exact parameter is not specified, it is determined as follows:

if(is.null(exact))
    exact <- (n.x < 50) && (n.y < 50)

Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test.

The first step of the Mann–Whitney U test is calculating the U statistic:

\[U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad S(a,b) = \begin{cases} 1, & \text{if } a > b, \\ 0.5, & \text{if } a = b, \\ 0, & \text{if } a < b. \end{cases} \]
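This definition is straightforward to transcribe into code; the following naive \(\mathcal{O}(nm)\) Python sketch is purely illustrative (the post's own examples use R):

```python
def u_statistic(x, y):
    """U(x, y): count pairs (x_i, y_j) with x_i > y_j; ties count as 0.5."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# The non-overlapping samples from above give the maximal value U = n*m:
print(u_statistic(range(101, 106), range(1, 51)))  # 250.0
```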

It is a simple rank-based statistic that ranges from \(0\) to \(nm\). Once we have the value of \(U\), we must compare it against the distribution of the \(U\) statistic under the null hypothesis. This distribution can be obtained in two different ways:

  • The exact implementation
    The straightforward implementation uses dynamic programming to obtain the exact probability of the given \(U\) statistic for the given \(n\) and \(m\). The only drawback of this approach is its computational complexity (\(\mathcal{O}(n^2 m^2)\) in both time and memory), which makes it impractical for large samples.
  • The approximated implementation
    Fortunately, we have a large-sample approximation method (see [Hollander1973], pages 68–75). Long story short, we approximate the distribution of the \(U\) statistic with the following normal distribution (in the case of tied values, a small correction is needed):

\[\mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right) \]
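To make the two strategies concrete, here is a self-contained Python sketch (an illustration of the two ideas, not R's actual implementation): the exact tail probability via the standard counting recurrence \(N(u; n, m) = N(u; n, m-1) + N(u-m; n-1, m)\), and the normal approximation with a continuity correction of \(0.5\), compared on the \(n=5,\ m=50\) example from above.

```python
from functools import lru_cache
from math import comb, erfc, sqrt

@lru_cache(maxsize=None)
def count(u, n, m):
    """Number of rank arrangements of n x's and m y's with U statistic == u."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # The largest remaining value belongs either to y (U unchanged)
    # or to x (it beats all m y's, contributing m to U).
    return count(u, n, m - 1) + count(u - m, n - 1, m)

def exact_p_greater(u_obs, n, m):
    """Exact one-sided p-value P(U >= u_obs) under H0."""
    total = comb(n + m, n)
    return sum(count(u, n, m) for u in range(u_obs, n * m + 1)) / total

def approx_p_greater(u_obs, n, m):
    """Large-sample normal approximation with continuity correction (no ties)."""
    mu = n * m / 2
    sigma = sqrt(n * m * (n + m + 1) / 12)
    z = (u_obs - mu - 0.5) / sigma
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal

n, m = 5, 50
u = n * m  # non-overlapping samples give the maximal U
print(exact_p_greater(u, n, m))   # ~2.874587e-07, i.e. 1/C(55,5)
print(approx_p_greater(u, n, m))  # ~1.337028e-04, ~465 times larger
```

The two printed values reproduce the exact and approximated p-values from the beginning of the post.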

For middle values of \(U\) (around \(nm/2\)), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problem, we perform a simple numerical study:

  • Enumerate \(n\) from \(1\) to \(50\) and set \(m = 50\) (since \(50\) is the critical value at which wilcox.test switches to the approximated algorithm by default);
  • For each pair of \(n,\, m\), we consider two non-overlapping samples \(x = \{ 101, 102, \ldots, 100 + n \}\) and \(y = \{ 1, 2, \ldots, m \}\);
  • For each pair of samples \(x\) and \(y\), we estimate the exact and approximated p-values using wilcox.test (R 4.3.0). We also evaluate the ratio between two obtained p-values.

This simulation can be performed using the following simple script:

options(scipen = 999)  # disable scientific notation in the printed table
df <- expand.grid(n = 1:50, m = 50)
df$approx <- sapply(1:nrow(df), function(i)
  wilcox.test(101:(100 + df$n[i]), 1:df$m[i], alternative = "greater", exact = FALSE)$p.value)
df$exact <- sapply(1:nrow(df), function(i)
  wilcox.test(101:(100 + df$n[i]), 1:df$m[i], alternative = "greater", exact = TRUE)$p.value)
df$ratio <- df$approx / df$exact
df

Here are the results:

 n   m        approx          exact                 ratio
 1  50  4.801154e-02   1.960784e-02              2.448589
 2  50  9.252304e-03   7.541478e-04             12.268556
 3  50  2.068603e-03   4.268761e-05             48.459084
 4  50  5.074257e-04   3.162045e-06            160.473889
 5  50  1.337028e-04   2.874587e-07            465.120008
 6  50  3.741972e-05   3.079914e-08           1214.959846
 7  50  1.104439e-05   3.782351e-09           2919.980564
 8  50  3.420097e-06   5.217036e-10           6555.633164
 9  50  1.106806e-06   7.958190e-11          13907.760706
10  50  3.731141e-07   1.326365e-11          28130.577944
11  50  1.306675e-07   2.391806e-12          54631.324612
12  50  4.742698e-08   4.629301e-13         102449.545120
13  50  1.780373e-08   9.552527e-14         186377.230803
14  50  6.899454e-09   2.089615e-14         330178.184075
15  50  2.755518e-09   4.822189e-15         571424.675609
16  50  1.132420e-09   1.169016e-15         968695.814899
17  50  4.782045e-10   2.966159e-16        1612201.341500
18  50  2.072306e-10   7.851597e-17        2639342.959203
19  50  9.204540e-11   2.162034e-17        4257352.519429
20  50  4.185710e-11   6.177240e-18        6776020.196196
21  50  1.946696e-11   1.827071e-18       10654738.839717
22  50  9.250432e-12   5.582717e-19       16569768.791735
23  50  4.487041e-12   1.758938e-19       25509941.228442
24  50  2.219810e-12   5.704664e-20       38912198.330882
25  50  1.119117e-12   1.901555e-20       58852726.559442
26  50  5.745198e-13   6.505319e-21       88315393.914190
27  50  3.001163e-13   2.281086e-21      131567315.621484
28  50  1.594159e-13   8.188513e-22      194682398.330711
29  50  8.604978e-14   3.005910e-22      286268657.038994
30  50  4.717121e-14   1.127216e-22      418475303.483840
31  50  2.624586e-14   4.314037e-23      608382855.044904
32  50  1.481362e-14   1.683527e-23      879916179.758513
33  50  8.477165e-15   6.693540e-24     1266469616.742985
34  50  4.916005e-15   2.709290e-24     1814499245.458048
35  50  2.887622e-15   1.115590e-24     2588425503.078257
36  50  1.717270e-15   4.669912e-25     3677306886.886681
37  50  1.033523e-15   1.986054e-25     5203901920.395980
38  50  6.292269e-16   8.576144e-26     7336944364.786281
39  50  3.873734e-16   3.758086e-26    10307732138.774328
40  50  2.410602e-16   1.670260e-26    14432494906.850683
41  50  1.515792e-16   7.525348e-27    20142486658.620667
42  50  9.627736e-17   3.435485e-27    28024384121.614346
43  50  6.175006e-17   1.588450e-27    38874406735.166344
44  50  3.998010e-17   7.435298e-28    53770670472.618752
45  50  2.612245e-17   3.521984e-28    74169725522.991974
46  50  1.721962e-17   1.687617e-28   102035109626.813110
47  50  1.144863e-17   8.177114e-29   140008207685.065460
48  50  7.675202e-18   4.005117e-29   191634915718.995544
49  50  5.187081e-18   1.982331e-29   261665784427.152771
50  50  3.533036e-18   9.911653e-30   356452748856.307617

Thus, for \(n=m=50\), the ratio between the approximated (the default one) and the exact (the correct one) p-values is about \(3.5 \cdot 10^{11}\). Of course, both p-values are pretty small in this case. If you are using the notorious \(\alpha = 0.05\), you will most likely avoid this problem (however, for \(n=1,\, m=50\), the true exact p-value is \(0.0196\), while the approximated one is \(0.0480\), which is alarmingly close to \(0.05\)).

If you prefer using extremely low statistical significance levels and there is a non-zero chance of getting extreme values of the \(U\) statistic, you may consider making exact = TRUE your default way to call wilcox.test. Such a simple habit may save you from misleading research results.

References

  • [Hollander1973]
    Myles Hollander and Douglas A. Wolfe (1973) Nonparametric Statistical Methods. New York: John Wiley & Sons. DOI:10.1002/9781119196037