# When R's Mann-Whitney U test returns extremely distorted p-values

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes $$n$$ and $$m$$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $$1/C_{n+m}^n$$. For example, if $$n=50$$ and $$m=5$$, we get $$1/C_{55}^5 \approx 0.0000002874587$$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value  0.0001337028  The obtained p-value is $$\approx 0.0001337028$$, which is $$\approx 465$$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we now trust this function? Let’s find out! In fact, the wilcox.test has two implementations of the Mann–Whitney U test: the exact one and the approximated one. The choice of a particular strategy is controlled by the exact parameter of this function. Here is an example from the official documentation: > x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46) > y <- c(1.15, 0.88, 0.90, 0.74, 1.21) > wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value  0.1223118  In this example, the approximation looks accurate, with a difference ofhis makes it an impra only about $$5\%$$. When the exact parameter is not specified, it is determined as follows: if(is.null(exact)) exact <- (n.x < 50) && (n.y < 50)  Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test. The first step of the Mann–Whitney U test is calculating the U statistic: $U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad S(a,b) = \begin{cases} 1, & \text{if } a > b, \\ 0.5, & \text{if } a = b, \\ 0, & \text{if } a < b. \end{cases}$ It is a simple rank-based statistic, which varies from $$0$$ to $$nm$$. Once we get the value of $$U$$, we should match it with the distribution of $$U$$ statistics under the valid null hypothesis. The distribution can be obtained in two different ways: • The exact implementation The straightforward implementation uses dynamic programming to obtain the exact probability of obtaining the given $$U$$ statistic for the given $$n$$ and $$m$$. The only drawback of this solution is its computational complexity (which is $$\mathcal{O}(n^2 m^2)$$ for both time and memory). This makes this approach an impractical solution for large samples. • The approximated implementation Fortunately, we have a method of large sample approximation (see [Hollander1973], pages 68–75). Long story short, we approximate the distribution of $$U$$ statistics with the following normal distribution (in the case of tied values, a small correction is needed): $\mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right)$ For the middle values of $$U$$ (around $$nm/2$$), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problems, we perform a simple numerical study: • Enumerate $$n$$ from $$1$$ to $$50$$, set $$m$$ to $$50$$ (since $$50$$ is a critical value, after which wilcox.text switches to the approximated algorithm); • For each pair of $$n,\, m$$, we consider two non-overlapped samples $$x = \{ 101, 102, \ldots, 100 + n \}$$ and $$y = \{ 1, 2, \ldots, m \}$$; • For each pair of samples $$x$$ and $$y$$, we estimate the exact and approximated p-values using wilcox.test (R 4.3.0). We also evaluate the ratio between two obtained p-values. This simulation can be performed using the following simple script: options(scipen = 999) df <- expand.grid(n = 1:50, m = 50) df$approx <- sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = F)$p.value) df$exact <-  sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = T)$p.value) df$ratio <- df$approx / df$exact
df


Here are the results:

nmapproxexactratio
1500.0480115431319581981162160.0196078431372549016886708272977557502.448589
2500.0092523043272633009176390.00075414781297134241434126922243308412.268556
3500.0020686025947267974375850.00004268761205498164098062763294194648.459084
4500.0005074257127105121769830.000003162045337406047683961830482846160.473889
5500.0001337027774379267812920.000000287458667036913421002024661768465.120008
6500.0000374197218987324037920.0000000307991428968121560316140140031214.959846
7500.0000110443910632000547420.0000000037823508820646503198782514732919.980564
8500.0000034200972248077333400.0000000005217035699399518039134059176555.633164
9500.0000011068060287062272590.00000000007958190049931467428569925013907.760706
10500.0000003731141424923649270.00000000001326365008321911291947892028130.577944
11500.0000001306675164853229130.00000000000239180575271164350536516154631.324612
12500.0000000474269828479870030.000000000000462930145686124477766408102449.545120
13500.0000000178037349508737930.000000000000095525268157454267935196186377.230803
14500.0000000068994536566987800.000000000000020896152409443119138672330178.184075
15500.0000000027555177950880920.000000000000004822189017563796906354571424.675609
16500.0000000011324204412042990.000000000000001169015519409405310631968695.814899
17500.0000000004782045165170260.0000000000000002966158780591029268441612201.341500
18500.0000000002072305665908060.0000000000000000785159677215272199642639342.959203
19500.0000000000920454044477940.0000000000000000216203389378118452334257352.519429
20500.0000000000418571009403440.0000000000000000061772396965176703976776020.196196
21500.0000000000194669632401580.00000000000000000182707089615311359510654738.839717
22500.0000000000092504323741390.00000000000000000055827166271345138716569768.791735
23500.0000000000044870407948280.00000000000000000017589381153985451825509941.228442
24500.0000000000022198102312910.00000000000000000005704664158049339438912198.330882
25500.0000000000011191167993570.00000000000000000001901554719349779858852726.559442
26500.0000000000005745197903040.00000000000000000000650531877672292488315393.914190
27500.0000000000003001163360430.000000000000000000002281085804824923131567315.621484
28500.0000000000001594159377930.000000000000000000000818851314552536194682398.330711
29500.0000000000000860497787070.000000000000000000000300590988886374286268657.038994
30500.0000000000000471712144870.000000000000000000000112721620832390418475303.483840
31500.0000000000000262458635400.000000000000000000000043140373404989608382855.044904
32500.0000000000000148136244140.000000000000000000000016835267670240879916179.758513
33500.0000000000000084771652390.0000000000000000000000066935401580471266469616.742985
34500.0000000000000049160047770.0000000000000000000000027092900639711814499245.458048
35500.0000000000000028876216750.0000000000000000000000011155900263412588425503.078257
36500.0000000000000017172698600.0000000000000000000000004669911738173677306886.886681
37500.0000000000000010335232400.0000000000000000000000001986054417385203901920.395980
38500.0000000000000006292269190.0000000000000000000000000857614407517336944364.786281
39500.0000000000000003873733980.00000000000000000000000003758085605910307732138.774328
40500.0000000000000002410602280.00000000000000000000000001670260269314432494906.850683
41500.0000000000000001515792310.00000000000000000000000000752534846620142486658.620667
42500.0000000000000000962773560.00000000000000000000000000343548516928024384121.614346
43500.0000000000000000617500570.00000000000000000000000000158845013238874406735.166344
44500.0000000000000000399800980.00000000000000000000000000074352984953770670472.618752
45500.0000000000000000261224550.00000000000000000000000000035219835074169725522.991974
46500.0000000000000000172196190.000000000000000000000000000168761709102035109626.813110
47500.0000000000000000114486300.000000000000000000000000000081771137140008207685.065460
48500.0000000000000000076752020.000000000000000000000000000040051169191634915718.995544
49500.0000000000000000051870810.000000000000000000000000000019823306261665784427.152771
50500.0000000000000000035330360.000000000000000000000000000009911653356452748856.307617

Thus, for $$n=m=50$$, the ratio between the exact (the correct one) and approximated (the default one) solutions is about $$3.5 \cdot 10^{11}$$. Of course, both p-values are pretty small in this case. If you are using the notorious $$\alpha = 0.05$$, you will most likely avoid such a problem (however, for $$n=1,\, m = 50$$, the exact true p-value is $$0.0196$$, while the approximated one is $$0.0480$$, whish is alarmingly close to $$0.05$$).

If you prefer using extremely low statistical significance levels and there are non-zero chances of getting extreme values of the $$U$$ statistic, you may consider making exact = TRUE your default way to call wilcox.test. Such a simple trick may save you from misleading research results.

• [Hollander1973]
Myles Hollander and Douglas A. Wolfe (1973) Nonparametric Statistical Methods. New York: John Wiley & Sons. DOI:10.1002/9781119196037