The Mann–Whitney U test
(also known as the Wilcoxon rank-sum test)
is one of the most popular nonparametric statistical tests.
In R, it can be accessed using
the wilcox.test function,
which has been available
since R 1.0.0 (February 2000).
With its extensive adoption and long-standing presence in R,
the `wilcox.test`

function has become a trusted tool for many researchers.
But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes \(n\) and \(m\), the minimum p-value we can observe with the one-tailed Mann–Whitney U test is \(1/C_{n+m}^n\). For example, if \(n=50\) and \(m=5\), we get \(1/C_{55}^5 \approx 0.0000002874587\). Let’s check these calculations using R:

```
> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028
```

The obtained p-value is \(\approx 0.0001337028\), which is \(\approx 465\) times larger than we expected!
Have we discovered a critical bug in `wilcox.test`

?
Can we now trust this function?
Let’s find out!

In fact, the `wilcox.test`

has two implementations of the Mann–Whitney U test: the exact one and the approximated one.
The choice of a particular strategy is controlled by the `exact`

parameter of this function.
Here is an example from the official documentation:

```
> x <- c(0.80, 0.83, 1.89, 1.04, 1.45, 1.38, 1.91, 1.64, 0.73, 1.46)
> y <- c(1.15, 0.88, 0.90, 0.74, 1.21)
> wilcox.test(x, y, alternative = "greater", exact = TRUE)$p.value
[1] 0.1272061
> wilcox.test(x, y, alternative = "greater", exact = FALSE)$p.value
[1] 0.1223118
```

In this example, the approximation looks accurate, with a difference ofhis makes it an impra only about \(5\%\).
When the `exact`

parameter is not specified, it is determined
as follows:

```
if(is.null(exact))
exact <- (n.x < 50) && (n.y < 50)
```

Before we start exploring different extreme cases, let us briefly review both methods to estimate the p-value within this test.

The first step of the Mann–Whitney U test is calculating the U statistic:

\[U(x, y) = \sum_{i=1}^n \sum_{j=1}^m S(x_i, y_j),\quad S(a,b) = \begin{cases} 1, & \text{if } a > b, \\ 0.5, & \text{if } a = b, \\ 0, & \text{if } a < b. \end{cases} \]

It is a simple rank-based statistic, which varies from \(0\) to \(nm\). Once we get the value of \(U\), we should match it with the distribution of \(U\) statistics under the valid null hypothesis. The distribution can be obtained in two different ways:

**The exact implementation**

The straightforward implementation uses dynamic programming to obtain the exact probability of obtaining the given \(U\) statistic for the given \(n\) and \(m\). The only drawback of this solution is its computational complexity (which is \(\mathcal{O}(n^2 m^2)\) for both time and memory). This makes this approach an impractical solution for large samples.**The approximated implementation**

Fortunately, we have a method of large sample approximation (see [Hollander1973], pages 68–75). Long story short, we approximate the distribution of \(U\) statistics with the following normal distribution (in the case of tied values, a small correction is needed):

\[\mathcal{N}\left( \frac{nm}{2},\, \frac{nm(n+m+1)}{12} \right) \]

For the middle values of \(U\) (around \(nm/2\)), this approximation provides reasonably accurate results. However, it performs extremely poorly at the tails. To illustrate the problems, we perform a simple numerical study:

- Enumerate \(n\) from \(1\) to \(50\), set \(m\) to \(50\)
(since \(50\) is a critical value, after which
`wilcox.text`

switches to the approximated algorithm); - For each pair of \(n,\, m\), we consider two non-overlapped samples \(x = \{ 101, 102, \ldots, 100 + n \}\) and \(y = \{ 1, 2, \ldots, m \}\);
- For each pair of samples \(x\) and \(y\), we estimate the exact and approximated p-values using
`wilcox.test`

(R 4.3.0). We also evaluate the ratio between two obtained p-values.

This simulation can be performed using the following simple script:

```
options(scipen = 999)
df <- expand.grid(n = 1:50, m = 50)
df$approx <- sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = F)$p.value)
df$exact <- sapply(1:nrow(df), function(i) wilcox.test(101:(100+df$n[i]), 1:df$m[i], "g", exact = T)$p.value)
df$ratio <- df$approx / df$exact
df
```

Here are the results:

n | m | approx | exact | ratio |
---|---|---|---|---|

1 | 50 | 0.048011543131958198116216 | 0.019607843137254901688670827297755750 | 2.448589 |

2 | 50 | 0.009252304327263300917639 | 0.000754147812971342414341269222433084 | 12.268556 |

3 | 50 | 0.002068602594726797437585 | 0.000042687612054981640980627632941946 | 48.459084 |

4 | 50 | 0.000507425712710512176983 | 0.000003162045337406047683961830482846 | 160.473889 |

5 | 50 | 0.000133702777437926781292 | 0.000000287458667036913421002024661768 | 465.120008 |

6 | 50 | 0.000037419721898732403792 | 0.000000030799142896812156031614014003 | 1214.959846 |

7 | 50 | 0.000011044391063200054742 | 0.000000003782350882064650319878251473 | 2919.980564 |

8 | 50 | 0.000003420097224807733340 | 0.000000000521703569939951803913405917 | 6555.633164 |

9 | 50 | 0.000001106806028706227259 | 0.000000000079581900499314674285699250 | 13907.760706 |

10 | 50 | 0.000000373114142492364927 | 0.000000000013263650083219112919478920 | 28130.577944 |

11 | 50 | 0.000000130667516485322913 | 0.000000000002391805752711643505365161 | 54631.324612 |

12 | 50 | 0.000000047426982847987003 | 0.000000000000462930145686124477766408 | 102449.545120 |

13 | 50 | 0.000000017803734950873793 | 0.000000000000095525268157454267935196 | 186377.230803 |

14 | 50 | 0.000000006899453656698780 | 0.000000000000020896152409443119138672 | 330178.184075 |

15 | 50 | 0.000000002755517795088092 | 0.000000000000004822189017563796906354 | 571424.675609 |

16 | 50 | 0.000000001132420441204299 | 0.000000000000001169015519409405310631 | 968695.814899 |

17 | 50 | 0.000000000478204516517026 | 0.000000000000000296615878059102926844 | 1612201.341500 |

18 | 50 | 0.000000000207230566590806 | 0.000000000000000078515967721527219964 | 2639342.959203 |

19 | 50 | 0.000000000092045404447794 | 0.000000000000000021620338937811845233 | 4257352.519429 |

20 | 50 | 0.000000000041857100940344 | 0.000000000000000006177239696517670397 | 6776020.196196 |

21 | 50 | 0.000000000019466963240158 | 0.000000000000000001827070896153113595 | 10654738.839717 |

22 | 50 | 0.000000000009250432374139 | 0.000000000000000000558271662713451387 | 16569768.791735 |

23 | 50 | 0.000000000004487040794828 | 0.000000000000000000175893811539854518 | 25509941.228442 |

24 | 50 | 0.000000000002219810231291 | 0.000000000000000000057046641580493394 | 38912198.330882 |

25 | 50 | 0.000000000001119116799357 | 0.000000000000000000019015547193497798 | 58852726.559442 |

26 | 50 | 0.000000000000574519790304 | 0.000000000000000000006505318776722924 | 88315393.914190 |

27 | 50 | 0.000000000000300116336043 | 0.000000000000000000002281085804824923 | 131567315.621484 |

28 | 50 | 0.000000000000159415937793 | 0.000000000000000000000818851314552536 | 194682398.330711 |

29 | 50 | 0.000000000000086049778707 | 0.000000000000000000000300590988886374 | 286268657.038994 |

30 | 50 | 0.000000000000047171214487 | 0.000000000000000000000112721620832390 | 418475303.483840 |

31 | 50 | 0.000000000000026245863540 | 0.000000000000000000000043140373404989 | 608382855.044904 |

32 | 50 | 0.000000000000014813624414 | 0.000000000000000000000016835267670240 | 879916179.758513 |

33 | 50 | 0.000000000000008477165239 | 0.000000000000000000000006693540158047 | 1266469616.742985 |

34 | 50 | 0.000000000000004916004777 | 0.000000000000000000000002709290063971 | 1814499245.458048 |

35 | 50 | 0.000000000000002887621675 | 0.000000000000000000000001115590026341 | 2588425503.078257 |

36 | 50 | 0.000000000000001717269860 | 0.000000000000000000000000466991173817 | 3677306886.886681 |

37 | 50 | 0.000000000000001033523240 | 0.000000000000000000000000198605441738 | 5203901920.395980 |

38 | 50 | 0.000000000000000629226919 | 0.000000000000000000000000085761440751 | 7336944364.786281 |

39 | 50 | 0.000000000000000387373398 | 0.000000000000000000000000037580856059 | 10307732138.774328 |

40 | 50 | 0.000000000000000241060228 | 0.000000000000000000000000016702602693 | 14432494906.850683 |

41 | 50 | 0.000000000000000151579231 | 0.000000000000000000000000007525348466 | 20142486658.620667 |

42 | 50 | 0.000000000000000096277356 | 0.000000000000000000000000003435485169 | 28024384121.614346 |

43 | 50 | 0.000000000000000061750057 | 0.000000000000000000000000001588450132 | 38874406735.166344 |

44 | 50 | 0.000000000000000039980098 | 0.000000000000000000000000000743529849 | 53770670472.618752 |

45 | 50 | 0.000000000000000026122455 | 0.000000000000000000000000000352198350 | 74169725522.991974 |

46 | 50 | 0.000000000000000017219619 | 0.000000000000000000000000000168761709 | 102035109626.813110 |

47 | 50 | 0.000000000000000011448630 | 0.000000000000000000000000000081771137 | 140008207685.065460 |

48 | 50 | 0.000000000000000007675202 | 0.000000000000000000000000000040051169 | 191634915718.995544 |

49 | 50 | 0.000000000000000005187081 | 0.000000000000000000000000000019823306 | 261665784427.152771 |

50 | 50 | 0.000000000000000003533036 | 0.000000000000000000000000000009911653 | 356452748856.307617 |

Thus, for \(n=m=50\), the ratio between the exact (the correct one) and approximated (the default one) solutions is about \(3.5 \cdot 10^{11}\). Of course, both p-values are pretty small in this case. If you are using the notorious \(\alpha = 0.05\), you will most likely avoid such a problem (however, for \(n=1,\, m = 50\), the exact true p-value is \(0.0196\), while the approximated one is \(0.0480\), whish is alarmingly close to \(0.05\)).

If you prefer using extremely low statistical significance levels
and there are non-zero chances of getting extreme values of the \(U\) statistic,
you may consider making `exact = TRUE`

your default way to call `wilcox.test`

.
Such a simple trick may save you from misleading research results.

### References

**[Hollander1973]**

Myles Hollander and Douglas A. Wolfe (1973) Nonparametric Statistical Methods. New York: John Wiley & Sons. DOI:10.1002/9781119196037