Unobvious limitations of R *signrank Wilcoxon Signed Rank functions

Andrey Akinshin · 2023-07-11

In R, we have functions to calculate the density, distribution function, and quantile function of the Wilcoxon Signed Rank statistic distribution: dsignrank, psignrank, and qsignrank. All the functions use exact calculations of the target functions (the R 4.3.1 implementation can be found here). The exact approach works excellently for small sample sizes. Unfortunately, for large sample sizes, it fails to provide the expected function values. Out of the box, there are no alternative approximation solutions that could allow us to get reasonable results. In this post, we investigate the limitations of these functions and provide sample size thresholds after which we might get invalid results.

psignrank

psignrank returns values of the Wilcoxon Signed Rank statistic cumulative distribution function (CDF). It has the following signature:

psignrank(q, n, lower.tail = TRUE, log.p = FALSE)

The Statistic value takes values between $0$ and $n(n+1)/2$. For example, for $n=4$, the Statistic values range from $0$ to $10$:

psignrank(-1:10, 4)

## [1] 0.0000 0.0625 0.1250 0.1875 0.3125 0.4375 0.5625 0.6875 0.8125 0.8750 0.9375 1.0000

Problems appear when $n \geq 1039$. Let us consider $n=1039$ to illustrate this problem. For this sample size, the Statistic values range from $0$ to $540280$. Let us explore the psignrank results for some of these values using the following snippet:

n <- 1039
q <- c(
  0,
  seq(230000, 262000, by = 2000),
  262633,
  262634,
  270140,
  270141,
  277645,
  277646,
  seq(278000, 310000, by = 2000),
  540280
)
cdf <- psignrank(q, n)
cbind(q, cdf)

Here are the results:

qcdfqcdf
00.00000270141-Inf
2300000.00002277645-Inf
2320000.000042776460.78101
2340000.000092780000.79166
2360000.000212800000.84587
2380000.000442820000.88983
2400000.000912840000.92399
2420000.001802860000.94942
2440000.003432880000.96757
2460000.006282900000.97997
2480000.011042920000.98810
2500000.018672940000.99319
2520000.030392960000.99626
2540000.047642980000.99802
2560000.071973000000.99900
2580000.104833020000.99951
2600000.147383040000.99977
2620000.200173060000.99990
2626330.218993080000.99996
262634Inf3100000.99998
270140Inf5402801.00000

If we continue exploring in-between values, we obtain four intervals in the function domain:

qcdf
[0; 262633][0.00000; 0.21899]
[262634; 270140]$\infty$
[270141; 277645]$-\infty$
[277646; 540280][0.78101; 1.00000]

As we can see, in the middle part of the distribution ($q \in [262634; 277645]$), psignrank returns $\pm \infty$. For larger values of $n$, we have the same pattern: psignrank returns correct values only at the distribution tails. The middle part of the distribution gives invalid values. For $1039 \leq n \leq 1074$ these invalid values are $\pm \infty$. Starting $n \geq 1075$, we observe NaN.

dsignrank

dsignrank returns values of the Wilcoxon Signed Rank statistic probability density function (PDF). It has the following signature:

dsignrank(x, n, log = FALSE)

It works fine for $n \leq 1038$. However, for $n \geq 1039$, we encounter issues similar to psignrank:

n <- 1039
x <- c(262633, 262634, 277646, 277647)
dsignrank(x, n)

## [1] 0.00003051712           Inf           Inf 0.00003051712

For the middle part of the distribution, dsignrank always returns Inf. The range of invalid values almost matches the corresponding range for psignrank.

qsignrank

qsignrank returns values of the Wilcoxon Signed Rank statistic quantile function values. It has the following signature:

qsignrank(p, n, lower.tail = TRUE, log.p = FALSE)

It performs normally for $n \leq 1074$, but it hangs for $n \geq 1075$. For example, the following function call never returns a result:

qsignrank(0.5, 1075)

The hanging behavior is observed for $p \in (0; 1); n \geq 1075$.

Conclusion

The signrank functions in R are reliable for small sample sizes, but they have some unobvious limitations when dealing with larger sample sizes. If one needs values that cannot be obtained using standard functions, various approximation models like the normal one or the Edgeworth expansion can be considered.