# Unobvious problems of using the R's implementation of the Hodges-Lehmann estimator

The Hodges-Lehmann location estimator (also known as pseudo-median) is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right). $$Essentially, it’s the median of the Walsh (pairwise) averages.

For two samples $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$ and $\mathbf{y} = \{ y_1, y_2, \ldots, y_m \}$, we can also consider the Hodges-Lehmann location shift estimator:

$$ \operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right). $$In R, both estimators are available via the wilcox.test function. Here is a usage example:

```
set.seed(1729)
x <- rnorm(2000, 5) # A sample of size 2000 from the normal distribution N(5, 1)
y <- rnorm(2000, 2) # A sample of size 2000 from the normal distribution N(2, 1)
wilcox.test(x, conf.int = TRUE)$estimate
## 5.000984
wilcox.test(y, conf.int = TRUE)$estimate
## 1.969096
wilcox.test(x, y, conf.int = TRUE)$estimate
## 3.031782
```

In most cases, this function works fine. However, there is an unobvious corner case, in which it returns wrong values. In this post, we discuss the underlying problem and provide a correct implementation for the Hodges-Lehmann estimators.

## Problem 1: Zero values

Let us consider the one-sample estimator for a sample that contains exactly one zero element:

```
x <- c(0, 1, 2)
wilcox.test(x, conf.int = TRUE)$estimate
## cannot compute exact confidence interval with zeroes
```

Obviously, for $\mathbb{x} = \{ 0, 1, 2 \}$, the correct pseudo-median value is $1$.
However, `wilcox.test`

returns $1.5$ (a wrong value) and prints a set of warnings about a zero value in the sample.
The problem can be found
in the source of `wilcox.test`

:

```
ZEROES <- any(x == 0)
if(ZEROES)
x <- x[x != 0]
```

For the one-sample case, `wilcox.test`

removes all zero elements from the sample.
This logic is needed to properly perform the Wilcoxon rank-sum test
(also known as the Mann–Whitney U test).
The Hodges-Lehmann estimation is an additional feature of this function.
Unfortunately, this feature is affected by this cleanup of zero values.
Therefore, it actually estimated the pseudo-median
not for $\mathbf{x} = \{ 0, 1, 2 \}$, but for $\mathbf{x}' = \{ 1, 2 \}$ (which is $1.5$).

## Problem 2: Tied values (one sample)

Now let us consider the following sample of four elements with a tie:

$$ \mathbf{x} = \{ -2.12984, -2.12984, 1.1479, -0.4895 \}. $$The true value of the Hodges-Lehmann location estimator is given by:

$$ \begin{align} \operatorname{HL}(\mathbf{x}) & = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right) = \\ & = \Bigl(\operatorname{median} \left\{ x_1 + x_1,\, x_1 + x_2,\, x_1 + x_3,\, x_1 + x_4,\, x_2 + x_2,\, x_2 + x_3,\, x_2 + x_4,\, x_3 + x_3,\, x_3 + x_4,\, x_4 + x_4 \right\} \Bigr) / 2 = \\ & = \Bigl(\operatorname{median} \left\{ -4.25968, -4.25968, -0.98194, -2.61934, -4.25968, -0.98194, -2.61934, 2.2958, 0.6584, -0.979 \right\} \Bigr) / 2 = \\ & = -0.90032. \end{align} $$Now let us look at the calculated result by `wilcox.test`

:

```
x <- c(-2.12984, -2.12984, 1.1479, -0.4895)
wilcox.test(x, conf.int = TRUE)$estimate
## cannot compute exact confidence interval with ties
```

As we can see, the returned value of $-0.6514901$ significantly differs from the expected value of $-0.90032$.
When the sample contains tied values, `wilcox.test`

switched
from the exact implementation of Wilcoxon rank-sum test to the approximated one.
As a side effect, it also uses a peculiar approximation of the Hodges-Lehmann estimator
that leads to another pseudo-median estimation that differs from the explicit equation.

## Problem 3: Tied values (two samples)

When we estimate the location shift between two samples, tied values are also an issue. Let us consider the following samples:

$$ \mathbf{x} = \{ 1.5274454801712, 1.5274454801712, 0.3 \}, $$$$ \mathbf{y} = \{ 3.3, -1.72972619537396 \}. $$The expected value of $\operatorname{HL}(\mathbf{x}, \mathbf{y})$ is $\approx 0.1285858$.
Now let us check the output of `wilcox.test`

:

```
x <- c(1.5274454801712, 1.5274454801712, 0.3)
y <- c(3.3, -1.72972619537396)
wilcox.test(x, y, conf.int = TRUE)$estimate
## 1.503729
```

We have $\approx 0.1285858$ vs. $\approx 1.503729$.
It is a huge difference!
The underlying problem is the same as the previous one: existing of tied values forces `wilcox.test`

to switch
to the approximated algorithm that returns strange results.

## Problem 4: Degenerate samples

When sample ranges are degenerate ($\min(x) = \max(x)$, $\min(y) = \max(y)$),
we get a corner case that is not supported in `wilcox.test`

:

```
x <- c(2, 2)
y <- c(1, 1)
wilcox.test(x, y, conf.int = TRUE)$estimate
## 2: cannot compute confidence interval when all observations are tied
```

While the actual value of $\operatorname{HL}(\mathbf{x}, \mathbf{y})$ is obviously $1$,
`wilcox.test`

fails to provide any result.

## The correct implementation

Given the problems discussed above,
I recommend against using `wilcox.test`

for Hodges-Lehmann estimations due to its numerous corner cases
Instead, I suggest the following simple and reliable implementation that supports all the discussed scenarios:

```
hl <- function(x, y = NULL) {
if (is.null(y)) {
walsh <- outer(x, x, "+") / 2
median(walsh[lower.tri(walsh, diag = TRUE)])
} else {
median(outer(x, y, "-"))
}
}
```

Let us check that it works correctly:

```
x <- c(0, 1, 2)
hl(x)
## [1] 1
x <- c(-2.12984, -2.12984, 1.1479, -0.4895)
hl(x)
```

## [1] -0.90032

x <- c(1.5274454801712, 1.5274454801712, 0.3) y <- c(3.3, -1.72972619537396) hl(x, y)

## [1] 0.1285858

x <- c(2, 2) y <- c(1, 1) hl(x, y)