Hodges-Lehmann-Sen shift and shift confidence interval estimators

by Andrey Akinshin · 2022-05-31

In the previous two posts (1, 2), I discussed the Hodges-Lehmann median estimator. The suggested idea of getting median estimations based on a cartesian product could be adopted to estimate the shift between two samples. In this post, we discuss how to build Hodges-Lehmann shift estimator and how to get confidence intervals for the obtained estimations. Also, we perform a simulation study that checks the actual coverage percentage of these intervals.

Hodges-Lehmann-Sen shift estimator

Let’s consider two samples $x = \{ x_1, x_2, \ldots, x_n, \}$ and $y = \{ y_1, y_2, \ldots, y_m \}$. The Hodges-Lehmann-Sen shift is defined as follows:

$$ \hat{\Delta} = \underset{1 \leq i \leq n;\; 1 \leq j \leq m}{\operatorname{median}}\Big(y_j - x_i\Big). $$

It was suggested in hodges1963 and sen1963.

Confidence Interval

Once we get a shift estimation $\hat{\Delta}$, we may want to also get a confidence interval for this estimation. hodges1963 doesn’t contain any mention of confidence intervals. sen1963 includes a discussion about confidence interval, but it’s quite vague. Recently, I have found a nice straightforward approach for getting CIs (see deshpande2018, 7.11.2). Let’s calculate all the pairwise differences $y_j-x_i$, sort them and denote the result as $d = \{ d_1, d_2, \ldots, d_{nm} \}$. Next, we define the required confidence interval with confidence level $\alpha$ as $[d_l, d_u]$ where

$$ l = \Big[ \frac{nm}{2} - z_{\alpha/2}\sqrt{\frac{nm(n+m+1)}{12}} - 0.5 \Big], $$$$ u = \Big[ \frac{nm}{2} + z_{\alpha/2}\sqrt{\frac{nm(n+m+1)}{12}} - 0.5 \Big]. $$

Simulation Study

Now it’s time to check the above equations. Let’s enumerate different distributions, sample sizes, confidence levels, and check the actual coverage percentage in each situation.

Here is the source code of this simulation:

library(evd)
library(EnvStats)
library(stringr)

d.unif <- list(title = "Uniform(a=0, b=1)", r = runif, q = qunif)
d.tri_0_2_1 <- list(title = "Triangular(a=0, b=2, c=1)", r = function(n) rtri(n, 0, 2, 1), q = function(p) qtri(p, 0, 2, 1))
d.tri_0_2_02 <- list(title = "Triangular(a=0, b=2, c=0.2)", r = function(n) rtri(n, 0, 2, 0.2), q = function(p) qtri(p, 0, 2, 0.2))
d.beta2_4 <- list(title = "Beta(a=2, b=4)", r = function(n) rbeta(n, 2, 4), q = function(p) qbeta(p, 2, 4))
d.beta2_10 <- list(title = "Beta(a=2, b=10)", r = function(n) rbeta(n, 2, 10), q = function(p) qbeta(p, 2, 10))

d.norm <- list(title = "Normal(m=0, sd=1)", r = rnorm, q = qnorm)
d.weibull1_2 <- list(title = "Weibull(scale=1, shape=2)", r = function(n) rweibull(n, 2), q = function(p) qweibull(p, 2))
d.student3 <- list(title = "Student(df=3)", r = function(n) rt(n, 3), q = function(p) qt(p, 3))
d.gumbel <- list(title = "Gumbel(loc=0, scale=1)", r = rgumbel, q = qgumbel)
d.exp <- list(title = "Exp(rate=1)", r = rexp, q = qexp)

d.cauchy <- list(title = "Cauchy(x0=0, gamma=1)", r = rcauchy, q = qcauchy)
d.pareto1_05 <- list(title = "Pareto(loc=1, shape=0.5)", r = function(n) rpareto(n, 1, 0.5), q = function(p) qpareto(p, 1, 0.5))
d.pareto1_2 <- list(title = "Pareto(loc=1, shape=2)", r = function(n) rpareto(n, 1, 2), q = function(p) qpareto(p, 1, 2))
d.lnorm0_1 <- list(title = "LogNormal(mlog=0, sdlog=1)", r = function(n) rlnorm(n, 0, 1), q = function(p) qlnorm(p, 0, 1))
d.lnorm0_2 <- list(title = "LogNormal(mlog=0, sdlog=2)", r = function(n) rlnorm(n, 0, 2), q = function(p) qlnorm(p, 0, 2))

d.lnorm0_3 <- list(title = "LogNormal(mlog=0, sdlog=3)", r = function(n) rlnorm(n, 0, 3), q = function(p) qlnorm(p, 0, 3))
d.weibull1_03 <- list(title = "Weibull(shape=0.3)", r = function(n) rweibull(n, 0.3), q = function(p) qweibull(p, 0.3))
d.weibull1_05 <- list(title = "Weibull(shape=0.5)", r = function(n) rweibull(n, 0.5), q = function(p) qweibull(p, 0.5))
d.frechet1 <- list(title = "Frechet(shape=1)", r = function(n) rfrechet(n, shape = 1), q = function(p) qfrechet(p, shape = 1))
d.frechet3 <- list(title = "Frechet(shape=3)", r = function(n) rfrechet(n, shape = 3), q = function(p) qfrechet(p, shape = 3))

ds <- list(
  d.unif, d.tri_0_2_1, d.tri_0_2_02, d.beta2_4, d.beta2_10,
  d.norm, d.weibull1_2, d.student3, d.gumbel, d.exp,
  d.cauchy, d.pareto1_05, d.pareto1_2, d.lnorm0_1, d.lnorm0_2,
  d.lnorm0_3, d.weibull1_03, d.weibull1_05, d.frechet1, d.frechet3
)

hls <- function(x, y, alpha = 0.95) {
  df <- expand.grid(x = x, y = y)
  d <- sort(df$y - df$x)
  shift <- mean(d)
  n <- length(x)
  m <- length(y)
  A <- n * m / 2 - 0.5
  z <- qnorm((1 + alpha) / 2)
  B <- z * sqrt(n * m * (n + m + 1) / 12)
  l <- max(round(A - B), 1)
  r <- min(round(A + B), length(d))
  list(shift = shift, l = d[l], r = d[r])
}
coverage <- function(rdist, n, true.shift, alpha = 0.95, iterations = 10000) {
  cover.cnt <- 0
  for (iter in 1:iterations) {
    est <- hls(rdist(10), rdist(10) + true.shift, alpha)
    if (est$l < true.shift && true.shift < est$r)
      cover.cnt <- cover.cnt + 1
  }
  cover.cnt / iterations
}

for (n in c(5, 10, 50))
  for (alpha in c(0.90, 0.95, 0.99)) {
    cat(paste0("*** n = ", n, ", alpha = ", alpha * 100, "% ***\n"))
    for (d in ds) {
      c <- coverage(d$r, n, 5, alpha) * 100
      cat(paste0(str_pad(d$title, 30, "right"), ": ", str_pad(format(c, nsmall = 2), 6, "left"), "%\n"))
    }
    cat("\n")
  }

And here are the results:

*** n = 5, alpha = 90% ***
Uniform(a=0, b=1)             :  89.65%
Triangular(a=0, b=2, c=1)     :  89.20%
Triangular(a=0, b=2, c=0.2)   :  89.00%
Beta(a=2, b=4)                :  89.72%
Beta(a=2, b=10)               :  90.32%
Normal(m=0, sd=1)             :  88.78%
Weibull(scale=1, shape=2)     :  89.24%
Student(df=3)                 :  89.56%
Gumbel(loc=0, scale=1)        :  89.32%
Exp(rate=1)                   :  89.52%
Cauchy(x0=0, gamma=1)         :  89.16%
Pareto(loc=1, shape=0.5)      :  89.19%
Pareto(loc=1, shape=2)        :  89.29%
LogNormal(mlog=0, sdlog=1)    :  89.53%
LogNormal(mlog=0, sdlog=2)    :  89.18%
LogNormal(mlog=0, sdlog=3)    :  89.06%
Weibull(shape=0.3)            :  89.56%
Weibull(shape=0.5)            :  88.83%
Frechet(shape=1)              :  89.42%
Frechet(shape=3)              :  89.63%

*** n = 5, alpha = 95% ***
Uniform(a=0, b=1)             :  94.72%
Triangular(a=0, b=2, c=1)     :  94.93%
Triangular(a=0, b=2, c=0.2)   :  94.58%
Beta(a=2, b=4)                :  94.56%
Beta(a=2, b=10)               :  94.80%
Normal(m=0, sd=1)             :  94.36%
Weibull(scale=1, shape=2)     :  95.07%
Student(df=3)                 :  94.73%
Gumbel(loc=0, scale=1)        :  95.03%
Exp(rate=1)                   :  94.71%
Cauchy(x0=0, gamma=1)         :  94.38%
Pareto(loc=1, shape=0.5)      :  94.64%
Pareto(loc=1, shape=2)        :  94.58%
LogNormal(mlog=0, sdlog=1)    :  94.56%
LogNormal(mlog=0, sdlog=2)    :  94.90%
LogNormal(mlog=0, sdlog=3)    :  94.90%
Weibull(shape=0.3)            :  94.56%
Weibull(shape=0.5)            :  94.43%
Frechet(shape=1)              :  94.88%
Frechet(shape=3)              :  94.52%

*** n = 5, alpha = 99% ***
Uniform(a=0, b=1)             :  99.21%
Triangular(a=0, b=2, c=1)     :  99.28%
Triangular(a=0, b=2, c=0.2)   :  99.22%
Beta(a=2, b=4)                :  99.38%
Beta(a=2, b=10)               :  99.17%
Normal(m=0, sd=1)             :  99.13%
Weibull(scale=1, shape=2)     :  99.23%
Student(df=3)                 :  99.35%
Gumbel(loc=0, scale=1)        :  99.23%
Exp(rate=1)                   :  99.29%
Cauchy(x0=0, gamma=1)         :  99.37%
Pareto(loc=1, shape=0.5)      :  99.30%
Pareto(loc=1, shape=2)        :  99.41%
LogNormal(mlog=0, sdlog=1)    :  99.21%
LogNormal(mlog=0, sdlog=2)    :  99.29%
LogNormal(mlog=0, sdlog=3)    :  99.34%
Weibull(shape=0.3)            :  99.21%
Weibull(shape=0.5)            :  99.29%
Frechet(shape=1)              :  99.31%
Frechet(shape=3)              :  99.30%

*** n = 10, alpha = 90% ***
Uniform(a=0, b=1)             :  89.33%
Triangular(a=0, b=2, c=1)     :  89.36%
Triangular(a=0, b=2, c=0.2)   :  89.77%
Beta(a=2, b=4)                :  88.51%
Beta(a=2, b=10)               :  89.07%
Normal(m=0, sd=1)             :  89.58%
Weibull(scale=1, shape=2)     :  89.43%
Student(df=3)                 :  89.19%
Gumbel(loc=0, scale=1)        :  89.45%
Exp(rate=1)                   :  88.99%
Cauchy(x0=0, gamma=1)         :  90.34%
Pareto(loc=1, shape=0.5)      :  89.42%
Pareto(loc=1, shape=2)        :  89.38%
LogNormal(mlog=0, sdlog=1)    :  89.37%
LogNormal(mlog=0, sdlog=2)    :  89.34%
LogNormal(mlog=0, sdlog=3)    :  89.42%
Weibull(shape=0.3)            :  89.59%
Weibull(shape=0.5)            :  89.25%
Frechet(shape=1)              :  89.60%
Frechet(shape=3)              :  89.56%

*** n = 10, alpha = 95% ***
Uniform(a=0, b=1)             :  94.85%
Triangular(a=0, b=2, c=1)     :  94.79%
Triangular(a=0, b=2, c=0.2)   :  94.51%
Beta(a=2, b=4)                :  95.02%
Beta(a=2, b=10)               :  94.71%
Normal(m=0, sd=1)             :  94.85%
Weibull(scale=1, shape=2)     :  95.04%
Student(df=3)                 :  94.71%
Gumbel(loc=0, scale=1)        :  94.63%
Exp(rate=1)                   :  94.72%
Cauchy(x0=0, gamma=1)         :  94.47%
Pareto(loc=1, shape=0.5)      :  94.28%
Pareto(loc=1, shape=2)        :  94.96%
LogNormal(mlog=0, sdlog=1)    :  94.75%
LogNormal(mlog=0, sdlog=2)    :  94.49%
LogNormal(mlog=0, sdlog=3)    :  94.41%
Weibull(shape=0.3)            :  94.71%
Weibull(shape=0.5)            :  94.83%
Frechet(shape=1)              :  94.66%
Frechet(shape=3)              :  94.69%

*** n = 10, alpha = 99% ***
Uniform(a=0, b=1)             :  99.25%
Triangular(a=0, b=2, c=1)     :  99.25%
Triangular(a=0, b=2, c=0.2)   :  99.24%
Beta(a=2, b=4)                :  99.20%
Beta(a=2, b=10)               :  99.21%
Normal(m=0, sd=1)             :  99.31%
Weibull(scale=1, shape=2)     :  99.21%
Student(df=3)                 :  99.36%
Gumbel(loc=0, scale=1)        :  99.49%
Exp(rate=1)                   :  99.19%
Cauchy(x0=0, gamma=1)         :  99.41%
Pareto(loc=1, shape=0.5)      :  99.19%
Pareto(loc=1, shape=2)        :  99.28%
LogNormal(mlog=0, sdlog=1)    :  99.30%
LogNormal(mlog=0, sdlog=2)    :  99.31%
LogNormal(mlog=0, sdlog=3)    :  99.39%
Weibull(shape=0.3)            :  99.20%
Weibull(shape=0.5)            :  99.36%
Frechet(shape=1)              :  99.28%
Frechet(shape=3)              :  99.34%

*** n = 50, alpha = 90% ***
Uniform(a=0, b=1)             :  89.62%
Triangular(a=0, b=2, c=1)     :  88.94%
Triangular(a=0, b=2, c=0.2)   :  89.03%
Beta(a=2, b=4)                :  89.16%
Beta(a=2, b=10)               :  89.71%
Normal(m=0, sd=1)             :  89.14%
Weibull(scale=1, shape=2)     :  89.53%
Student(df=3)                 :  89.10%
Gumbel(loc=0, scale=1)        :  88.51%
Exp(rate=1)                   :  89.23%
Cauchy(x0=0, gamma=1)         :  89.14%
Pareto(loc=1, shape=0.5)      :  89.55%
Pareto(loc=1, shape=2)        :  89.27%
LogNormal(mlog=0, sdlog=1)    :  89.54%
LogNormal(mlog=0, sdlog=2)    :  89.12%
LogNormal(mlog=0, sdlog=3)    :  89.69%
Weibull(shape=0.3)            :  89.15%
Weibull(shape=0.5)            :  89.03%
Frechet(shape=1)              :  89.55%
Frechet(shape=3)              :  88.43%

*** n = 50, alpha = 95% ***
Uniform(a=0, b=1)             :  94.86%
Triangular(a=0, b=2, c=1)     :  94.54%
Triangular(a=0, b=2, c=0.2)   :  94.32%
Beta(a=2, b=4)                :  94.63%
Beta(a=2, b=10)               :  94.58%
Normal(m=0, sd=1)             :  94.97%
Weibull(scale=1, shape=2)     :  94.22%
Student(df=3)                 :  94.85%
Gumbel(loc=0, scale=1)        :  94.51%
Exp(rate=1)                   :  94.06%
Cauchy(x0=0, gamma=1)         :  94.57%
Pareto(loc=1, shape=0.5)      :  94.67%
Pareto(loc=1, shape=2)        :  94.84%
LogNormal(mlog=0, sdlog=1)    :  94.46%
LogNormal(mlog=0, sdlog=2)    :  94.60%
LogNormal(mlog=0, sdlog=3)    :  95.00%
Weibull(shape=0.3)            :  94.68%
Weibull(shape=0.5)            :  94.49%
Frechet(shape=1)              :  94.75%
Frechet(shape=3)              :  94.69%

*** n = 50, alpha = 99% ***
Uniform(a=0, b=1)             :  99.24%
Triangular(a=0, b=2, c=1)     :  99.11%
Triangular(a=0, b=2, c=0.2)   :  99.34%
Beta(a=2, b=4)                :  99.30%
Beta(a=2, b=10)               :  99.26%
Normal(m=0, sd=1)             :  99.21%
Weibull(scale=1, shape=2)     :  99.28%
Student(df=3)                 :  99.31%
Gumbel(loc=0, scale=1)        :  99.27%
Exp(rate=1)                   :  99.28%
Cauchy(x0=0, gamma=1)         :  99.31%
Pareto(loc=1, shape=0.5)      :  99.41%
Pareto(loc=1, shape=2)        :  99.40%
LogNormal(mlog=0, sdlog=1)    :  99.20%
LogNormal(mlog=0, sdlog=2)    :  99.34%
LogNormal(mlog=0, sdlog=3)    :  99.24%
Weibull(shape=0.3)            :  99.30%
Weibull(shape=0.5)            :  99.37%
Frechet(shape=1)              :  99.33%
Frechet(shape=3)              :  99.36%

It seems that the approach is accurate enough, the coverage percentage is quite close to the requested confidence level.