Trimmed Hodges-Lehmann location estimator, Part 2: Gaussian efficiency


In the previous post, we introduced the trimmed Hodges-Lehmann location estimator. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{THL}(\mathbf{x}, k) = \underset{k < i < j \leq n - k}{\operatorname{median}}\biggl(\frac{x_{(i)} + x_{(j)}}{2}\biggr). $$

We also derived exact expressions for its asymptotic and finite-sample breakdown points. In this post, we explore its Gaussian efficiency.
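The definition above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the implementation from the post: the function name `trimmed_hodges_lehmann` is hypothetical, and the sketch assumes that $k$ is a non-negative integer that leaves at least two observations after trimming.

```python
from itertools import combinations
from statistics import median


def trimmed_hodges_lehmann(x, k=0):
    """Median of pairwise averages over the order statistics x_(k+1), ..., x_(n-k).

    Hypothetical helper: a direct O(n^2) translation of the THL formula.
    """
    xs = sorted(x)
    n = len(xs)
    if k < 0 or n - 2 * k < 2:
        raise ValueError("need at least two observations after trimming")
    # Drop the k smallest and the k largest values (1-based indices k < i <= n - k).
    core = xs[k:n - k]
    # Median of (x_(i) + x_(j)) / 2 over all pairs i < j in the trimmed sample.
    return median((a + b) / 2 for a, b in combinations(core, 2))
```

With $k = 0$, the sketch reduces to the classic Hodges-Lehmann estimator over pairs $i < j$.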



Trimmed Hodges-Lehmann location estimator, Part 1: breakdown point


For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, the Hodges-Lehmann location estimator is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{i < j}{\operatorname{median}}\biggl(\frac{x_i + x_j}{2}\biggr). $$

Its asymptotic Gaussian efficiency is $\approx 96\%$, while its asymptotic breakdown point is $\approx 29\%$. This makes the Hodges-Lehmann location estimator a decent robust alternative to the mean.
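As a rough illustration of the definition, here is a brute-force Python sketch. The name `hodges_lehmann` is hypothetical, and the quadratic enumeration of pairs is chosen for clarity rather than speed.

```python
from itertools import combinations
from statistics import mean, median


def hodges_lehmann(x):
    """Median of the pairwise averages (x_i + x_j) / 2 over all pairs i < j."""
    if len(x) < 2:
        raise ValueError("need at least two observations")
    return median((a + b) / 2 for a, b in combinations(x, 2))


# A single extreme value moves the mean much more than the HL estimate.
sample = [1, 2, 3, 4, 100]
hl_estimate = hodges_lehmann(sample)
mean_estimate = mean(sample)
```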

While the Gaussian efficiency is quite impressive (almost as efficient as the mean), the breakdown point is not as great as in the case of the median (which has a breakdown point of $50\%$). Could we change this trade-off a little bit and make this estimator more robust, sacrificing a small portion of efficiency? Yes, we can!

In this post, I want to present the idea of the trimmed Hodges-Lehmann location estimator and provide the exact equation for its breakdown point.



Median of the shifts vs. shift of the medians, Part 2: Gaussian efficiency


In the previous post, we discussed the difference between shifts of the medians and the Hodges-Lehmann location shift estimator. In this post, we conduct a simple numerical simulation to evaluate the Gaussian efficiency of these two estimators.



Median of the shifts vs. shift of the medians, Part 1


Let us say that we have two samples $x = \{ x_1, x_2, \ldots, x_n \}$, $y = \{ y_1, y_2, \ldots, y_m \}$, and we want to estimate the shift of locations between them. In the case of the normal distribution, this task is quite simple and has a lot of straightforward solutions. However, in the nonparametric case, the location shift is an ambiguous metric which heavily depends on the chosen estimator. In the context of this post, we consider two approaches that may look similar. The first one is the shift of the medians:

$$ \newcommand{\DSM}{\Delta_{\operatorname{SM}}} \DSM = \operatorname{median}(y) - \operatorname{median}(x). $$

The second one is the median of all pairwise shifts, also known as the Hodges-Lehmann location shift estimator:

$$ \newcommand{\DHL}{\Delta_{\operatorname{HL}}} \DHL = \operatorname{median}(y_j - x_i). $$

In the case of the normal distributions, these estimators are consistent. However, this post will show an example of multimodal distributions that lead to opposite signs of $\DSM$ and $\DHL$.
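Both estimators are easy to sketch in Python. The function names `shift_of_medians` and `hl_shift` are hypothetical, and the toy samples below are made up only to show that the two values can differ; they are not the multimodal example from the post.

```python
from itertools import product
from statistics import median


def shift_of_medians(x, y):
    """Delta_SM: difference between the two sample medians."""
    return median(y) - median(x)


def hl_shift(x, y):
    """Delta_HL: median of all pairwise differences y_j - x_i."""
    return median(yj - xi for xi, yj in product(x, y))


# Toy samples (assumed for illustration only).
x = [0, 0, 10]
y = [1, 9, 9]
d_sm = shift_of_medians(x, y)
d_hl = hl_shift(x, y)
```

For these toy samples the shift of the medians is 9, while the median of the pairwise shifts is 1, so the two approaches can disagree noticeably even on tiny inputs.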



Resistance to the low-density regions: the Hodges-Lehmann location estimator


In the previous posts, I discussed the concept of a resistance function that shows the sensitivity of the given estimator to the low-density regions. I already showed how this function behaves for the mean, the sample median, and the Harrell-Davis median. In this post, I explore this function for the Hodges-Lehmann location estimator.



Kernel density estimation boundary correction: reflection (ggplot2 v3.4.0)


Kernel density estimation (KDE) is a popular way to approximate a distribution based on the given data. However, it has several flaws. One of the most significant flaws is that it extends the support of the distribution. It is pretty unfortunate: even if we know the actual range of supported values, KDE provides non-zero density values for the regions where no values exist. It is obviously an inaccurate estimation. The procedure of adjusting the KDE values according to the given boundaries is known as boundary correction. As usual, there are plenty of available boundary correction strategies.

One such strategy was implemented in the v3.4.0 update of ggplot2 (a popular R package for plotting) thanks to pull request #4013. Currently, it supports a single boundary correction strategy called reflection. In this post, we discuss this approach and see how it works in practice.
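To give a feeling for the general idea of reflection, here is a minimal Python sketch. It is not the ggplot2 implementation: the function names, the Gaussian kernel, and the fixed bandwidth `h` are all assumptions made for illustration. The density mass that an ordinary KDE places outside $[\text{lower}, \text{upper}]$ is mirrored back across the nearest boundary.

```python
import math


def gaussian_kde(data, h):
    """Plain Gaussian KDE with a fixed bandwidth h (hypothetical helper)."""
    norm = len(data) * h * math.sqrt(2 * math.pi)

    def density(t):
        return sum(math.exp(-0.5 * ((t - d) / h) ** 2) for d in data) / norm

    return density


def reflected_kde(data, h, lower, upper):
    """KDE with reflection at both boundaries (illustrative sketch)."""
    base = gaussian_kde(data, h)

    def density(t):
        if t < lower or t > upper:
            return 0.0  # no density outside the known support
        # Add the mirror images of the tails that leaked past each boundary.
        return base(t) + base(2 * lower - t) + base(2 * upper - t)

    return density
```

When the bandwidth is small relative to the width of the support, the corrected density integrates to approximately one over $[\text{lower}, \text{upper}]$ and is exactly zero outside of it.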



Sheather & Jones vs. unbiased cross-validation


In the post about the importance of kernel density estimation bandwidth, I reviewed several bandwidth selectors and showed their impact on the KDE. The classic selectors like Scott’s rule of thumb or Silverman’s rule of thumb are designed for the normal distribution and perform poorly in non-parametric cases. One of the most significant caveats is that they can mask multimodality. The same problem is also relevant to the biased cross-validation method. Among all the bandwidth selectors available in R, only Sheather & Jones and unbiased cross-validation provide reliable results in the multimodal case. However, I always advocate using the Sheather & Jones method rather than the unbiased cross-validation approach.

In this post, I will show the drawbacks of the unbiased cross-validation method and what kind of problems we can get if we use it as a KDE bandwidth selector.



Resistance to the low-density regions: the Harrell-Davis median


In the previous post, we defined the resistance function that shows the sensitivity of the given estimator to the low-density regions. We also showed the resistance function plots for the mean and the sample median. In this post, we explore the corresponding plots for the Harrell-Davis median.



Resistance to the low-density regions: the mean and the median


When we discuss resistant statistics, we typically assume resistance to extreme values. However, extreme values are not the only source of problems that can violate the usual assumptions about the expected metric distribution. Low-density regions, which often arise in multimodal distributions, can also corrupt the results of the statistical analysis. In this post, I discuss this problem and introduce a measure of resistance to low-density regions.



Finite-sample Gaussian efficiency of the trimmed Harrell-Davis median estimator


In the previous post, we obtained the finite-sample Gaussian efficiency values of the sample median and the Harrell-Davis median. In this post, we extend these results and obtain the finite-sample Gaussian efficiency values of the trimmed Harrell-Davis median estimator based on the highest density interval of the width $1/\sqrt{n}$.
