# Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategies

Let’s say we want to
**estimate the median**
based on a **small sample** (3 \(\leq n \leq 7\))
from a **right-skewed heavy-tailed distribution**
with **high statistical efficiency**.

The traditional median estimator is the most robust estimator, but it’s not the most efficient one. Typically, the Harrell-Davis quantile estimator provides better efficiency, but it’s not robust (its breakdown point is zero), so it may have worse efficiency in the given case. The winsorized and trimmed modifications of the Harrell-Davis quantile estimator provide a good trade-off between efficiency and robustness, but they require a proper winsorizing/trimming rule. A reasonable choice of such a rule for medium-size samples is based on the highest density interval of the Beta function (as described here). Unfortunately, this approach may be suboptimal for small samples. E.g., if we use the 99% highest density interval to estimate the median, it starts to trim sample values only for \(n \geq 8\).

In this post, we are going to discuss custom winsorizing/trimming strategies for special cases of the quantile estimation problem.

### Beware of small samples

So, what’s wrong with the small samples from heavy-tailed distributions? There is a high chance to observe an extremely large value in such a sample. As an example, let’s consider the Pareto distribution. Here is the PDF for \(\textrm{Pareto}(x_m = 1, \alpha = 1)\):

The quantile function for the Pareto distribution is defined as follows:

\[Q_\textrm{Pareto}(p) = x_m (1 - p)^{(-1/\alpha)} \]

In our case (\(x_m = 1, \alpha = 1\)), we have:

\[Q_{\textrm{Pareto}(x_m = 1,\;\alpha = 1)}(p) = \frac{1}{1 - p} \]

Here are some of the quantile values:

```
q[0.500] = 2
q[0.750] = 4
q[0.900] = 10
q[0.950] = 20
q[0.990] = 100
q[0.999] = 1000
```

Thus, while the median value is \(2\), the value of the \(99.9^\textrm{th}\) percentile is 1000. It may look like a small probability, but this impression is deceiving. For example, if we consider 5000 values (or 1000 of 5-element samples), there is a 99.3% chance that at least one of the values is higher than 1000.

### The problem of the Harrell-Davis quantile estimator

The Harrell-Davis approach suggests estimating the quantile value as a weighted sum of all sample elements where all weight values are positive. Thus, its breakdown point is zero: a single “extreme” value may “corrupt” the estimation. Let’s say we have the following sample from \(\textrm{Pareto}(x_m = 1, \alpha = 1)\):

\[x = \{ 1, 2, 1000 \} \]

The traditional median estimator just selects the middle element. In our case, it gives

\[Q_\textrm{Traditional}(x) = x_2 = 2 \]

which is a pretty accurate median estimation. The Harrell-Davis quantile estimator produces a worse estimation:

\[Q_\textrm{HD}(x) \approx 0.259259\cdot x_1 + 0.481481\cdot x_2 + 0.259259\cdot x_3 \approx 260.481. \]

The Harrell-Davis quantile estimator is a good approach in many cases because it typically provides better efficiency then the traditional one. Unfortunately, in the considered case (small samples form heavy-tailed distribution), its efficiency is quite bad because it’s not robust enough. Let’s try to fix this problem.

### Smart winsorizing/trimming

If we want to improve the Harrell-Davis quantile estimator robustness, we can consider its winsorized and trimmed modifications. In one of the previous posts, I described an approach that selects the trimming percentage based on the highest density interval of the corresponding beta function. It works great for medium-size samples, but it’s useless for small samples. If we use the 99% highest density interval to estimate the median, the trimming percentage will be positive only for \(n \geq 8\).

In our case, we can consider a custom trimming strategy. Our greatest enemy here is the random extreme values. We want to trim them whenever it’s possible.

Let’s consider a sample \(x\) with \(n\) elements where all the samples elements are already sorted:

\[x_1 \leq x_2 \leq \ldots \leq x_n. \]

Now let’s discuss how we should trim the samples for different \(n\) values:

- \(n=3\)

It’s not a good idea to trim \(x_2\) because it’s our traditional median estimation. It’s also doesn’t make sense to trim \(x_1\) because the considered distribution is right-skewed (\(x_1\) is the minimum element of the sample; it has the smallest chance to be “extreme”). Thus, the reasonable decision is to trim \(x_3\). - \(n=4\)

The traditional median estimator for samples with even size uses two middle elements. For \(n=4\), these elements are \(x_2\) and \(x_3\), so we don’t want to trim them. It’s still useless to trim \(x_1\), so our only option is to trim \(x_4\). - \(n=5\)

Here the traditional median estimator uses only \(x_3\). So, we can trim two elements: \(x_4\) and \(x_5\). - \(n=6\)

By analogy, we can trim \(x_5\) and \(x_6\). - \(n=7\)

Here the traditional median estimator uses only \(x_4\). Thus, we can trim three elements: \(x_5\), \(x_6\), and \(x_7\).

This strategy can be described using the following scheme
(`@`

denotes non-trimmed element; `x`

denotes trimmed element; assuming sorted sample):

```
n = 3: @@x
n = 4: @@@x
n = 5: @@@xx
n = 6: @@@@xx
n = 7: @@@@xxx
```

It’s time to check how it works.

### Numerical simulation

We are going to compare four quantile estimators:

`hf7`

: the traditional quantile estimator (also known as the Hyndman-Fan Type 7)`hd`

: the Harrell-Davis quantile estimator`whd`

: the winsorized Harrell-Davis quantile estimator where the winsorized elements are defined according to the scheme from the previous section`thd`

: the trimmed Harrell-Davis quantile estimator where the trimmed elements are defined according to the scheme from the previous section

Typically, the estimator efficiency is evaluated using the mean squared error (`mse`

) and the bias.
This approach works good for light-tailed error distribution,
but it could be misleading for heavy-tailed distribution.
Thus, in addition to the classic bias and mse metric,
we also consider different percentiles of the absolute values of simulated errors.
Also, we are going to draw the PDF of the error distributions.
Each error distribution is based on 100'000 random samples.

In this simulation, we are going to estimate the median for the normal distribution (the true median value is \(0\)) and for \(\textrm{Pareto}(x_m = 1, \alpha = 1)\) (the true median value is \(2\)).

#### Normal distribution (n = 5)

Let’s start with the normal distribution (which is symmetric and light-tailed) to do a dry-check of our estimators.

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 1.9989 4.2806 1.6406 1.9986 2.3571 2.6808 2.8774 3.2498
hd 1.9983 4.2088 1.6835 2.0003 2.3115 2.5946 2.7598 3.0799
whd 2.1971 5.0761 1.8597 2.1935 2.5307 2.8371 3.0237 3.3723
thd 2.2851 5.4642 1.9532 2.2805 2.6133 2.9198 3.1023 3.4480
```

As we can see, `hf7`

and `hd`

give pretty similar results.
In this particular case, `whd`

and `thd`

give worse results (but they are still pretty acceptable)
It’s an expected situation because our winsorizing/trimming strategy is “overfitted” to the right-skewed case
while the actual distribution is symmetric.

#### Pareto distribution (n = 3)

Now it’s time to check our estimators on the most complicated case: we should guess the median value of \(\textrm{Pareto}(x_m = 1, \alpha = 1)\) based only on three elements!

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 1.5123 62.215 0.32717 0.63312 1.06588 3.1005 5.3714 15.368
hd 8.9820 53457.769 0.37791 0.82852 2.94861 8.5900 17.2297 82.489
whd 1.1791 13.033 0.32245 0.61093 0.89666 2.2888 3.9940 11.355
thd 1.0959 10.468 0.32505 0.60616 0.87402 2.0557 3.6763 10.116
```

Expectedly, `hd`

has the worst error values because of its zero breakdown point.
Meanwhile, `whd`

and `thd`

have better error values than `hf7`

(which is our baseline).
They are better not only based on the “classic” `bias`

and `mse`

metrics but only based on all the percentile values.

Let’s see what happens for higher \(n\) values.

#### Pareto distribution (n = 4)

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 1.3685 17.729 0.28709 0.58347 1.2009 3.0084 4.9603 12.553
hd 8.9832 341193.607 0.35421 0.82811 2.6347 6.9242 13.0648 55.666
whd 1.3481 75.375 0.28768 0.57450 1.1052 2.8478 4.7321 12.135
thd 1.2203 159.813 0.27968 0.55621 0.9480 2.5127 4.1917 10.584
```

#### Pareto distribution (n = 5)

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 0.93476 3.6081e+00 0.26418 0.52748 0.85761 2.0465 3.2767 7.5615
hd 6.89114 9.5374e+05 0.32004 0.74949 2.20399 5.3076 9.3283 34.6076
whd 0.75266 1.9533e+00 0.26330 0.51288 0.77389 1.4601 2.3995 5.3516
thd 0.67432 1.3032e+00 0.26495 0.50669 0.75081 1.1632 1.9720 4.4307
```

#### Pareto distribution (n = 6)

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 0.86396 3.1662 0.24138 0.49037 0.84813 1.9343 3.0090 6.4052
hd 2.45124 1146.6008 0.29475 0.67508 1.83344 4.1168 6.7717 21.5434
whd 0.80034 2.5351 0.23669 0.47788 0.78792 1.7473 2.7498 5.8945
thd 0.70381 1.6713 0.23367 0.46480 0.73381 1.4345 2.2906 4.9982
```

#### Pareto distribution (n = 7)

```
estimator bias mse p25 p50 p75 p90 p95 p99
hf7 0.72789 1.7341e+00 0.22666 0.46096 0.75719 1.58315 2.4422 5.0349
hd 3.81677 4.7061e+05 0.25941 0.58495 1.49718 3.19833 4.9741 12.9881
whd 0.59597 9.2458e-01 0.22620 0.44712 0.68831 1.09442 1.7404 3.6767
thd 0.54333 6.5777e-01 0.23242 0.44939 0.67056 0.88692 1.3906 2.9339
```

### Conclusion

In this post, we performed a numerical simulation that compared the statistical efficiency
of the median estimation on small samples from right-skewed heavy-tailed distributions.
We compared four quantile estimators:
the traditional one (also known as the Hyndman-Fan Type 7),
the Harrell-Davis quantile estimator,
and its winsorized/trimming modifications that use a custom trimming strategy optimized for the given case.
Based on the simulation results, we can conclude that the **trimmed modification is the clear winner**.