Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategies


Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A paper with the final results is available in Communications in Statistics - Simulation and Computation (DOI: 10.1080/03610918.2022.2050396). A preprint is available on arXiv: arXiv:2111.11776 [stat.ME]. Some information in this blog post may be obsolete; please use the official paper as the primary reference.

Let’s say we want to estimate the median based on a small sample ($3 \leq n \leq 7$) from a right-skewed heavy-tailed distribution with high statistical efficiency.

The traditional median estimator is the most robust estimator, but it’s not the most efficient one. Typically, the Harrell-Davis quantile estimator provides better efficiency, but it’s not robust (its breakdown point is zero), so it may have worse efficiency in the given case. The winsorized and trimmed modifications of the Harrell-Davis quantile estimator provide a good trade-off between efficiency and robustness, but they require a proper winsorizing/trimming rule. A reasonable choice of such a rule for medium-size samples is based on the highest density interval of the Beta distribution (as described here). Unfortunately, this approach may be suboptimal for small samples. For example, if we use the 99% highest density interval to estimate the median, it starts to trim sample values only for $n \geq 8$.

In this post, we are going to discuss custom winsorizing/trimming strategies for special cases of the quantile estimation problem.

Beware of small samples

So, what’s wrong with small samples from heavy-tailed distributions? There is a high chance of observing an extremely large value in such a sample. As an example, let’s consider the Pareto distribution. Here is the PDF for $\textrm{Pareto}(x_m = 1, \alpha = 1)$:
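
$$ f_{\textrm{Pareto}(x_m,\;\alpha)}(x) = \frac{\alpha\, x_m^\alpha}{x^{\alpha + 1}} \quad \textrm{for } x \geq x_m, \qquad f_{\textrm{Pareto}(1,\;1)}(x) = \frac{1}{x^2} \quad \textrm{for } x \geq 1. $$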

The quantile function for the Pareto distribution is defined as follows:

$$ Q_\textrm{Pareto}(p) = x_m (1 - p)^{(-1/\alpha)} $$

In our case ($x_m = 1, \alpha = 1$), we have:

$$ Q_{\textrm{Pareto}(x_m = 1,\;\alpha = 1)}(p) = \frac{1}{1 - p} $$

Here are some of the quantile values:

q[0.500] = 2
q[0.750] = 4
q[0.900] = 10
q[0.950] = 20
q[0.990] = 100
q[0.999] = 1000

Thus, while the median value is $2$, the value of the $99.9^\textrm{th}$ percentile is $1000$. The probability of observing such a large value may look small, but this impression is deceiving. For example, if we consider 5000 values (or 1000 five-element samples), there is a 99.3% chance that at least one of the values is higher than 1000.
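
To double-check this number: for $\textrm{Pareto}(x_m = 1, \alpha = 1)$ we have $P(X > 1000) = 1/1000$, so among $5000$ independent values the chance that at least one exceeds $1000$ is $1 - 0.999^{5000} \approx 0.993$. A minimal Python sketch of this calculation:

# Chance that at least one of 5000 i.i.d. Pareto(1, 1) values exceeds 1000
p_exceed = 1 / 1000                              # P(X > 1000) = (x_m / 1000)^alpha = 0.001
n_values = 5000
p_at_least_one = 1 - (1 - p_exceed) ** n_values
print(round(p_at_least_one, 3))                  # 0.993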

The problem of the Harrell-Davis quantile estimator

The Harrell-Davis approach suggests estimating the quantile value as a weighted sum of all sample elements where all weight values are positive. Thus, its breakdown point is zero: a single “extreme” value may “corrupt” the estimation. Let’s say we have the following sample from $\textrm{Pareto}(x_m = 1, \alpha = 1)$:

$$ x = \{ 1, 2, 1000 \} $$

The traditional median estimator just selects the middle element. In our case, it gives

$$ Q_\textrm{Traditional}(x) = x_2 = 2 $$

which is a pretty accurate median estimation. The Harrell-Davis quantile estimator produces a worse estimation:

$$ Q_\textrm{HD}(x) \approx 0.259259\cdot x_1 + 0.481481\cdot x_2 + 0.259259\cdot x_3 \approx 260.481. $$
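
These weights can be reproduced directly. For the $p$-th quantile of a sample of size $n$, the Harrell-Davis estimator takes $W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b)$, where $I_x(a, b)$ is the CDF of the Beta distribution with $a = (n+1)p$ and $b = (n+1)(1-p)$. Here is a minimal Python sketch (using SciPy; the helper name harrell_davis is just for illustration):

import numpy as np
from scipy.stats import beta

def harrell_davis(x, p):
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # Beta((n + 1) * p, (n + 1) * (1 - p)) CDF evaluated on the grid 0, 1/n, ..., 1
    cdf = beta.cdf(np.arange(n + 1) / n, (n + 1) * p, (n + 1) * (1 - p))
    weights = np.diff(cdf)            # W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b)
    return weights, weights @ x

weights, estimate = harrell_davis([1, 2, 1000], p=0.5)
print(weights)    # [0.259259 0.481481 0.259259]
print(estimate)   # ~260.481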

The Harrell-Davis quantile estimator is a good approach in many cases because it typically provides better efficiency than the traditional one. Unfortunately, in the considered case (small samples from a heavy-tailed distribution), its efficiency is quite poor because it’s not robust enough. Let’s try to fix this problem.

Smart winsorizing/trimming

If we want to improve the robustness of the Harrell-Davis quantile estimator, we can consider its winsorized and trimmed modifications. In one of the previous posts, I described an approach that selects the trimming percentage based on the highest density interval of the corresponding Beta distribution. It works great for medium-size samples, but it’s useless for small samples: if we use the 99% highest density interval to estimate the median, the trimming percentage will be positive only for $n \geq 8$.

In our case, we can consider a custom trimming strategy. Our greatest enemy here is random extreme values: we want to trim them whenever possible.

Let’s consider a sample $x$ with $n$ elements where all the sample elements are already sorted:

$$ x_1 \leq x_2 \leq \ldots \leq x_n. $$

Now let’s discuss how we should trim the samples for different $n$ values:

  • $n=3$
    It’s not a good idea to trim $x_2$ because it’s our traditional median estimation. It also doesn’t make sense to trim $x_1$ because the considered distribution is right-skewed ($x_1$ is the minimum element of the sample; it has the smallest chance of being “extreme”). Thus, the reasonable decision is to trim $x_3$.
  • $n=4$
    The traditional median estimator for samples with even size uses two middle elements. For $n=4$, these elements are $x_2$ and $x_3$, so we don’t want to trim them. It’s still useless to trim $x_1$, so our only option is to trim $x_4$.
  • $n=5$
    Here the traditional median estimator uses only $x_3$. So, we can trim two elements: $x_4$ and $x_5$.
  • $n=6$
    By analogy, we can trim $x_5$ and $x_6$.
  • $n=7$
    Here the traditional median estimator uses only $x_4$. Thus, we can trim three elements: $x_5$, $x_6$, and $x_7$.

This strategy can be described using the following scheme (@ denotes a non-trimmed element, x denotes a trimmed element; the sample is assumed to be sorted):

n = 3: @@x
n = 4: @@@x
n = 5: @@@xx
n = 6: @@@@xx
n = 7: @@@@xxx
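
Here is a sketch of how this scheme could be turned into estimators (Python; the exact weight handling below is my assumption rather than the reference implementation): the number of kept elements is $\lfloor n/2 \rfloor + 1$; for the trimmed version, the remaining Harrell-Davis weights are renormalized to sum to one, while for the winsorized version, the trimmed weights are transferred to the highest kept element.

import numpy as np
from scipy.stats import beta

def hd_weights(n, p):
    # Harrell-Davis weights for the p-th quantile of a sorted sample of size n
    cdf = beta.cdf(np.arange(n + 1) / n, (n + 1) * p, (n + 1) * (1 - p))
    return np.diff(cdf)

def thd_median(x):
    # Trimmed HD median: drop the top ceil(n / 2) - 1 elements, renormalize weights
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    keep = n // 2 + 1                  # n = 3..7 -> keep 2, 3, 3, 4, 4 elements
    w = hd_weights(n, 0.5)[:keep]
    return (w / w.sum()) @ x[:keep]

def whd_median(x):
    # Winsorized HD median: move the trimmed weights to the highest kept element
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    keep = n // 2 + 1
    w = hd_weights(n, 0.5)
    w[keep - 1] += w[keep:].sum()
    return w[:keep] @ x[:keep]

print(thd_median([1, 2, 1000]))   # ~1.65
print(whd_median([1, 2, 1000]))   # ~1.74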

It’s time to check how it works.

Numerical simulation

We are going to compare four quantile estimators:

  • hf7: the traditional quantile estimator (also known as the Hyndman-Fan Type 7)
  • hd: the Harrell-Davis quantile estimator
  • whd: the winsorized Harrell-Davis quantile estimator where the winsorized elements are defined according to the scheme from the previous section
  • thd: the trimmed Harrell-Davis quantile estimator where the trimmed elements are defined according to the scheme from the previous section

Typically, the estimator efficiency is evaluated using the mean squared error (mse) and the bias. This approach works well for light-tailed error distributions, but it could be misleading for heavy-tailed distributions. Thus, in addition to the classic bias and mse metrics, we also consider different percentiles of the absolute values of the simulated errors. Also, we are going to draw the PDF of the error distributions. Each error distribution is based on 100,000 random samples.

In this simulation, we are going to estimate the median for the normal distribution (the true median value is $0$) and for $\textrm{Pareto}(x_m = 1, \alpha = 1)$ (the true median value is $2$).
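
Here is a rough sketch of such a simulation (Python/NumPy; the setup of the original study may differ in details) for the Pareto case:

import numpy as np

rng = np.random.default_rng(1729)

def pareto_sample(n):
    # Inverse transform sampling: Q(p) = 1 / (1 - p) for Pareto(x_m = 1, alpha = 1)
    return 1 / (1 - rng.random(n))

def summarize(errors):
    abs_err = np.abs(errors)
    summary = {"bias": errors.mean(), "mse": (errors ** 2).mean()}
    for q in (25, 50, 75, 90, 95, 99):
        summary[f"p{q}"] = np.quantile(abs_err, q / 100)
    return summary

def simulate(estimator, n, true_median=2.0, iterations=100_000):
    errors = np.array([estimator(pareto_sample(n)) - true_median
                       for _ in range(iterations)])
    return summarize(errors)

# Example: the hf7 baseline (np.median matches Hyndman-Fan Type 7 for the median)
print(simulate(np.median, n=5))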

Normal distribution (n = 5)

Let’s start with the normal distribution (which is symmetric and light-tailed) as a sanity check of our estimators.

 estimator       bias     mse     p25     p50     p75     p90     p95    p99
       hf7 -0.0016329 0.29153 0.16804 0.36103 0.61964 0.89926 1.06836 1.3611
        hd -0.0056730 0.21042 0.14671 0.31081 0.52609 0.74722 0.89905 1.1936
       whd -0.1920940 0.28335 0.17178 0.36076 0.60780 0.87269 1.04119 1.3885
       thd -0.2879419 0.32841 0.18283 0.38710 0.65326 0.94229 1.14017 1.4771

As we can see, hf7 and hd give pretty similar results, with hd being a bit more efficient. In this particular case, whd and thd give worse results (but they are still acceptable). This is an expected outcome because our winsorizing/trimming strategy is “overfitted” to the right-skewed case, while the actual distribution is symmetric.

Pareto distribution (n = 3)

Now it’s time to check our estimators on the most complicated case: we should guess the median value of $\textrm{Pareto}(x_m = 1, \alpha = 1)$ based only on three elements!

 estimator     bias        mse     p25     p50     p75    p90     p95    p99
       hf7  0.96402     14.093 0.33215 0.63298 1.09961 3.1286  5.4061 15.677
        hd 12.75015 175467.942 0.37914 0.84069 3.02464 8.9115 18.5175 90.129
       whd  0.62981     14.147 0.32802 0.61526 0.89961 2.4955  4.2395 11.140
       thd  0.48012     23.503 0.33245 0.61360 0.88120 1.9543  3.4305  9.819

Expectedly, hd has the worst error values because of its zero breakdown point. Meanwhile, whd and thd have better error values than hf7 (which is our baseline). They are better not only based on the “classic” bias and mse metrics, but also based on all the percentile values.

Let’s see what happens for higher $n$ values.

Pareto distribution (n = 4)

 estimator    bias      mse     p25     p50     p75    p90     p95    p99
       hf7 0.99630   13.376 0.28927 0.58443 1.22370 3.1813  5.1862 12.015
        hd 4.33302 1376.949 0.35408 0.81399 2.56402 6.8722 12.2750 53.703
       whd 0.95874   25.060 0.28729 0.57546 1.12468 2.9193  4.8311 12.190
       thd 0.72518   11.554 0.27726 0.54553 0.91749 2.3926  4.0997 10.723

Pareto distribution (n = 5)

 estimator     bias        mse     p25     p50     p75    p90    p95     p99
       hf7 0.504146     5.2105 0.26675 0.52613 0.87303 2.1065 3.3679  7.2826
        hd 4.396743 22596.2795 0.31288 0.74502 2.24599 5.3275 9.3600 34.9101
       whd 0.231205     2.1152 0.26279 0.50793 0.77450 1.4745 2.4296  5.8547
       thd 0.067092     1.6300 0.26268 0.50467 0.74328 1.1175 1.9506  4.6478

Pareto distribution (n = 6)

 estimator    bias       mse     p25     p50     p75    p90    p95     p99
       hf7 0.51371    2.5938 0.23690 0.48757 0.86252 2.0152 3.0778  6.4690
        hd 2.46898 1195.1400 0.29120 0.66492 1.78711 4.0332 6.8272 20.9587
       whd 0.42000    2.5518 0.23282 0.47884 0.79029 1.7981 2.8478  6.0907
       thd 0.25985    1.4532 0.23638 0.46981 0.74189 1.4778 2.3190  4.9017

Pareto distribution (n = 7)

 estimator      bias        mse     p25     p50     p75    p90    p95     p99
       hf7  0.337121    1.85904 0.23519 0.46158 0.75627 1.5838 2.4504  5.3569
        hd  1.975857 1989.76770 0.26722 0.60452 1.52104 3.2602 5.0468 13.5435
       whd  0.082059    0.88772 0.23247 0.44718 0.68616 1.0932 1.7480  3.8571
       thd -0.039714    0.68217 0.23089 0.45429 0.67313 0.8878 1.3837  3.0663

Conclusion

In this post, we performed a numerical simulation that compared the statistical efficiency of median estimation on small samples from right-skewed heavy-tailed distributions. We compared four quantile estimators: the traditional one (also known as the Hyndman-Fan Type 7), the Harrell-Davis quantile estimator, and its winsorized and trimmed modifications that use a custom trimming strategy optimized for the given case. Based on the simulation results, we can conclude that the trimmed modification is the clear winner.


References

  1. Trimmed modification of the Harrell-Davis quantile estimator (2021-03-30)
  2. Winsorized modification of the Harrell-Davis quantile estimator (2021-03-02)