Efficiency of the Harrell-Davis quantile estimator

Andrey Akinshin · 2021-03-23

Update: this blog post is part of research that aimed to build a statistically efficient and robust quantile estimator. A paper with the final results is available in Communications in Statistics - Simulation and Computation (DOI: 10.1080/03610918.2022.2050396). A preprint is available on arXiv: arXiv:2111.11776 [stat.ME]. Some information in this blog post may be obsolete; please use the official paper as the primary reference.

One of the most essential properties of a quantile estimator is its efficiency. Simply put, efficiency describes how accurate the estimator is. The Harrell-Davis quantile estimator is a good option for achieving higher efficiency. However, this estimator may provide lower efficiency in some special cases. In this post, we will conduct a set of simulations that show the actual efficiency numbers. We compare different distributions (symmetric and right-skewed, heavy-tailed and light-tailed), quantiles, and sample sizes.

Simulation design

The relative efficiency value depends on five parameters:

Target quantile estimator
Baseline quantile estimator
Estimated quantile $p$
Sample size $n$
Distribution

Our target quantile estimator is the Harrell-Davis quantile estimator, or HDQE ([Harrell1982]):

$$ Q_{HD}(p) = \sum_{i=1}^{n} W_{i} \cdot x_{(i)}, \quad W_{i} = I_{i/n}(a, b) - I_{(i-1)/n}(a, b), \quad a = p(n+1),\; b = (1-p)(n+1) $$

where $I_t(a, b)$ denotes the regularized incomplete beta function and $x_{(i)}$ is the $i^\textrm{th}$ order statistic.
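As a rough illustration of the formula above, here is a minimal Python sketch of HDQE. It is not taken from the original simulation code: the names `betainc` and `harrell_davis` are illustrative, and the regularized incomplete beta function is evaluated with a textbook continued-fraction expansion so that the block needs only the standard library. In practice, a library routine such as `scipy.special.betainc` would normally replace the hand-written helper.

```python
import math

def _beta_cf(a, b, x):
    """Continued-fraction evaluation of I_x(a, b); meant for x < (a+1)/(a+b+2)."""
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 500):
        # Even and odd coefficients of the continued fraction.
        terms = (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                 -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)))
        for t in terms:
            d = 1.0 + t * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + t / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return front * h / a

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x < (a + 1.0) / (a + b + 2.0):
        return _beta_cf(a, b, x)
    # Symmetry I_x(a, b) = 1 - I_{1-x}(b, a) keeps the fraction fast to converge.
    return 1.0 - _beta_cf(b, a, 1.0 - x)

def harrell_davis(sample, p):
    """Harrell-Davis estimate of the p-th quantile, for 0 < p < 1."""
    xs = sorted(sample)
    n = len(xs)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    cdf = [betainc(a, b, i / n) for i in range(n + 1)]
    # W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b), applied to the i-th order statistic.
    return sum((cdf[i] - cdf[i - 1]) * xs[i - 1] for i in range(1, n + 1))

# For n = 3 and p = 0.5 the weights are 7/27, 13/27, 7/27, so the result is close to 2.
print(harrell_davis([1.0, 2.0, 3.0], 0.5))
```

Every order statistic gets a non-zero weight here, so the estimate depends on the whole sample rather than on two neighboring values.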

The conventional baseline quantile estimator in such simulations is the traditional quantile estimator that is defined as a linear combination of two subsequent order statistics. To be more specific, we are going to use the Type 7 quantile estimator from the Hyndman-Fan classification, or HF7QE ([Hyndman1996]). It can be expressed as follows (assuming one-based indexing):

$$ Q_{HF7}(p) = x_{(\lfloor h \rfloor)}+(h-\lfloor h \rfloor)(x_{(\lfloor h \rfloor+1)}-x_{(\lfloor h \rfloor)}),\quad h = (n-1)p+1. $$
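The HF7QE definition translates almost directly into code. The sketch below is only an illustration with a made-up function name (`hf7`); it is the same interpolation rule that `numpy.quantile` uses by default and that R exposes as `quantile(..., type = 7)`.

```python
import math

def hf7(sample, p):
    """Hyndman-Fan Type 7 estimate of the p-th quantile, for 0 <= p <= 1."""
    xs = sorted(sample)
    n = len(xs)
    h = (n - 1) * p + 1      # one-based fractional position
    lo = math.floor(h)       # index of the lower order statistic
    hi = min(lo + 1, n)      # guard for p = 1, where h = n
    return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])

print(hf7([1, 2, 3, 4], 0.5))            # 2.5
print(hf7([10, 20, 30, 40, 50], 0.25))   # 20.0
```

Only the two order statistics around position $h$ are used, which is the main structural difference from HDQE.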

Thus, we are going to estimate the relative efficiency of HDQE compared to the traditional quantile estimator HF7QE. For the $p^\textrm{th}$ quantile, the relative efficiency can be calculated as the ratio of the estimator mean squared errors ($\textrm{MSE}$):

$$ \textrm{Efficiency}(p) = \dfrac{\textrm{MSE}(Q_{HF7}, p)}{\textrm{MSE}(Q_{HD}, p)} = \dfrac{\operatorname{E}[(Q_{HF7}(p) - \theta(p))^2]}{\operatorname{E}[(Q_{HD}(p) - \theta(p))^2]} $$

where $\theta(p)$ is the true value of the $p^\textrm{th}$ quantile. The $\textrm{MSE}$ value depends on the sample size $n$, so it should be calculated independently for each sample size value.

Finally, we should choose the distributions for sample generation. Initially, I wanted to repeat the numerical experiment from [Harrell1982] (Section 3). The authors used the generalized lambda distribution, which is defined by its quantile function:

$$ F^{-1}(p) = \mu + \sigma (p^a - (1-p)^b) $$

They state that they used $\mu = 0$, $\sigma = 1$, and $a, b$ from the following table:

Case	a	b	Description
(a)	1	1	Light-tailed symmetric
(b)	0.1349	0.1349	Normal-like
(c)	-0.1359	-0.1359	Very heavy-tailed symmetric, like t-distribution with 5 degrees of freedom
(d)	-1	-1	Cauchy-like
(e)	0.0251	0.0953	Medium-tailed asymmetric
(f)	0	0.0004	Exponential-like

Unfortunately, there are problems with this table. Cases (c) and (d) are invalid because we can’t use negative values of $a$ and $b$ with a positive value of $\sigma$: such a combination produces a decreasing quantile function. Even if we flip the sign of $\sigma$, the suggested distributions are not close to their descriptions. For example, (b) doesn’t look like the normal distribution, (d) doesn’t look like the Cauchy distribution, and (f) doesn’t look like the exponential distribution.
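The monotonicity problem with cases (c) and (d) is easy to check numerically. The following sketch evaluates the quantile function from the formula above on a grid of $p$ values and reports whether it is increasing; the function names and the grid are illustrative choices, not part of the original experiment.

```python
def gld_quantile(p, a, b, mu=0.0, sigma=1.0):
    """Quantile function F^{-1}(p) = mu + sigma * (p^a - (1 - p)^b)."""
    return mu + sigma * (p ** a - (1 - p) ** b)

def is_increasing(a, b, sigma=1.0, steps=99):
    """Check strict monotonicity on the grid p = 0.01, 0.02, ..., 0.99."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    values = [gld_quantile(p, a, b, sigma=sigma) for p in grid]
    return all(x < y for x, y in zip(values, values[1:]))

cases = {
    "a": (1, 1),
    "b": (0.1349, 0.1349),
    "c": (-0.1359, -0.1359),
    "d": (-1, -1),
    "e": (0.0251, 0.0953),
    "f": (0, 0.0004),
}
for name, (a, b) in cases.items():
    # Cases (c) and (d) report False for sigma = 1 and True for sigma = -1.
    print(name, is_increasing(a, b, sigma=1.0), is_increasing(a, b, sigma=-1.0))
```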

So, I decided to build my own set of distributions and evaluate the efficiency of the Harrell-Davis quantile estimator in each case.

For each distribution, we are going to do the following:

Enumerate all the percentiles and calculate the true percentile value $\theta(p)$ for each distribution
Enumerate different sample sizes (from 3 to 60)
Generate a bunch of random samples, estimate the percentile values using two estimators, calculate the relative efficiency of the Harrell-Davis quantile estimator.
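The steps above can be sketched as a small Monte Carlo loop. This is an assumption-laden illustration rather than the code behind the plots: the iteration count, the seed, the chosen sample sizes, and all function names are made up for the example, and compact implementations of both estimators are included so that the block runs on its own with the standard library.

```python
import math
import random

def _beta_cf(a, b, x):
    """Continued-fraction evaluation of I_x(a, b); meant for x < (a+1)/(a+b+2)."""
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 500):
        terms = (m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m)),
                 -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1)))
        for t in terms:
            d = 1.0 + t * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + t / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return front * h / a

def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b) for a, b > 0."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    if x < (a + 1.0) / (a + b + 2.0):
        return _beta_cf(a, b, x)
    return 1.0 - _beta_cf(b, a, 1.0 - x)

def hd_weights(n, p):
    """HDQE weights W_1..W_n; they depend only on n and p, so compute them once."""
    a, b = p * (n + 1), (1 - p) * (n + 1)
    cdf = [betainc(a, b, i / n) for i in range(n + 1)]
    return [cdf[i + 1] - cdf[i] for i in range(n)]

def hf7_sorted(xs, p):
    """HF7QE for an already sorted sample."""
    h = (len(xs) - 1) * p + 1
    lo = math.floor(h)
    hi = min(lo + 1, len(xs))
    return xs[lo - 1] + (h - lo) * (xs[hi - 1] - xs[lo - 1])

def relative_efficiency(sampler, theta, n, p, iterations=2000, seed=1):
    """Estimate MSE(HF7) / MSE(HD) for the p-th quantile with true value theta."""
    rng = random.Random(seed)
    weights = hd_weights(n, p)
    sq_err_hd = sq_err_hf7 = 0.0
    for _ in range(iterations):
        xs = sorted(sampler(rng) for _ in range(n))
        q_hd = sum(w * x for w, x in zip(weights, xs))
        sq_err_hd += (q_hd - theta) ** 2
        sq_err_hf7 += (hf7_sorted(xs, p) - theta) ** 2
    return sq_err_hf7 / sq_err_hd

# Median of N(0, 1): the true value is 0. Values above 1 mean HDQE has the lower MSE.
for n in (5, 10, 20):
    eff = relative_efficiency(lambda rng: rng.gauss(0.0, 1.0), 0.0, n, 0.5)
    print(f"n={n}: efficiency ~ {eff:.2f}")
```

Both estimators are evaluated on the same samples, which reduces the noise of the ratio compared to using independent samples for each estimator.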

Let’s look at the results for different groups of distributions.

Symmetric

We start with symmetric distributions from the following list:

distribution	description
Beta(2,2)	Beta distribution with a=b=2
Beta(3,3)	Beta distribution with a=b=3
Beta(4,4)	Beta distribution with a=b=4
Beta(5,5)	Beta distribution with a=b=5
U(0,1)	Uniform distribution on [0;1]
N(0,1)	Normal with mu=0, sigma=1
DE(0,1)	Laplace (double exponential) with mu=0, b=1
Cauchy(0,1)	Cauchy distribution with location=0, scale=1
T(2)	Student’s t with 2 degrees of freedom
T(3)	Student’s t with 3 degrees of freedom
T(4)	Student’s t with 4 degrees of freedom
T(5)	Student’s t with 5 degrees of freedom

Below you can find the density plots of these distributions and an animation that shows the relative efficiency for different sample sizes.

Here we can make a few observations:

For light-tailed distributions (beta, uniform, normal), the Harrell-Davis quantile estimator is more efficient than the traditional one for all the percentiles.
For heavy-tailed distributions (Laplace, Cauchy, Student), the Harrell-Davis quantile estimator is more efficient in the middle of the distribution and less efficient on the tails.

Unimodal right-skewed

Next, we consider different right-skewed distributions: exponential, beta (with $a < b$), Gumbel, Pareto, Weibull, Frechet, log-normal:

On the left side of these plots, the Harrell-Davis quantile estimator is always more efficient than the traditional one. However, it’s not so efficient on the tails. The region of the inefficiency depends on the heaviness of the tail and on the sample size.

In the case of distributions close to the light-tailed case (exponential, Gumbel(0, 1), beta, Pareto with large $\alpha$, Weibull with the shape parameter greater than 1, Frechet with the shape parameter greater than 10, log-normal with $\sigma < 0.5$), HDQE is quite efficient. For small sample sizes (e.g., $n=3$), it’s efficient for $p \in [0, 0.5]$. For medium-size samples (e.g., $n=60$), it’s efficient for $p \in [0, 0.85]$.

In the case of heavy-tailed distributions, HDQE is not so efficient. For extremely heavy-tailed cases (e.g. log-normal with $\sigma = 3$), HDQE could be efficient only for $p \in [0, 0.2]$ on samples of any size.

Bimodal right-skewed

For each distribution from the previous section, let’s build a mixture from the original distribution and the original distribution shifted by 10. Thus, we get a set of bimodal right-skewed distributions:

Here we can observe two inefficient areas. The first one is on the tail, as in the previous section. The second one is between modes of the given bimodal distributions.
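The mixture construction used in this section can be sketched as follows. The post does not state the mixture weights, so the sketch assumes equal weights of 0.5; the function names, the bisection bounds, and the Exp(1) base distribution in the demo are illustrative choices. The true quantile $\theta(p)$ has no convenient closed form for most base distributions, so it is found by bisection on the mixture CDF.

```python
import math
import random

def mixture_sample(base_sampler, rng, shift=10.0):
    """Draw from the assumed 50/50 mixture of a base distribution and its shifted copy."""
    x = base_sampler(rng)
    return x + shift if rng.random() < 0.5 else x

def mixture_cdf(base_cdf, x, shift=10.0):
    """CDF of the mixture: the average of the base CDF and the shifted base CDF."""
    return 0.5 * base_cdf(x) + 0.5 * base_cdf(x - shift)

def mixture_quantile(base_cdf, p, lo, hi, shift=10.0, tol=1e-10):
    """True quantile theta(p) of the mixture via bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(base_cdf, mid, shift) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exp_cdf(x):
    """CDF of the standard exponential distribution."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

rng = random.Random(0)
sample = [mixture_sample(lambda r: r.expovariate(1.0), rng) for _ in range(10)]
print(sorted(round(x, 2) for x in sample))
print(mixture_quantile(exp_cdf, 0.25, 0.0, 100.0))  # close to ln 2
print(mixture_quantile(exp_cdf, 0.5, 0.0, 100.0))   # just above 10
```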

Synthetic Latency by Brendan Gregg

Finally, let’s consider a set of synthetic latency distributions by Brendan Gregg. We are going to work with the following distributions (the notation Mix(D1|W1;D2|W2) means a mixture of distribution D1 with weight W1 and distribution D2 with weight W2):

0: uniform narrow
Uniform(500,1500)
1: uniform wide
Uniform(0,3000)
2: uniform outliers
Mix(Uniform(500,1500)|0.99;Uniform(1500,10000)|0.01)
100: unimodal normal narrow
Normal(1000,100^2)
101: unimodal normal medium
Normal(1000,200^2)
102: unimodal normal wide
Normal(1000,300^2)
103: unimodal normal with tail
Mix(Normal(1000,200^2)|0.96;Uniform(1000,2250)|0.04)
104: unimodal normal wide
Normal(1120,700^2)
111: uniform normal spike
Mix(Normal(1000,200^2)|0.98;Normal(750,1^2)|0.02)
130: unimodal pareto narrow
Pareto(1000,3)
131: unimodal pareto wide
Pareto(1000,10)
140: unimodal normal outliers 1% medium
Mix(Normal(1000,200^2)|0.99;Uniform(1000,5000)|0.01)
141: unimodal normal outliers 1% far
Mix(Normal(1000,200^2)|0.99;Uniform(1000,10000)|0.01)
142: unimodal normal outliers 1% very far
Mix(Normal(1000,200^2)|0.99;Uniform(1000,50000)|0.01)
143: unimodal normal outliers 2%
Mix(Normal(1000,200^2)|0.98;Uniform(1000,5000)|0.02)
144: unimodal normal outliers 4%
Mix(Normal(1000,200^2)|0.96;Uniform(1000,5000)|0.04)
145: unimodal normal outliers 2% clustered
Mix(Normal(1000,200^2)|0.98;Normal(3000,35^2)|0.02)
146: unimodal normal outliers 4% close 1
Mix(Normal(1000,200^2)|0.96;Uniform(1000,2700)|0.04)
147: unimodal normal outliers 4% close 2
Mix(Normal(1000,200^2)|0.96;Uniform(1000,2900)|0.04)
148: unimodal normal outliers 4% close 3
Mix(Normal(1000,200^2)|0.96;Uniform(1000,3100)|0.04)
149: unimodal normal outliers 4% close 4
Mix(Normal(1000,200^2)|0.96;Uniform(1000,3300)|0.04)
150: unimodal normal outliers 4% close 5
Mix(Normal(1000,200^2)|0.96;Uniform(1000,3500)|0.04)
151: unimodal normal outliers 4% close 6
Mix(Normal(1000,200^2)|0.96;Uniform(1000,3700)|0.04)
152: unimodal normal outliers 4% close 7
Mix(Normal(1000,200^2)|0.96;Uniform(1000,3900)|0.04)
153: unimodal normal outliers 0.5%
Mix(Normal(1000,200^2)|0.995;Uniform(1000,5000)|0.005)
154: unimodal normal outliers 0.2%
Mix(Normal(1000,200^2)|0.998;Uniform(1000,5000)|0.002)
155: unimodal normal outliers 0.1%
Mix(Normal(1000,200^2)|0.999;Uniform(1000,5000)|0.001)
200: bimodal normal very close
Mix(Normal(850,110^2)|0.5;Normal(1150,110^2)|0.5)
201: bimodal normal close
Mix(Normal(825,110^2)|0.5;Normal(1175,110^2)|0.5)
202: bimodal normal medium
Mix(Normal(750,110^2)|0.5;Normal(1250,110^2)|0.5)
203: bimodal normal far
Mix(Normal(600,110^2)|0.5;Normal(1400,110^2)|0.5)
204: bimodal normal outliers 1%
Mix(Normal(750,110^2)|0.495;Normal(1250,110^2)|0.495;Uniform(1000,5000)|0.01)
205: bimodal normal outliers 2%
Mix(Normal(750,110^2)|0.49;Normal(1250,110^2)|0.49;Uniform(1000,5000)|0.02)
206: bimodal normal outliers 4%
Mix(Normal(750,110^2)|0.48;Normal(1250,110^2)|0.48;Uniform(1000,5000)|0.04)
210: bimodal normal major minor
Mix(Normal(750,110^2)|0.7;Normal(1250,110^2)|0.3)
211: bimodal normal minor major
Mix(Normal(750,110^2)|0.3;Normal(1250,110^2)|0.7)
212: bimodal normal major minor outliers
Mix(Normal(750,110^2)|0.695;Normal(1250,110^2)|0.295;Uniform(1000,5000)|0.01)
213: bimodal normal major minor outliers
Mix(Normal(750,110^2)|0.295;Normal(1250,110^2)|0.695;Uniform(1000,5000)|0.01)
214: bimodal far normal far outliers 1%
Mix(Normal(500,150^2)|0.499;Normal(2000,300^2)|0.499;Uniform(1000,180000)|0.002)
215: bimodal very far normal far outliers 1%
Mix(Normal(500,100^2)|0.499;Normal(4000,500^2)|0.499;Uniform(1000,180000)|0.002)
216: bimodal very far major minor outliers 1%
Mix(Normal(500,100^2)|0.667;Normal(4000,100^2)|0.333;Uniform(1000,180000)|0.002)
300: trimodal normal close
Mix(Normal(750,90^2)|0.333;Normal(1000,90^2)|0.334;Normal(1250,90^2)|0.333)
301: trimodal normal medium
Mix(Normal(500,100^2)|0.333;Normal(1000,100^2)|0.334;Normal(1500,100^2)|0.333)
302: trimodal normal far
Mix(Normal(500,65^2)|0.333;Normal(1000,65^2)|0.334;Normal(1500,65^2)|0.333)
303: trimodal normal outliers
Mix(Normal(500,100^2)|0.333;Normal(1000,100^2)|0.334;Normal(1500,100^2)|0.333;Uniform(1000,5000)|0.01)
304: trimodal normal major medium minor
Mix(Normal(500,100^2)|0.5;Normal(1000,100^2)|0.33;Normal(1500,100^2)|0.17)
305: trimodal normal minor major minor
Mix(Normal(500,100^2)|0.25;Normal(1000,100^2)|0.5;Normal(1500,100^2)|0.25)
306: trimodal normal minor major medium
Mix(Normal(500,100^2)|0.17;Normal(1000,100^2)|0.5;Normal(1500,100^2)|0.33)
307: trimodal normal major minor medium
Mix(Normal(500,100^2)|0.5;Normal(1000,100^2)|0.17;Normal(1500,100^2)|0.33)
400: quad normal close
Mix(Normal(700,75^2)|0.25;Normal(900,75^2)|0.25;Normal(1100,75^2)|0.25;Normal(1300,75^2)|0.25)
401: quad normal medium
Mix(Normal(700,50^2)|0.25;Normal(900,50^2)|0.25;Normal(1100,50^2)|0.25;Normal(1300,50^2)|0.25)
402: quad normal far
Mix(Normal(400,60^2)|0.25;Normal(800,60^2)|0.25;Normal(1200,60^2)|0.25;Normal(1600,60^2)|0.25)
403: quad normal outliers
Mix(Normal(700,50^2)|0.25;Normal(900,50^2)|0.25;Normal(1100,50^2)|0.25;Normal(1300,50^2)|0.25;Uniform(1000,5000)|0.01)
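The Mix(...) notation can be read as a two-stage sampling procedure: first pick a component with probability proportional to its weight, then draw from that component. Below is a minimal sketch under a few assumptions: `Normal(m, s^2)` is taken to mean a normal distribution with standard deviation `s`, weights are normalized by their sum (some listed entries do not sum exactly to one), and the helper names `normal`, `uniform`, and `Mix` are invented for this example.

```python
import random
from statistics import NormalDist

def normal(mean, sd):
    """Component Normal(mean, sd^2) as a (sampler, cdf) pair."""
    return (lambda rng: rng.gauss(mean, sd)), NormalDist(mean, sd).cdf

def uniform(lo, hi):
    """Component Uniform(lo, hi) as a (sampler, cdf) pair."""
    def cdf(x):
        return min(1.0, max(0.0, (x - lo) / (hi - lo)))
    return (lambda rng: rng.uniform(lo, hi)), cdf

class Mix:
    """Finite mixture built from ((sampler, cdf), weight) tuples."""

    def __init__(self, *weighted_parts):
        total = sum(weight for _, weight in weighted_parts)
        self.samplers = [part[0] for part, _ in weighted_parts]
        self.cdfs = [part[1] for part, _ in weighted_parts]
        # Assumption: weights are rescaled so that they sum to one.
        self.weights = [weight / total for _, weight in weighted_parts]

    def sample(self, rng):
        # Stage 1: choose a component; stage 2: draw from it.
        sampler = rng.choices(self.samplers, weights=self.weights)[0]
        return sampler(rng)

    def cdf(self, x):
        return sum(w * cdf(x) for w, cdf in zip(self.weights, self.cdfs))

# Distribution 140: Mix(Normal(1000,200^2)|0.99;Uniform(1000,5000)|0.01)
d140 = Mix((normal(1000, 200), 0.99), (uniform(1000, 5000), 0.01))
rng = random.Random(42)
print([round(d140.sample(rng)) for _ in range(5)])
print(d140.cdf(1000))  # approximately 0.495
```

The mixture CDF is what one would invert (for example, by bisection) to get the true quantile values $\theta(p)$ for these distributions.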

Here are the density and relative efficiency plots:

Here we can observe the same situation as in the previous section: HDQE is inefficient in the tail area of heavy-tailed distributions (with high outliers) and intermodal area of multimodal distributions.

Conclusion

In most simple cases, the Harrell-Davis quantile estimator is more efficient than the traditional one. However, it could be less efficient in areas close to low-density regions, such as the tail areas of heavy-tailed distributions and the intermodal areas of multimodal distributions.

References

[Harrell1982]
Harrell, F.E. and Davis, C.E., 1982. A new distribution-free quantile estimator. Biometrika, 69(3), pp.635-640.
https://doi.org/10.2307/2335999
[Hyndman1996]
Hyndman, R.J. and Fan, Y., 1996. Sample quantiles in statistical packages. The American Statistician, 50(4), pp.361-365.
https://doi.org/10.2307/2684934