Posts / Comparing the efficiency of the Harrell-Davis, Sfakianakis-Verginis, and Navruz-Özdemir quantile estimators


In the previous posts, I discussed the statistical efficiency of different quantile estimators (Efficiency of the Harrell-Davis quantile estimator and Efficiency of the winsorized and trimmed Harrell-Davis quantile estimators).

In this post, I continue this research and compare the efficiency of the Harrell-Davis quantile estimator, the Sfakianakis-Verginis quantile estimators, and the Navruz-Özdemir quantile estimator.

Simulation design

The relative efficiency value depends on five parameters:

In this case study, we are going to compare the following target quantile estimators:

(1) The Harrell-Davis (HD) quantile estimator (harrell1982):

$$ Q_\textrm{HD}(p) = \sum_{i=1}^{n} W_{i} \cdot x_{(i)}, \quad W_{i} = I_{i/n}(a, b) - I_{(i-1)/n}(a, b), \quad a = p(n+1),\; b = (1-p)(n+1) $$

where $I_t(a, b)$ denotes the regularized incomplete beta function, and $x_{(i)}$ is the $i^\textrm{th}$ order statistic.
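A minimal Python sketch of this definition, using scipy's `beta.cdf` for the regularized incomplete beta function (the function name is illustrative; it requires $0 < p < 1$ so that the beta parameters are positive):

```python
import numpy as np
from scipy.stats import beta

def harrell_davis(x, p):
    """Harrell-Davis estimate of the p-th quantile (0 < p < 1).

    The weights W_i are increments of the regularized incomplete
    beta function I_{i/n}(a, b) with a = p(n+1), b = (1-p)(n+1).
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    cdf = beta.cdf(np.arange(n + 1) / n, a, b)  # I_{i/n}(a, b) for i = 0..n
    w = np.diff(cdf)                            # W_i = I_{i/n} - I_{(i-1)/n}
    return float(w @ x)
```

For a symmetric sample and $p = 0.5$, the weights are symmetric, so the estimate coincides with the sample midpoint.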

(2) The Sfakianakis-Verginis (SV) quantile estimators (sfakianakis2008):

$$ \begin{split} Q_\textrm{SV1}(p) =& \frac{B_0}{2} \big( X_{(1)}+X_{(2)}-X_{(3)} \big) + \sum_{i=1}^{n} \frac{B_i+B_{i-1}}{2} X_{(i)} + \frac{B_n}{2} \big(- X_{(n-2)}+X_{(n-1)}+X_{(n)} \big),\\ Q_\textrm{SV2}(p) =& \sum_{i=1}^{n} B_{i-1} X_{(i)} + B_n \cdot \big(2X_{(n)} - X_{(n-1)}\big),\\ Q_\textrm{SV3}(p) =& \sum_{i=1}^n B_i X_{(i)} + B_0 \cdot \big(2X_{(1)}-X_{(2)}\big). \end{split} $$

where $B_i = B(i; n, p)$ is the probability mass function of the binomial distribution $B(n, p)$, and $X_{(i)}$ are the order statistics of the sample $X$.

(3) The Navruz-Özdemir (NO) quantile estimator (navruz2020):

$$ \begin{split} Q_\textrm{NO}(p) = & \Big( (3p-1)X_{(1)} + (2-3p)X_{(2)} - (1-p)X_{(3)} \Big) B_0 +\\ & +\sum_{i=1}^n \Big((1-p)B_{i-1}+pB_i\Big)X_{(i)} +\\ & +\Big( -pX_{(n-2)} + (3p-1)X_{(n-1)} + (2-3p)X_{(n)} \Big) B_n \end{split} $$

where $B_i = B(i; n, p)$ is the probability mass function of the binomial distribution $B(n, p)$, and $X_{(i)}$ are the order statistics of the sample $X$.
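A corresponding sketch for the NO estimator (function name illustrative; as with SV, the boundary terms require $n \geq 3$):

```python
import numpy as np
from scipy.stats import binom

def navruz_ozdemir(x, p):
    """Navruz-Özdemir quantile estimate (n >= 3)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    B = binom.pmf(np.arange(n + 1), n, p)  # B_i = B(i; n, p), i = 0..n
    left = ((3 * p - 1) * x[0] + (2 - 3 * p) * x[1] - (1 - p) * x[2]) * B[0]
    mid = np.sum(((1 - p) * B[0:n] + p * B[1:n + 1]) * x)  # ((1-p)B_{i-1} + p B_i) X_(i)
    right = (-p * x[n - 3] + (3 * p - 1) * x[n - 2] + (2 - 3 * p) * x[n - 1]) * B[n]
    return float(left + mid + right)
```

Note that at $p = 0.5$ the middle sum reduces to SV1's middle term $\sum (B_{i-1}+B_i)/2 \cdot X_{(i)}$.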

The conventional baseline quantile estimator in such simulations is the traditional quantile estimator that is defined as a linear combination of two subsequent order statistics. To be more specific, we are going to use the Type 7 quantile estimator from the Hyndman-Fan classification or HF7 (hyndman1996). It can be expressed as follows (assuming one-based indexing):

$$ Q_\textrm{HF7}(p) = x_{(\lfloor h \rfloor)}+(h-\lfloor h \rfloor)\big(x_{(\lfloor h \rfloor+1)}-x_{(\lfloor h \rfloor)}\big),\quad h = (n-1)p+1. $$
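Since numpy's default `np.quantile` implements exactly this linear-interpolation rule, a sketch of HF7 can be validated against it (function name illustrative; the code uses zero-based indexing, so $h$ is shifted by one):

```python
import numpy as np

def hf7(x, p):
    """Type 7 (Hyndman-Fan) quantile: linear interpolation between
    the two order statistics surrounding h = (n - 1) p + 1."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    h = (n - 1) * p                # zero-based position, h_code = h - 1
    lo = int(np.floor(h))
    hi = min(lo + 1, n - 1)        # clamp at the last order statistic (p = 1)
    return float(x[lo] + (h - lo) * (x[hi] - x[lo]))
```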

Thus, we are going to estimate the relative efficiency of the HD, SV1, SV2, SV3, and NO quantile estimators relative to the traditional quantile estimator HF7. For the $p^\textrm{th}$ quantile, the relative efficiency of the target quantile estimator $Q_\textrm{Target}$ can be calculated as the ratio of the estimator mean squared errors ($\textrm{MSE}$):

$$ \textrm{Efficiency}(p) = \dfrac{\textrm{MSE}(Q_\textrm{HF7}, p)}{\textrm{MSE}(Q_\textrm{Target}, p)} = \dfrac{\operatorname{E}[(Q_\textrm{HF7}(p) - \theta(p))^2]}{\operatorname{E}[(Q_\textrm{Target}(p) - \theta(p))^2]} $$

where $\theta(p)$ is the true value of the $p^\textrm{th}$ quantile. The $\textrm{MSE}$ value depends on the sample size $n$, so it should be calculated independently for each sample size value.
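This ratio can be estimated by Monte-Carlo simulation; a minimal sketch, where the function and parameter names are illustrative rather than taken from the original simulation code:

```python
import numpy as np

rng = np.random.default_rng(1729)

def relative_efficiency(estimator, baseline, sampler, true_q, n, p, m=10_000):
    """Monte-Carlo estimate of MSE(baseline) / MSE(estimator) for the
    p-th quantile: both estimators are evaluated on the same m samples
    of size n drawn by `sampler`, against the true quantile `true_q`."""
    err_b = np.empty(m)
    err_t = np.empty(m)
    for k in range(m):
        x = sampler(n)
        err_b[k] = (baseline(x, p) - true_q) ** 2
        err_t[k] = (estimator(x, p) - true_q) ** 2
    return float(err_b.mean() / err_t.mean())
```

Evaluating both estimators on the same samples reduces the variance of the estimated ratio; an estimator compared with itself yields an efficiency of exactly 1.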

Finally, we should choose the distributions for sample generation. I decided to use 4 light-tailed distributions and 4 heavy-tailed distributions:

| Distribution | Description |
|--------------|-------------|
| Beta(2,10) | Beta distribution with a=2, b=10 |
| U(0,1) | Uniform distribution on [0;1] |
| N(0,1^2) | Normal distribution with mu=0, sigma=1 |
| Weibull(1,2) | Weibull distribution with scale=1, shape=2 |
| Cauchy(0,1) | Cauchy distribution with location=0, scale=1 |
| Pareto(1, 0.5) | Pareto distribution with xm=1, alpha=0.5 |
| LogNormal(0,3^2) | Log-normal distribution with mu=0, sigma=3 |
| Exp(1) + Outliers | 95% of exponential distribution with rate=1 and 5% of uniform distribution on [0;10000] |

Here are the probability density functions of these distributions:

For each distribution, we are going to do the following:

Here are the results of the simulation:

Here are the static charts for some of the $n$ values:

Conclusion

Based on the above simulation, we can make the following observations:

References
