Finite-sample efficiency of the Rousseeuw-Croux estimators

by Andrey Akinshin · 2022-08-09
Update: this blog post is part of research that aimed to investigate the finite-sample properties of the Rousseeuw-Croux scale estimators. A preprint with the final results is available on arXiv: arXiv:2209.12268 [stat.ME]. Some information in this blog post may be obsolete: please use the preprint as the primary reference.

The Rousseeuw-Croux $S_n$ and $Q_n$ estimators are robust and efficient measures of scale. Their breakdown point is $0.5$, the same as that of the median absolute deviation (MAD), but their statistical efficiency is much better. Specifically, the asymptotic relative Gaussian efficiency of MAD against the standard deviation is about $37\%$, whereas the corresponding values for $S_n$ and $Q_n$ are $58\%$ and $82\%$ respectively. Although these numbers are quite impressive, they are only asymptotic values. In practice, we work with finite samples, and the finite-sample efficiency can be much lower than the asymptotic one. In this post, we perform a simulation study to obtain the actual finite-sample efficiency values for these two estimators.

Introduction

The $S_n$ and $Q_n$ estimators are presented in [Rousseeuw1993]. For a sample $x = \{ x_1, x_2, \ldots, x_n \}$, they are defined as follows:

$$ S_n = c_n \cdot 1.1926 \cdot \operatorname{lowmed}_i \; \operatorname{highmed}_j \; |x_i - x_j|, $$

$$ Q_n = d_n \cdot 2.2191 \cdot \{ |x_i-x_j|; i < j \}_{(k)}, $$

where

  • $\operatorname{lowmed}$ is the $\lfloor (n+1) / 2 \rfloor^\textrm{th}$ order statistic out of $n$ numbers,
  • $\operatorname{highmed}$ is the $(\lfloor n / 2 \rfloor + 1)^\textrm{th}$ order statistic out of $n$ numbers,
  • $c_n$, $d_n$ are bias-correction factors for finite samples (some approximations can be found in [Rousseeuw1993]),
  • $k = \binom{\lfloor n / 2 \rfloor + 1}{2}$,
  • ${}_{(k)}$ is the $k^\textrm{th}$ order statistic.
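
The definitions above can be transcribed almost literally into code. The following is a minimal Python sketch, not the fast algorithms from the literature: it uses the naive quadratic approach, and the function names (`low_median`, `high_median`, `sn_naive`, `qn_naive`) are hypothetical. The finite-sample factors $c_n$ and $d_n$ are not reproduced here; they are passed as parameters and default to $1$ as an assumption.

```python
import math


def low_median(values):
    """Return the floor((n+1)/2)-th order statistic (1-based)."""
    ordered = sorted(values)
    return ordered[(len(ordered) + 1) // 2 - 1]


def high_median(values):
    """Return the (floor(n/2)+1)-th order statistic (1-based)."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def sn_naive(x, c_n=1.0):
    """Naive S_n. c_n=1.0 is an assumption: no finite-sample correction."""
    # For each x_i, take the high median of |x_i - x_j| over all j (including j = i).
    inner = [high_median([abs(xi - xj) for xj in x]) for xi in x]
    return c_n * 1.1926 * low_median(inner)


def qn_naive(x, d_n=1.0):
    """Naive Q_n. d_n=1.0 is an assumption: no finite-sample correction."""
    n = len(x)
    # All pairwise distances |x_i - x_j| for i < j, sorted ascending.
    diffs = sorted(abs(x[i] - x[j]) for i in range(n) for j in range(i + 1, n))
    k = math.comb(n // 2 + 1, 2)
    return d_n * 2.2191 * diffs[k - 1]  # k-th order statistic (1-based)
```

For the sample $\{1, 2, 3, 4, 5\}$, both the low median of the inner high medians and the $k^\textrm{th}$ pairwise distance ($k = 3$) equal $1$, so the sketch returns the bare constants $1.1926$ and $2.2191$.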

In this post, we also use the unbiased standard deviation which is defined as follows:

$$ s_n = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} \bigg/ c_4(n), \quad c_4(n) = \sqrt{\frac{2}{n-1}}\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}. $$
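
As an illustration, here is a small Python sketch of $c_4(n)$ and the resulting estimator. It assumes the usual $n-1$ (Bessel) denominator under the square root, since that is the sample standard deviation for which $c_4(n)$ is the normal-theory bias factor. The function names `c4` and `unbiased_sd` are hypothetical; the log-gamma function is used to avoid overflow for large $n$.

```python
import math


def c4(n):
    """c_4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), computed via log-gamma."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def unbiased_sd(x):
    """Sample standard deviation (n-1 denominator, an assumption) divided by c_4(n)."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return s / c4(n)
```

For example, $c_4(2) = \sqrt{2/\pi} \approx 0.798$ and $c_4(3) = \sqrt{\pi}/2 \approx 0.886$; the factor approaches $1$ as $n$ grows, so the correction matters mostly for small samples.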

The median absolute deviation uses the bias-correction factors presented in [Akinshin2022].

Relative efficiency

In order to evaluate the relative statistical efficiency against the standard deviation under normality, we perform the following simulation study:

  • Enumerate various sample sizes $n$ between $3$ and $200$.
  • For each sample size $n$, generate $300\,000$ random samples from the standard normal distribution.
  • For each sample, estimate $s_n$, $\operatorname{MAD}_n$, $S_n$, $Q_n$.
  • For each $n$, calculate the relative statistical efficiency of $\operatorname{MAD}_n$, $S_n$, $Q_n$ against $s_n$ as $e(T_n) = \mathbb{V}(s_n) / \mathbb{V}(T_n)$.
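
The procedure above can be sketched in a few lines of standard-library Python. This is a scaled-down illustration under stated assumptions, not the code behind the reported numbers: it uses far fewer iterations, it covers only the MAD, and instead of precomputed bias-correction factors it rescales each estimator by its simulated mean (that is, it compares standardized variances $\mathbb{V}(T_n) / \mathbb{E}(T_n)^2$). The names `mad` and `relative_efficiency` are hypothetical.

```python
import random
import statistics


def mad(x):
    """Median absolute deviation around the median, without a scale constant."""
    med = statistics.median(x)
    return statistics.median([abs(xi - med) for xi in x])


def relative_efficiency(estimator, baseline, n, iterations, rng):
    """Estimate V(baseline) / V(estimator) on standard normal samples of size n.

    Assumption: each estimator is rescaled by its simulated mean, which
    stands in for the finite-sample bias-correction factors.
    """
    est_values, base_values = [], []
    for _ in range(iterations):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        est_values.append(estimator(sample))
        base_values.append(baseline(sample))

    def standardized_variance(values):
        mean = statistics.fmean(values)
        return statistics.pvariance(values, mean) / mean ** 2

    return standardized_variance(base_values) / standardized_variance(est_values)


rng = random.Random(42)  # fixed seed so that the run is reproducible
eff = relative_efficiency(mad, statistics.stdev, n=20, iterations=5000, rng=rng)
print(f"Relative efficiency of MAD at n=20: {eff:.3f}")
```

Scale constants such as $1.4826$ or $c_4(n)$ cancel out in the standardized variance, which is why they are omitted here. With only a few thousand iterations the estimate is noisy, so it should be treated as a rough check rather than a replacement for the full study.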

Here is the plot with results for $n \leq 30$:

Here is the plot with all the results:

And here is a table with raw results:

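| $n$ | $\operatorname{MAD}_n$ | $S_n$ | $Q_n$ |
|----:|--------:|--------:|--------:|
| 3 | 0.40056 | 0.39961 | 0.40050 |
| 4 | 0.54508 | 0.61005 | 0.60883 |
| 5 | 0.38591 | 0.43430 | 0.45723 |
| 6 | 0.46360 | 0.53221 | 0.61139 |
| 7 | 0.37928 | 0.47172 | 0.50996 |
| 8 | 0.43342 | 0.51068 | 0.62214 |
| 9 | 0.37609 | 0.50004 | 0.55606 |
| 10 | 0.41836 | 0.51644 | 0.63429 |
| 11 | 0.37439 | 0.52677 | 0.58308 |
| 12 | 0.40849 | 0.51998 | 0.64626 |
| 13 | 0.37316 | 0.53525 | 0.60392 |
| 14 | 0.40201 | 0.52536 | 0.65607 |
| 15 | 0.37222 | 0.54263 | 0.62221 |
| 16 | 0.39739 | 0.53126 | 0.66562 |
| 17 | 0.37139 | 0.54889 | 0.63675 |
| 18 | 0.39366 | 0.53715 | 0.67473 |
| 19 | 0.37107 | 0.55365 | 0.65006 |
| 20 | 0.39082 | 0.54198 | 0.68261 |
| 21 | 0.37103 | 0.55860 | 0.66187 |
| 22 | 0.38881 | 0.54660 | 0.69007 |
| 23 | 0.37045 | 0.56210 | 0.67145 |
| 24 | 0.38668 | 0.55075 | 0.69604 |
| 25 | 0.37021 | 0.56561 | 0.68085 |
| 26 | 0.38536 | 0.55466 | 0.70290 |
| 27 | 0.36995 | 0.56805 | 0.68823 |
| 28 | 0.38350 | 0.55716 | 0.70705 |
| 29 | 0.36956 | 0.57074 | 0.69561 |
| 30 | 0.38197 | 0.55959 | 0.71301 |
| 31 | 0.36961 | 0.57241 | 0.70165 |
| 32 | 0.38113 | 0.56236 | 0.71696 |
| 33 | 0.36911 | 0.57396 | 0.70695 |
| 34 | 0.38071 | 0.56534 | 0.72204 |
| 35 | 0.36918 | 0.57624 | 0.71234 |
| 36 | 0.37966 | 0.56656 | 0.72613 |
| 37 | 0.36942 | 0.57783 | 0.71760 |
| 38 | 0.37876 | 0.56863 | 0.72929 |
| 39 | 0.36880 | 0.57860 | 0.72118 |
| 40 | 0.37813 | 0.57007 | 0.73307 |
| 41 | 0.36864 | 0.57944 | 0.72553 |
| 42 | 0.37778 | 0.57167 | 0.73592 |
| 43 | 0.36902 | 0.58136 | 0.72991 |
| 44 | 0.37771 | 0.57314 | 0.73867 |
| 45 | 0.36872 | 0.58165 | 0.73254 |
| 46 | 0.37698 | 0.57415 | 0.74231 |
| 47 | 0.36845 | 0.58213 | 0.73608 |
| 48 | 0.37643 | 0.57513 | 0.74428 |
| 49 | 0.36810 | 0.58270 | 0.73894 |
| 50 | 0.37673 | 0.57648 | 0.74764 |
| 51 | 0.36879 | 0.58407 | 0.74207 |
| 60 | 0.37452 | 0.57965 | 0.75597 |
| 61 | 0.36842 | 0.58606 | 0.75376 |
| 70 | 0.37376 | 0.58194 | 0.76447 |
| 71 | 0.36843 | 0.58793 | 0.76228 |
| 80 | 0.37277 | 0.58361 | 0.77094 |
| 81 | 0.36827 | 0.58799 | 0.76872 |
| 90 | 0.37247 | 0.58455 | 0.77612 |
| 91 | 0.36790 | 0.58805 | 0.77317 |
| 100 | 0.37184 | 0.58492 | 0.77968 |
| 101 | 0.36814 | 0.58838 | 0.77800 |
| 110 | 0.37137 | 0.58529 | 0.78308 |
| 111 | 0.36806 | 0.58864 | 0.78142 |
| 120 | 0.37093 | 0.58538 | 0.78582 |
| 121 | 0.36853 | 0.58878 | 0.78514 |
| 130 | 0.37063 | 0.58551 | 0.78816 |
| 131 | 0.36800 | 0.58835 | 0.78685 |
| 140 | 0.37080 | 0.58553 | 0.79057 |
| 141 | 0.36793 | 0.58861 | 0.78943 |
| 150 | 0.37045 | 0.58567 | 0.79202 |
| 151 | 0.36798 | 0.58804 | 0.79157 |
| 160 | 0.37040 | 0.58572 | 0.79443 |
| 161 | 0.36817 | 0.58822 | 0.79363 |
| 170 | 0.37031 | 0.58572 | 0.79571 |
| 171 | 0.36781 | 0.58766 | 0.79449 |
| 180 | 0.37001 | 0.58590 | 0.79686 |
| 181 | 0.36751 | 0.58749 | 0.79582 |
| 190 | 0.36977 | 0.58513 | 0.79752 |
| 191 | 0.36747 | 0.58678 | 0.79695 |
| 200 | 0.36960 | 0.58556 | 0.79917 |
| 201 | 0.36806 | 0.58749 | 0.79857 |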

Conclusion

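As we can see, on small samples the relative efficiency of $S_n$ and $Q_n$ is still better than that of $\operatorname{MAD}_n$ (except at $n = 3$, where all three values are essentially equal at about $40\%$), but it is noticeably lower than the asymptotic values. For example, at $n = 10$ the efficiency of $Q_n$ is about $63\%$ against the asymptotic $82\%$, and the efficiency of $S_n$ is about $52\%$ against the asymptotic $58\%$.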

References

  • [Rousseeuw1992]
    Croux, Christophe, and Peter J. Rousseeuw. “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale.” In Computational Statistics, edited by Yadolah Dodge and Joe Whittaker, 411–28. Heidelberg: Physica-Verlag HD, 1992.
    https://doi.org/10.1007/978-3-662-26811-7_58
  • [Rousseeuw1993]
    Rousseeuw, Peter J., and Christophe Croux. “Alternatives to the Median Absolute Deviation.” Journal of the American Statistical Association 88, no. 424 (December 1, 1993): 1273–83.
    https://doi.org/10.1080/01621459.1993.10476408
  • [Akinshin2022]
    Akinshin, Andrey. “Finite-Sample Bias-Correction Factors for the Median Absolute Deviation Based on the Harrell-Davis Quantile Estimator and Its Trimmed Modification.” arXiv:2207.12005, 2022.