Finite-sample bias correction factors for Rousseeuw-Croux scale estimators

Andrey Akinshin · 2022-09-06
Update: this blog post is a part of research that aimed to investigate finite-sample properties of the Rousseeuw-Croux scale estimators. A preprint with final results is available on arXiv: arXiv:2209.12268 [stat.ME]. Some information in this blog post can be obsolete: please, use the preprint as the primary reference.

The Rousseeuw-Croux scale estimators $S_n$ and $Q_n$ are efficient alternatives to the median absolute deviation ($\operatorname{MAD}_n$). While all three estimators have the same breakdown point of $50\%$, $S_n$ and $Q_n$ have higher statistical efficiency than $\operatorname{MAD}_n$. The asymptotic Gaussian efficiency values of $\operatorname{MAD}_n$, $S_n$, and $Q_n$ are $37\%$, $58\%$, and $82\%$ respectively.

Using scale constants, we can make $S_n$ and $Q_n$ consistent estimators for the standard deviation under normality. The asymptotic values of these constants are well-known. However, for finite-samples, only approximated scale constants are known. In this post, we provide refined values of these constants with higher accuracy.

Introduction

The $S_n$ and $Q_n$ estimators are presented in rousseeuw1993. For a sample $x = \{ x_1, x_2, \ldots, x_n \}$, they are defined as follows:

$$ \newcommand{\Sn}{\operatorname{S}_n} \newcommand{\Qn}{\operatorname{Q}_n} \newcommand{\MAD}{\operatorname{MAD}} \newcommand{\med}{\operatorname{med}} \newcommand{\lomed}{\operatorname{lomed}} \newcommand{\himed}{\operatorname{himed}} S_n = c_n \cdot 1.1926 \cdot \operatorname{lowmed}_i \; \operatorname{highmed}_j \; |x_i - x_j|, $$$$ Q_n = d_n \cdot 2.2191 \cdot \{ |x_i-x_j|; i < j \}_{(k)}, $$

where

  • $\operatorname{lowmed}$ is the $\lfloor (n+1) / 2 \rfloor^\textrm{th}$ order statistic out of $n$ numbers,
  • $\operatorname{highmed}$ is the $(\lfloor n / 2 \rfloor + 1)^\textrm{th}$ order statistic out of $n$ numbers,
  • $c_n$, $d_n$ are bias-correction factors for finite samples (some approximation can be found in [Rousseeuw1993]),
  • $k = \binom{\lfloor n / 2 \rfloor + 1}{2}$,
  • ${}_{(k)}$ is the $k^\textrm{th}$ order statistic.

In croux1992), rough approximations of $c_n$ and $d_n$ are given. The values for $n \leq 9$ are presented in the following table:

n$c_n$$d_n$
20.7430.399
31.8510.994
40.9540.512
51.3510.844
60.9930.611
71.1980.857
81.0050.669
91.1310.872

For $n \geq 10$, they suggested using the following prediction equations:

$$ c_n = \frac{n}{n - 0.9}, \quad \textrm{for odd}\; n, $$$$ c_n = 1, \quad \textrm{for even}\; n, $$$$ d_n = \frac{n}{n + 1.4}, \quad \textrm{for odd}\; n, $$$$ d_n = \frac{n}{n + 3.8}, \quad \textrm{for even}\; n. $$

In the R package robustbase, adjusted values of $d_n$ are used:

n$d_n$
20.399356
30.99365
40.51321
50.84401
60.61220
70.85877
80.66993
90.87344
100.72014
110.88906
120.75743

For $n > 12$, the following prediction equations are used:

$$ d_n = \big( 1 + 1.60188 n^{-1} - 2.1284 n^{-2} - 5.172 n^{-3} \big)^{-1}, \quad \textrm{for odd}\; n, $$$$ d_n = \big( 1 + 3.67561 n^{-1} + 1.9654 n^{-2} + 6.987 n^{-3} - 77 n^{-4} \big)^{-1}, \quad \textrm{for even}\; n. $$

Refined bias-correction factors

In order to obtain refined values of the bias-correction factors, we perform an extensive Monte-Carlo simulations. Here are the raw results for $n \leq 100$:

n$c_n$$d_n$
20.743030.39954
31.849830.99386
40.955050.51333
51.348570.84412
60.994130.61224
71.198320.85886
81.004960.67000
91.131780.87359
101.006890.72007
111.095920.88902
121.006350.75748
131.074230.90232
141.005130.78551
151.060060.91248
161.003840.80779
171.050060.92106
181.002810.82600
191.042970.92793
201.002190.84105
211.037380.93380
221.001390.85367
231.033110.93894
241.000910.86441
251.029690.94303
261.000660.87372
271.026860.94680
281.000450.88186
291.024490.95009
301.000050.88901
311.022600.95304
320.999950.89531
331.020870.95566
340.999740.90099
351.019500.95789
360.999780.90600
371.018300.96004
380.999600.91061
391.017170.96192
400.999690.91480
411.016190.96361
420.999600.91852
431.015380.96522
440.999550.92200
451.014600.96668
460.999600.92515
471.013910.96802
480.999480.92809
491.013240.96923
500.999530.93085
511.012640.97040
520.999540.93334
531.012280.97147
540.999490.93566
551.011750.97237
560.999500.93781
571.011270.97328
580.999550.93985
591.010900.97421
600.999590.94180
611.010540.97496
620.999540.94355
631.010230.97573
640.999630.94525
651.009880.97648
660.999680.94687
671.009510.97710
680.999590.94837
691.009230.97773
700.999660.94978
711.009020.97837
720.999650.95112
731.008770.97891
740.999640.95235
751.008510.97944
760.999660.95359
771.008350.97999
780.999680.95472
791.008100.98049
800.999660.95579
811.007900.98090
820.999700.95677
831.007650.98138
840.999700.95781
851.007620.98179
860.999680.95871
871.007400.98216
880.999720.95967
891.007230.98255
900.999730.96051
911.007050.98295
920.999740.96139
931.006890.98329
940.999740.96212
951.006740.98363
960.999780.96294
971.006610.98399
980.999730.96364
991.006500.98430
1000.999820.96438

For $n > 100$, we suggest using the following prediction equations:

$$ c_n = 1 + 0.7096 n^{-1} - 7.3604 n^{-2}, \quad \textrm{for odd}\; n, $$$$ c_n = 1 + 0.0391 n^{-1} - 6.1719 n^{-2}, \quad \textrm{for even}\; n, $$$$ d_n = 1 - 1.6022 n^{-1} + 4.7453 n^{-2}, \quad \textrm{for odd}\; n, $$$$ d_n = 1 -3.6741 n^{-1} + 11.1030 n^{-2}, \quad \textrm{for even}\; n. $$

Here are the corresponding plots:

References

  • [Croux1992]
    Croux, Christophe, and Peter J. Rousseeuw. “Time-Efficient Algorithms for Two Highly Robust Estimators of Scale.” In Computational Statistics, edited by Yadolah Dodge and Joe Whittaker, 411–28. Heidelberg: Physica-Verlag HD, 1992.
    https://doi.org/10.1007/978-3-662-26811-7_58.
  • [Rousseeuw1993]
    Rousseeuw, Peter J., and Christophe Croux. “Alternatives to the Median Absolute Deviation.” Journal of the American Statistical Association 88, no. 424 (December 1, 1993): 1273–83.
    https://doi.org/10.1080/01621459.1993.10476408.