Unbiased median absolute deviation based on the Harrell-Davis quantile estimator



The median absolute deviation (\(\textrm{MAD}\)) is a robust measure of scale. In the previous post, I showed how to use the unbiased version of the \(\textrm{MAD}\) estimator as a robust alternative to the standard deviation. “Unbiasedness” means that such estimator’s expected value equals the true value of the standard deviation. Unfortunately, there is such thing as the bias–variance tradeoff: when we remove the bias of the \(\textrm{MAD}\) estimator, we increase its variance and mean squared error (\(\textrm{MSE}\)).

In this post, I want to suggest a more efficient unbiased \(\textrm{MAD}\) estimator. It’s also a consistent estimator for the standard deviation, but it has smaller \(\textrm{MSE}\). To build this estimator, we should replace the classic “straightforward” median estimator with the Harrell-Davis quantile estimator and adjust bias-correction factors. Let’s discuss this approach in detail.

Introduction

Let’s consider a sample \(x = \{ x_1, x_2, \ldots, x_n \}\). Its median absolute deviation \(\textrm{MAD}_n\) can be defined as follows:

\[\textrm{MAD}_n = C_n \cdot \textrm{median}(|X - \textrm{median}(X)|) \]

where \(\textrm{median}\) is a median estimator, \(C_n\) is a scale factor (or consistency constant).

Typically, \(\textrm{median}\) assumes the classic “straightforward” median estimator:

  • If \(n\) is odd, the median is the middle element of the sorted sample
  • If \(n\) is even, the median is the arithmetic average of the two middle elements of the sorted sample

However, we can use other median estimators. Let’s consider \(\textrm{median}_{\textrm{HD}}\) which calculates the median using the Harrell-Davis quantile estimator (see [Harrell1982]):

\[\textrm{median}_{\textrm{HD}}(x) = \sum_{i=1}^n W_i x_i, \quad W_i = I_{i/n} \bigg( \frac{n+1}{2}, \frac{n+1}{2} \bigg) - I_{(i-1)/n} \bigg( \frac{n+1}{2}, \frac{n+1}{2} \bigg) \]

where \(I_t(a, b)\) denotes the regularized incomplete beta function.

When \(n \to \infty\), we have the exact value of \(C_n\) which makes \(\textrm{MAD}_n\) an unbiased estimator for the standard deviation:

\[C_n = C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056 \]

For finite values of \(n\), we should use adjusted \(C_n\) values. We already discussed these values for the straightforward median estimator in the previous post (this problem is well-covered in [Park2020]). For \(\textrm{median}_{\textrm{HD}}\), we should use another set of \(C_n\) values. We reuse the approach from [Hayes2014] with the following notation:

\[C_n = \dfrac{1}{\hat{a}_n} \]

Factors for small n

Factors for the small values of \(n\) can be obtained using numerical experiments. Here is the scheme for the performed simulation:

  • Enumerate \(n\) from \(2\) to \(100\)
  • For each \(n\), generate \(2\cdot 10^8\) samples from the normal distribution of size \(n\) (the Box–Muller transform is used, see [Box1958])
  • For each sample, estimate \(\textrm{MAD}_n\) using \(C_n = 1\)
  • Calculate the arithmetic average of all \(\textrm{MAD}_n\) values for the same \(n\): the result is the value of \(\hat{a}_n\)

As the result of this simulation, I got the following table with \(\hat{a}_n\) and \(C_n\) values for \(n \in \{ 2..100 \}\).

\(n\)\(\hat{a}_n\)\(C_n\)\(n\)\(\hat{a}_n\)\(C_n\)
1NANA510.666491.50039
20.564171.77250520.666681.49998
30.637691.56816530.666821.49966
40.626611.59589540.666991.49926
50.638531.56611550.667131.49895
60.638341.56656560.667281.49863
70.639151.56458570.667411.49833
80.641411.55908580.667531.49805
90.642371.55675590.667671.49774
100.643971.55288600.667801.49746
110.645351.54955610.667911.49720
120.646621.54651620.668031.49694
130.647901.54346630.668151.49667
140.649081.54064640.668251.49644
150.650181.53803650.668361.49621
160.651251.53552660.668461.49597
170.652261.53313670.668571.49574
180.653171.53101680.668651.49555
190.654041.52896690.668761.49531
200.654891.52698700.668831.49514
210.655651.52520710.668931.49493
220.656381.52351720.669011.49475
230.657081.52190730.669101.49456
240.657711.52043740.669181.49437
250.658321.51902750.669251.49422
260.658881.51772760.669331.49402
270.659431.51647770.669401.49387
280.659911.51536780.669481.49370
290.660361.51433790.669551.49354
300.660821.51328800.669621.49339
310.661231.51233810.669681.49325
320.661611.51146820.669741.49312
330.662001.51057830.669801.49298
340.662351.50977840.669881.49281
350.662701.50899850.669931.49270
360.663021.50824860.669991.49257
370.663341.50753870.670051.49244
380.663621.50688880.670091.49233
390.663911.50623890.670161.49219
400.664171.50563900.670211.49207
410.664431.50504910.670261.49196
420.664691.50447920.670311.49185
430.664931.50393930.670361.49174
440.665151.50341940.670411.49161
450.665391.50289950.670461.49152
460.665571.50246960.670491.49144
470.665781.50200970.670551.49131
480.665981.50155980.670601.49121
490.666161.50115990.670631.49114
500.666331.500761000.670681.49102

Here is a visualization of this table:


Factors for huge n

To build the equation for \(n > 100\), we also continue the approach suggested in [Hayes2014] (pages 2208-2209) and use the prediction equation of the following form:

\[\hat{a}_n = \Phi^{-1}(3/4) \Bigg( 1 - \dfrac{\alpha}{n} - \dfrac{\beta}{n^2} \Bigg). \]

We simulated approximations of \(\hat{a}_n\) for some large \(n\) values:

\(n\)\(\hat{a}_n\)
1000.6707
1100.6711
1200.6714
1300.6716
1400.6719
1500.6720
2000.6727
2500.6731
3000.6733
3500.6735
4000.6736
4500.6737
5000.6738
10000.6742
15000.6743
20000.6743

Next, we fitted these values using the multiple linear regression. The dependent variable is \(y = 1 - \hat{a}_n / \Phi^{-1}(3/4)\); the independent variables are \(n^{-1}\) and \(n^{-2}\); the intercept is zero:

\[y = \alpha n^{-1} + \beta n^{-2}. \]

The fitted model was adjusted a little bit to get nice-looking values (keeping the good accuracy level):

\[\alpha = 0.5,\quad \beta = 6.5. \]

These values give pretty accurate estimation of \(\hat{a}_n\) for \(n > 100\) (the accuracy is less than \(10^{-4}\)).


The mean squared error

It’s time to compare the straightforward and the Harrell-Davis-based \(\textrm{MAD}\) estimators in terms of the mean squared error. To estimate the \(\textrm{MSE}\), we also use numerical simulations. Here are the results for some \(n\) values:

nStraightforwardHarrell-Davis
30.6900.272
40.3270.205
50.3410.181
100.1360.100
200.0690.055
300.0450.038
400.0340.029
500.0270.024
1000.0140.012

As we can see, the difference between estimators for small samples is tangible. For example, for \(n = 3\), we got \(\textrm{MSE} \approx 0.69\) for the straightforward estimator vs. \(\textrm{MSE} \approx 0.27\) for the Harrell-Davis-based estimator. Below you can see density plots of \(\textrm{MSE}\) for the above \(n\) values.










Conclusion

In this post, we discussed the unbiased \(\textrm{MAD}_n\) estimator which is based on the Harrell-Davis quantile estimator instead of the straightforward median estimator. The suggested estimator is much more efficient for small samples because it has smaller \(\textrm{MSE}\). It allows using \(\textrm{MAD}_n\) as a efficient alternative to the standard deviation under the normal distribution with higher accuracy.

References

  • [Harrell1982]
    Harrell, F.E. and Davis, C.E., 1982. A new distribution-free quantile estimator. Biometrika, 69(3), pp.635-640.
    https://doi.org/10.2307/2335999
  • [Hayes2014]
    Hayes, Kevin. “Finite-sample bias-correction factors for the median absolute deviation.” Communications in Statistics-Simulation and Computation 43, no. 10 (2014): 2205-2212.
    https://doi.org/10.1080/03610918.2012.748913
  • [Park2020]
    Park, Chanseok, Haewon Kim, and Min Wang. “Investigation of finite-sample properties of robust location and scale estimators.” Communications in Statistics-Simulation and Computation (2020): 1-27.
    https://doi.org/10.1080/03610918.2019.1699114
  • [Box1958]
    Box, George EP. “A note on the generation of random normal deviates.” Ann. Math. Statist. 29 (1958): 610-611.
    https://doi.org/10.1214/aoms/1177706645