# Unbiased median absolute deviation based on the Harrell-Davis quantile estimator

by Andrey Akinshin · 2021-02-16
THIS POST IS OUTDATED. Up-to-date preprint: PDF / arXiv:2207.12005 [stat.ME]
The below text contains an intermediate snapshot of the research and is preserved for historical purposes.

The median absolute deviation ($\textrm{MAD}$) is a robust measure of scale. In the previous post, I showed how to use the unbiased version of the $\textrm{MAD}$ estimator as a robust alternative to the standard deviation. “Unbiasedness” means that such estimator’s expected value equals the true value of the standard deviation. Unfortunately, there is such thing as the bias–variance tradeoff: when we remove the bias of the $\textrm{MAD}$ estimator, we increase its variance and mean squared error ($\textrm{MSE}$).

In this post, I want to suggest a more efficient unbiased $\textrm{MAD}$ estimator. It’s also a consistent estimator for the standard deviation, but it has smaller $\textrm{MSE}$. To build this estimator, we should replace the classic “straightforward” median estimator with the Harrell-Davis quantile estimator and adjust bias-correction factors. Let’s discuss this approach in detail.

## Introduction

Let’s consider a sample $x = \{ x_1, x_2, \ldots, x_n \}$. Its median absolute deviation $\textrm{MAD}_n$ can be defined as follows:

$$\textrm{MAD}_n = C_n \cdot \textrm{median}(|X - \textrm{median}(X)|)$$

where $\textrm{median}$ is a median estimator, $C_n$ is a scale factor (or consistency constant).

Typically, $\textrm{median}$ assumes the classic “straightforward” median estimator:

• If $n$ is odd, the median is the middle element of the sorted sample
• If $n$ is even, the median is the arithmetic average of the two middle elements of the sorted sample

However, we can use other median estimators. Let’s consider $\textrm{median}_{\textrm{HD}}$ which calculates the median using the Harrell-Davis quantile estimator (see ):

$$\textrm{median}_{\textrm{HD}}(x) = \sum_{i=1}^n W_i x_i, \quad W_i = I_{i/n} \bigg( \frac{n+1}{2}, \frac{n+1}{2} \bigg) - I_{(i-1)/n} \bigg( \frac{n+1}{2}, \frac{n+1}{2} \bigg)$$

where $I_t(a, b)$ denotes the regularized incomplete beta function.

When $n \to \infty$, we have the exact value of $C_n$ which makes $\textrm{MAD}_n$ an unbiased estimator for the standard deviation:

$$C_n = C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056$$

For finite values of $n$, we should use adjusted $C_n$ values. We already discussed these values for the straightforward median estimator in the previous post (this problem is well-covered in ). For $\textrm{median}_{\textrm{HD}}$, we should use another set of $C_n$ values. We reuse the approach from with the following notation:

$$C_n = \dfrac{1}{\hat{a}_n}$$

## Factors for small n

Factors for the small values of $n$ can be obtained using numerical experiments. Here is the scheme for the performed simulation:

• Enumerate $n$ from $2$ to $100$
• For each $n$, generate $2\cdot 10^8$ samples from the normal distribution of size $n$ (the Box–Muller transform is used, see [Box1958])
• For each sample, estimate $\textrm{MAD}_n$ using $C_n = 1$
• Calculate the arithmetic average of all $\textrm{MAD}_n$ values for the same $n$: the result is the value of $\hat{a}_n$

As the result of this simulation, I got the following table with $\hat{a}_n$ and $C_n$ values for $n \in \{ 2..100 \}$.

$n$$\hat{a}_n$$C_n$$n$$\hat{a}_n$$C_n 1NANA510.666491.50039 20.564171.77250520.666681.49998 30.637691.56816530.666821.49966 40.626611.59589540.666991.49926 50.638531.56611550.667131.49895 60.638341.56656560.667281.49863 70.639151.56458570.667411.49833 80.641411.55908580.667531.49805 90.642371.55675590.667671.49774 100.643971.55288600.667801.49746 110.645351.54955610.667911.49720 120.646621.54651620.668031.49694 130.647901.54346630.668151.49667 140.649081.54064640.668251.49644 150.650181.53803650.668361.49621 160.651251.53552660.668461.49597 170.652261.53313670.668571.49574 180.653171.53101680.668651.49555 190.654041.52896690.668761.49531 200.654891.52698700.668831.49514 210.655651.52520710.668931.49493 220.656381.52351720.669011.49475 230.657081.52190730.669101.49456 240.657711.52043740.669181.49437 250.658321.51902750.669251.49422 260.658881.51772760.669331.49402 270.659431.51647770.669401.49387 280.659911.51536780.669481.49370 290.660361.51433790.669551.49354 300.660821.51328800.669621.49339 310.661231.51233810.669681.49325 320.661611.51146820.669741.49312 330.662001.51057830.669801.49298 340.662351.50977840.669881.49281 350.662701.50899850.669931.49270 360.663021.50824860.669991.49257 370.663341.50753870.670051.49244 380.663621.50688880.670091.49233 390.663911.50623890.670161.49219 400.664171.50563900.670211.49207 410.664431.50504910.670261.49196 420.664691.50447920.670311.49185 430.664931.50393930.670361.49174 440.665151.50341940.670411.49161 450.665391.50289950.670461.49152 460.665571.50246960.670491.49144 470.665781.50200970.670551.49131 480.665981.50155980.670601.49121 490.666161.50115990.670631.49114 500.666331.500761000.670681.49102 Here is a visualization of this table: ## Factors for huge n To build the equation for n > 100, we also continue the approach suggested in (pages 2208-2209) and use the prediction equation of the following form:$$ \hat{a}_n = \Phi^{-1}(3/4) \Bigg( 1 - \dfrac{\alpha}{n} - \dfrac{\beta}{n^2} \Bigg). $$We simulated approximations of \hat{a}_n for some large n values: n$$\hat{a}_n$
1000.6707
1100.6711
1200.6714
1300.6716
1400.6719
1500.6720
2000.6727
2500.6731
3000.6733
3500.6735
4000.6736
4500.6737
5000.6738
10000.6742
15000.6743
20000.6743

Next, we fitted these values using the multiple linear regression. The dependent variable is $y = 1 - \hat{a}_n / \Phi^{-1}(3/4)$; the independent variables are $n^{-1}$ and $n^{-2}$; the intercept is zero:

$$y = \alpha n^{-1} + \beta n^{-2}.$$

The fitted model was adjusted a little bit to get nice-looking values (keeping the good accuracy level):

$$\alpha = 0.5,\quad \beta = 6.5.$$

These values give pretty accurate estimation of $\hat{a}_n$ for $n > 100$ (the accuracy is less than $10^{-4}$).

## The mean squared error

It’s time to compare the straightforward and the Harrell-Davis-based $\textrm{MAD}$ estimators in terms of the mean squared error. To estimate the $\textrm{MSE}$, we also use numerical simulations. Here are the results for some $n$ values:

nStraightforwardHarrell-Davis
30.6900.272
40.3270.205
50.3410.181
100.1360.100
200.0690.055
300.0450.038
400.0340.029
500.0270.024
1000.0140.012

As we can see, the difference between estimators for small samples is tangible. For example, for $n = 3$, we got $\textrm{MSE} \approx 0.69$ for the straightforward estimator vs. $\textrm{MSE} \approx 0.27$ for the Harrell-Davis-based estimator. Below you can see density plots of $\textrm{MSE}$ for the above $n$ values.

## Conclusion

In this post, we discussed the unbiased $\textrm{MAD}_n$ estimator which is based on the Harrell-Davis quantile estimator instead of the straightforward median estimator. The suggested estimator is much more efficient for small samples because it has smaller $\textrm{MSE}$. It allows using $\textrm{MAD}_n$ as a efficient alternative to the standard deviation under the normal distribution with higher accuracy.

## References

• [Harrell1982]
Harrell, F.E. and Davis, C.E., 1982. A new distribution-free quantile estimator. Biometrika, 69(3), pp.635-640.
https://doi.org/10.2307/2335999
• [Hayes2014]
Hayes, Kevin. “Finite-sample bias-correction factors for the median absolute deviation.” Communications in Statistics-Simulation and Computation 43, no. 10 (2014): 2205-2212.
https://doi.org/10.1080/03610918.2012.748913
• [Park2020]
Park, Chanseok, Haewon Kim, and Min Wang. “Investigation of finite-sample properties of robust location and scale estimators.” Communications in Statistics-Simulation and Computation (2020): 1-27.
https://doi.org/10.1080/03610918.2019.1699114
• [Box1958]
Box, George EP. “A note on the generation of random normal deviates.” Ann. Math. Statist. 29 (1958): 610-611.
https://doi.org/10.1214/aoms/1177706645