Unbiased median absolute deviation based on the trimmed Harrell-Davis quantile estimator



The median absolute deviation (\(\operatorname{MAD}\)) is a robust measure of scale. For a sample \(x = \{ x_1, x_2, \ldots, x_n \}\), it’s defined as follows:

\[\operatorname{MAD}_n = C_n \cdot \operatorname{median}(|x - \operatorname{median}(x)|) \]

where \(\operatorname{median}\) is a median estimator, \(C_n\) is a scale factor. Using the right scale factor, we can use \(\operatorname{MAD}\) as a consistent estimator for the estimation of the standard deviation under the normal distribution. For huge samples, we can use the asymptotic value of \(C_n\) which is

\[C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. \]

For small samples, we should use adjusted values \(C_n\) which depend on the sample size. However, \(C_n\) depends not only on the sample size but also on the median estimator. I have already covered how to obtain this values for the traditional median estimator and the Harrell-Davis median estimator. It’s time to get the \(C_n\) values for the trimmed Harrell-Davis median estimator.

Simulation

In order to obtain the values of \(C_n\), we perform the following simulation:

  • Enumerate \(n\) from \(2\) to \(100\) and some other values up to \(2000\).
  • For each \(n\), generate \(2\cdot 10^8\) samples from the normal distribution of size \(n\) (the Box–Muller transform is used, see [Box1958])
  • For each sample, estimate \(\textrm{MAD}_n\) using \(C_n = 1\) and the trimmed Harrell-Davis quantile estimator based on the highest density interval of the width \(1/sqrt(n)\) (aka THD-SQRT).
  • Calculate the arithmetic average of all \(\textrm{MAD}_n\) values for the same \(n\): the result is the value of \(\hat{a}_n\)
  • The reverse value is the estimation of \(C_n = 1/\hat{a}_n\).

Warning: on my PC, the simulation takes several days to finish.

Results for small n

Here are the results for \(n \leq 100\):

n\(C_n\)
21.7724370908056342
31.6455078173901185
42.017065193495793
51.6774358847728241
61.6886228884833927
71.681012890164465
81.6363164157283878
91.6430595153542609
101.6137117695811252
111.6036656575624237
121.5938702075534783
131.5826054249267754
141.5770618699717638
151.568314144069548
161.5639360738331398
171.5574345825637932
181.5530068367429388
191.548786644010622
201.5449266898438678
211.5417268298909585
221.5385447997218455
231.5360433134061504
241.5333044878734197
251.5313026814346553
261.5289087814370173
271.5271766811326293
281.525372045244657
291.523823710130717
301.5223707036738177
311.5210003444944378
321.5198014780734148
331.5185318176366789
341.5174778699831464
351.5163204121308342
361.515455864530263
371.514367919034956
381.5135668482892533
391.512646068257695
401.5119358224536255
411.5111386979568142
421.5104143980692895
431.5097598792706493
441.5090605208097354
451.5085030887342528
461.5078365336949058
471.5073406253974977
481.506750314567103
491.5063216542860618
501.5056976215756004
511.5052645904785493
521.5047506790429501
531.504377980905483
541.5039313123542895
551.5035247994299705
561.5031546583298019
571.5027213700422273
581.5023816126090916
591.501979658633948
601.5016812388186662
611.5013235911210157
621.5010571020037526
631.5006852968235684
641.5004714716576355
651.5001142465452528
661.4998450017251428
671.4995612278126766
681.4993166731144414
691.499031830812627
701.498806802914833
711.4985656090426709
721.498311085994739
731.4980998089801867
741.4978614270801778
751.4976461613303127
761.497450811659424
771.4972614163628544
781.497019612065497
791.496874216833877
801.4966283982987492
811.4964878123374772
821.4962757052242632
831.496149471166867
841.4959392546341381
851.4957683906834602
861.4956060922384664
871.4954645049047413
881.495307563560535
891.4951503339020875
901.49500530886384
911.4948486175521838
921.494716574686503
931.4946021151410946
941.4944505753829063
951.4942998891427248
961.494198277219056
971.4940563287162394
981.4939476227923159
991.493783721728854
1001.4937221140129977

Results for huge n

I have also calculated \(C_n\) for some bigger sample sizes:

n\(C_n\)
1001.4937221140129977
1101.492629467879562
1201.491756950186543
1301.4910176116133274
1401.4903922085488397
1501.4898464598871315
2001.487963070934298
2501.4868507636216115
3001.4861201577554497
3501.4855909210465295
4001.4852251546501989
4501.4849172381877342
5001.4846796356493557
10001.4836281802634783
15001.4832857748830472
20001.4831139879531638
25001.4830091511037906
30001.4829398344041957
35001.4828931807149206
40001.4828555508375267
45001.4828287173605137

Following the approach from [Park2020], we express \(C_n\) using the following equation:

\[C_n = \frac{1}{\Phi^{-1}(3/4) \cdot (1 - \alpha / n - \beta / n^2)} \]

With the help of linear regression, we get the following values of \(\alpha\) and \(beta\):

\[\alpha \approx 0.69, \quad \beta = 5.14. \]

Here is a plot with both actual and predicted values of \(C_n\) for different sample sizes:


References