Standard quantile absolute deviation



Update: this blog post is part of research that aimed to build a new measure of statistical dispersion called the quantile absolute deviation. A preprint with the final results is available on arXiv: arXiv:2208.13459 [stat.ME]. Some information in this blog post may be obsolete: please use the preprint as the primary reference.

The median absolute deviation (MAD) is a popular robust replacement for the standard deviation (StdDev). It’s truly robust: its breakdown point is \(50\%\). However, it’s not so efficient when we use it as a consistent estimator for the standard deviation under normality: the asymptotic relative efficiency against StdDev (we call it the Gaussian efficiency) is only \(\approx 37\%\).

In practice, such robustness is not always essential, while we typically want to have the highest possible efficiency. I already described the concept of the quantile absolute deviation which aims to provide a customizable trade-off between robustness and efficiency. In this post, I would like to suggest a new default option for this measure of dispersion called the standard quantile absolute deviation. Its Gaussian efficiency is \(\approx 54\%\), while the breakdown point is \(\approx 32\%\).

Introduction

In the context of this post, we consider the quantile absolute deviation (\(\operatorname{QAD}\)) around the median:

\[\newcommand{MAD}{\operatorname{MAD}} \newcommand{QAD}{\operatorname{QAD}} \newcommand{SQAD}{\operatorname{SQAD}} \newcommand{Q}{\operatorname{Q}} \newcommand{SD}{\operatorname{SD}} \newcommand{QHF}{\operatorname{Q}_{\operatorname{HF7}}} \newcommand{QHD}{\operatorname{Q}_{\operatorname{HD}}} \newcommand{QTHD}{\operatorname{Q}_{\operatorname{THD-SQRT}}} \newcommand{erf}{\operatorname{erf}} \newcommand{erfinv}{\operatorname{erf}^{-1}} \newcommand{E}{\mathbb{E}} \newcommand{V}{\mathbb{V}} \QAD(X, p) = \Q(|X - \Q(X, 0.5)|, p), \]

where \(\Q\) is a quantile estimator and \(X = \{ X_1, X_2, \ldots, X_n \}\) is a sample of i.i.d. random variables.

Obviously, the median absolute deviation is a special case of \(\operatorname{QAD}\):

\[\MAD(X) = \Q(|X - \Q(X, 0.5)|, 0.5) = \QAD(X, 0.5). \]

As for \(\Q\), we consider the traditional quantile estimator \(\QHF\) (Type 7 in the Hyndman-Fan taxonomy, see [Hyndman1996]).
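As a quick illustration, here is a minimal sketch of \(\QAD\) in Python (assuming numpy; `np.quantile`’s default method coincides with Type 7 in the Hyndman-Fan taxonomy). The function names `qad` and `mad` are mine, chosen for this post:

```python
import numpy as np

def qad(x, p):
    """Quantile absolute deviation around the median (illustrative sketch).

    Uses numpy's default quantile method, which corresponds to
    Type 7 in the Hyndman-Fan taxonomy.
    """
    x = np.asarray(x, dtype=float)
    median = np.quantile(x, 0.5)               # Q(X, 0.5)
    return np.quantile(np.abs(x - median), p)  # Q(|X - median|, p)

def mad(x):
    """Median absolute deviation: the special case QAD(X, 0.5)."""
    return qad(x, 0.5)
```

For example, `qad([1, 2, 3, 4, 5], 0.5)` returns `1.0`: the median is 3, the absolute deviations are `[2, 1, 0, 1, 2]`, and their median is 1.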

Efficiency vs. robustness

The classic standard deviation estimator provides great Gaussian efficiency, but it is not robust (its breakdown point is 0). The median absolute deviation provides poor Gaussian efficiency, but it is highly robust (its breakdown point is 0.5). When we choose an estimator, we always have to deal with a trade-off between efficiency and robustness. Finding the right balance can be a challenging task.

It is said that \(\MAD\) has the highest possible breakdown point for a scale estimator. Strictly speaking, we have a theoretical option to build a scale estimator with a higher breakdown point (e.g., \(\QAD(X, 0.25)\)). However, such an estimator would not make much sense. Indeed, if more than 50% of the sample elements are corrupted by outliers, these outliers become the major part of the distribution. If these outliers are non-representative (obtained due to measurement errors or other technical mistakes; gross errors), we should completely ignore them or reconsider the way we collect the data. Moreover, if the median itself is corrupted, estimating the deviation around the median is meaningless. If these outliers are representative (they actually describe the true nature of the underlying model), we should treat them as a part of the distribution.

While 0.5 is the highest breakdown point for a scale estimator, it is not necessarily the optimal one. Let us consider a situation when 49% of a sample is contaminated by outliers. The \(\MAD\) can handle such a situation and provide a non-corrupted estimation. But do we really want to use \(\MAD\) here? If 49% of the sample elements are outliers, do we really want to treat them as outliers? At some point, it makes sense to consider them an essential part of the distribution. Probably the distribution is multimodal, and it makes sense to consider each mode separately. In any case, a single \(\MAD\) value can be quite misleading in such a situation.

Now let us say that we have learned to detect samples with a significant number of outliers and to analyze them using a dedicated approach. In this case, we do not really need the 0.5 breakdown point, which means we can trade some robustness for statistical efficiency. The next reasonable question is how to choose the optimal breakdown point.

Choosing the breakdown point

Let us recall the density plot of the normal distribution:


The \(\MAD\) and other measures of dispersion are often used as consistent estimators for the standard deviation. The idea is simple: we estimate the \(\MAD\), multiply it by a scale constant, and obtain an estimate of the standard deviation. It works well when the distribution is continuous, light-tailed, unimodal, and has only slight deviations from normality. In this case, we typically assume that the interval \([\mu-\sigma; \mu+\sigma]\) is not so distorted compared to the normal distribution. This assumption is violated if some elements of \([\mu-\sigma; \mu+\sigma]\) are corrupted. Let us consider the “protection” of this interval as a reference condition for choosing the breakdown point.

For the normal distribution, \([\mu-\sigma; \mu+\sigma]\) covers exactly \(\Phi(1)-\Phi(-1) = 2\Phi(1) - 1 \approx 0.6827 = 68.27\%\) of the distribution. This gives us a breakdown point equal to \(1 - (\Phi(1)-\Phi(-1)) \approx 0.3173 = 31.73\%\). It can be interpreted as follows: if less than one-third of the distribution is contaminated, we are in the safe zone: the scale estimations are not corrupted. If more than one-third of the distribution is contaminated, we should split the distribution into several modes and analyze each mode independently. A breakdown point of \(31.73\%\) looks like an admissible value.
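The same arithmetic can be checked in code; a sketch using only the standard library’s `statistics.NormalDist` (the variable names are mine):

```python
from statistics import NormalDist

phi = NormalDist().cdf          # standard normal CDF
coverage = phi(1) - phi(-1)     # share of the distribution inside [mu-sigma; mu+sigma]
breakdown = 1 - coverage        # the corresponding breakdown point
print(f"{coverage:.4f} {breakdown:.4f}")  # 0.6827 0.3173
```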

The standard quantile absolute deviation

Since we have chosen the desired asymptotic breakdown point, we can define the corresponding measure of scale using \(\QAD\). Let us call it the standard quantile absolute deviation and denote it by \(\SQAD\). Asymptotically, it can be defined as follows:

\[\SQAD(X) = \QAD(X, 2\Phi(1) - 1). \]

There is no need to introduce a scale constant: the asymptotic expected value of \(\SQAD\) for the standard normal distribution is 1. Since the sample quantiles are Fisher-consistent for the distribution quantiles when the density is positive at the corresponding quantile (see Theorem 8.5.1 in [Arnold2008]), \(\SQAD\) is Fisher-consistent for the standard deviation:

\[\E[\SQAD(X)] = \E[\QAD(X, 2\Phi(1) - 1)] = \E[\Q(|X|, 2\Phi(1) - 1)] = 1. \]

If we want to make \(\SQAD\) an unbiased estimator for \(\SD\) under normality, we have to introduce the bias-correction factor \(C_n\):

\[\SQAD_n(X) = C_n \cdot \QAD(X, 2\Phi(1) - 1). \]
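A sketch of this estimator in Python (assuming numpy, whose default quantile method is Hyndman-Fan Type 7; the `factor` argument stands for \(C_n\) and defaults to the asymptotic value of 1):

```python
import math
import numpy as np

# p = 2*Phi(1) - 1 = erf(1/sqrt(2)) ~ 0.6827, where Phi is the standard normal CDF
P_SQAD = math.erf(1 / math.sqrt(2))

def sqad(x, factor=1.0):
    """Standard quantile absolute deviation (illustrative sketch).

    `factor` is the bias-correction constant C_n (1.0 gives the
    asymptotic form). Uses numpy's default Type 7 quantile method.
    """
    x = np.asarray(x, dtype=float)
    deviations = np.abs(x - np.quantile(x, 0.5))
    return factor * np.quantile(deviations, P_SQAD)
```

For large samples from the standard normal distribution, `sqad(x)` should land close to 1, matching the standard deviation without any extra scale constant.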

The factor values can be easily obtained using Monte-Carlo simulations. Here is the plot with these values for \(n \leq 100\):


And here are the raw factor values:

   n  factor
   3  1.35070
   4  1.37644
   5  1.18794
   6  1.17720
   7  1.12869
   8  1.12460
   9  1.09191
  10  1.09434
  11  1.07640
  12  1.07376
  13  1.06312
  14  1.06379
  15  1.05354
  16  1.05383
  17  1.04811
  18  1.04673
  19  1.04203
  20  1.04285
  21  1.03765
  22  1.03745
  23  1.03516
  24  1.03428
  25  1.03139
  26  1.03192
  27  1.02910
  28  1.02915
  29  1.02715
  30  1.02712
  31  1.02504
  32  1.02533
  33  1.02376
  34  1.02346
  35  1.02234
  36  1.02257
  37  1.02110
  38  1.02097
  39  1.02011
  40  1.01985
  41  1.01890
  42  1.01917
  43  1.01806
  44  1.01800
  45  1.01735
  46  1.01722
  47  1.01654
  48  1.01655
  49  1.01577
  50  1.01577
  51  1.01518
  52  1.01524
  53  1.01466
  54  1.01458
  55  1.01413
  56  1.01404
  57  1.01347
  58  1.01369
  59  1.01299
  60  1.01310
  61  1.01286
  62  1.01258
  63  1.01230
  64  1.01237
  65  1.01183
  66  1.01194
  67  1.01151
  68  1.01145
  69  1.01109
  70  1.01120
  71  1.01082
  72  1.01089
  73  1.01065
  74  1.01056
  75  1.01019
  76  1.01023
  77  1.01006
  78  1.00999
  79  1.00973
  80  1.00977
  81  1.00945
  82  1.00949
  83  1.00926
  84  1.00923
  85  1.00905
  86  1.00903
  87  1.00888
  88  1.00879
  89  1.00862
  90  1.00864
  91  1.00845
  92  1.00843
  93  1.00819
  94  1.00821
  95  1.00813
  96  1.00820
  97  1.00780
  98  1.00789
  99  1.00776
 100  1.00778
 109  1.00700
 110  1.00698
 119  1.00643
 120  1.00642
 129  1.00585
 130  1.00597
 139  1.00550
 140  1.00557
 149  1.00516
 150  1.00514
 159  1.00484
 160  1.00479
 169  1.00452
 170  1.00460
 179  1.00429
 180  1.00426
 189  1.00401
 190  1.00406
 199  1.00378
 200  1.00381
 249  1.00307
 250  1.00307
 299  1.00258
 300  1.00259
 349  1.00219
 350  1.00219
 399  1.00192
 400  1.00190
 449  1.00173
 450  1.00168
 499  1.00151
 500  1.00152
 600  1.00126
 700  1.00109
 800  1.00094
 900  1.00086
1000  1.00076
1500  1.00050
2000  1.00038
3000  1.00026
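Such factors can be reproduced with a minimal Monte-Carlo sketch (assuming numpy; the function name, iteration count, and seed are arbitrary choices of mine):

```python
import math
import numpy as np

P_SQAD = math.erf(1 / math.sqrt(2))  # p = 2*Phi(1) - 1

def estimate_factor(n, iterations=100_000, seed=1729):
    """Monte-Carlo estimate of the bias-correction factor C_n:
    the reciprocal of the mean SQAD over standard normal samples of size n."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((iterations, n))
    medians = np.quantile(x, 0.5, axis=1, keepdims=True)
    qads = np.quantile(np.abs(x - medians), P_SQAD, axis=1)
    return 1.0 / qads.mean()
```

For instance, `estimate_factor(3)` should land close to the tabulated `1.35070` (up to Monte-Carlo noise).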

For \(n>100\), we can use the following prediction equation (obtained using least squares):

\[C_n = 1 + \frac{0.762}{n} + \frac{0.868}{n^2}. \]
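In code (a sketch; the fitted terms are read as inverse powers of \(n\), which is what matches the tabulated factor values):

```python
def predicted_factor(n):
    """Least-squares prediction of the bias-correction factor C_n for n > 100."""
    return 1 + 0.762 / n + 0.868 / n ** 2
```

For instance, `predicted_factor(200)` gives about `1.00383`, against the tabulated `1.00381`.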

This equation perfectly matches the existing estimations:


The statistical efficiency of SQAD and MAD

We perform one more simulation study to evaluate the Gaussian efficiency of \(\SQAD\) and \(\MAD\). As the baseline, we consider the unbiased standard deviation \(\SD_n\):

\[\SD_n(X) = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2} \bigg/ c_4(n), \quad c_4(n) = \sqrt{\frac{2}{n-1}}\frac{\Gamma(\frac{n}{2})}{\Gamma(\frac{n-1}{2})}. \]

The scheme of this simulation is straightforward: for each sample size \(n\), we generate multiple random samples of the given size from the standard normal distribution. For an unbiased scale estimator \(T_n\), we define the Gaussian efficiency as

\[e(T_n) = \frac{\V[\SD_n]}{\V[T_n]}. \]
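A compact sketch of this scheme (assuming numpy; `c4` is computed via the standard library's `math.lgamma`, and the \(\SQAD\) values are rescaled by their empirical mean as a stand-in for the exact bias-correction factor \(C_n\)):

```python
import math
import numpy as np

P_SQAD = math.erf(1 / math.sqrt(2))  # p = 2*Phi(1) - 1

def c4(n):
    """Bias-correction constant for the sample standard deviation."""
    return math.sqrt(2 / (n - 1)) * math.exp(math.lgamma(n / 2) - math.lgamma((n - 1) / 2))

def gaussian_efficiency_sqad(n, iterations=50_000, seed=42):
    """Monte-Carlo Gaussian efficiency of SQAD relative to the unbiased SD."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((iterations, n))
    sd = x.std(axis=1, ddof=1) / c4(n)                # unbiased SD_n
    med = np.quantile(x, 0.5, axis=1, keepdims=True)
    sqad = np.quantile(np.abs(x - med), P_SQAD, axis=1)
    sqad = sqad / sqad.mean()                         # empirical bias normalization
    return sd.var() / sqad.var()
```

For \(n = 100\), this lands around \(0.55\), in line with the tabulated values.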

Here is the plot of the efficiency values for \(n \leq 100\):


And here is the raw efficiency values (including the obtained biases and the standardized variance values):

    n  eff.mad  eff.sqad
    3  0.40085  0.92931
    4  0.54596  0.61559
    5  0.38562  0.64954
    6  0.46252  0.67588
    7  0.37846  0.58748
    8  0.43354  0.59877
    9  0.37588  0.62040
   10  0.41818  0.58368
   11  0.37429  0.57532
   12  0.40784  0.60043
   13  0.37305  0.57759
   14  0.40263  0.56445
   15  0.37202  0.58536
   16  0.39787  0.57645
   17  0.37180  0.55338
   18  0.39328  0.57663
   19  0.37113  0.57120
   20  0.39053  0.54768
   21  0.37051  0.56875
   22  0.38851  0.56950
   23  0.37061  0.54519
   24  0.38686  0.56315
   25  0.36999  0.56659
   26  0.38448  0.54974
   27  0.37002  0.55883
   28  0.38386  0.56476
   29  0.36996  0.55093
   30  0.38231  0.55479
   31  0.36968  0.56303
   32  0.38174  0.55344
   33  0.36925  0.55094
   34  0.38118  0.56029
   35  0.36908  0.55418
   36  0.37986  0.54853
   37  0.36925  0.55809
   38  0.37960  0.55573
   39  0.36926  0.54442
   40  0.37839  0.55552
   41  0.36904  0.55548
   42  0.37812  0.54209
   43  0.36856  0.55365
   44  0.37729  0.55472
   45  0.36934  0.54509
   46  0.37655  0.55243
   47  0.37010  0.55612
   48  0.37746  0.54628
   49  0.36910  0.55126
   50  0.37615  0.55376
   51  0.36922  0.54797
   52  0.37519  0.54783
   53  0.36956  0.55422
   54  0.37582  0.54860
   55  0.36941  0.54726
   56  0.37536  0.55278
   57  0.36917  0.54985
   58  0.37500  0.54461
   59  0.36867  0.55163
   60  0.37506  0.55015
   61  0.36870  0.54308
   62  0.37489  0.55133
   63  0.36841  0.54997
   64  0.37435  0.54182
   65  0.36894  0.55011
   66  0.37401  0.54976
   67  0.36843  0.54372
   68  0.37368  0.54816
   69  0.36877  0.54995
   70  0.37373  0.54550
   71  0.36787  0.54635
   72  0.37420  0.55036
   73  0.36872  0.54567
   74  0.37385  0.54617
   75  0.36847  0.54970
   76  0.37425  0.54679
   77  0.36892  0.54448
   78  0.37262  0.54791
   79  0.36825  0.54709
   80  0.37277  0.54329
   81  0.36844  0.54902
   82  0.37253  0.54649
   83  0.36827  0.54142
   84  0.37260  0.54621
   85  0.36851  0.54788
   86  0.37232  0.54154
   87  0.36828  0.54736
   88  0.37260  0.54789
   89  0.36819  0.54330
   90  0.37219  0.54570
   91  0.36857  0.54809
   92  0.37199  0.54407
   93  0.36819  0.54459
   94  0.37212  0.54758
   95  0.36929  0.54554
   96  0.37203  0.54423
   97  0.36895  0.54764
   98  0.37204  0.54546
   99  0.36833  0.54300
  100  0.37240  0.54883
  109  0.36846  0.54636
  110  0.37218  0.54689
  119  0.36797  0.54649
  120  0.37157  0.54510
  129  0.36793  0.54593
  130  0.37119  0.54175
  139  0.36784  0.54373
  140  0.37056  0.54266
  149  0.36879  0.54139
  150  0.37090  0.54441
  159  0.36752  0.54201
  160  0.37026  0.54493
  169  0.36796  0.54429
  170  0.37005  0.54434
  179  0.36794  0.54417
  180  0.37034  0.54388
  189  0.36822  0.54427
  190  0.36989  0.54206
  199  0.36770  0.54290
  200  0.36926  0.54206
  249  0.36793  0.54262
  250  0.36917  0.54152
  299  0.36802  0.54363
  300  0.36926  0.54122
  349  0.36763  0.54216
  350  0.36865  0.54212
  399  0.36709  0.54143
  400  0.36864  0.54230
  449  0.36743  0.54104
  450  0.36892  0.54312
  499  0.36719  0.53987
  500  0.36840  0.54189
  600  0.36851  0.54134
  700  0.36897  0.54134
  800  0.36744  0.54142
  900  0.36934  0.54187
 1000  0.36741  0.53976
 1500  0.36821  0.54124
 2000  0.36836  0.54144
 3000  0.36810  0.54149
10000  0.36782  0.54066
50000  0.36810  0.54080

According to this simulation study, the Gaussian efficiency of \(\SQAD\) is \(\approx 54\%\).

Asymptotic efficiency of SQAD

We already know the equation for the Gaussian efficiency of \(\QAD\):

\[\lim_{n \to \infty} e(\QAD_n(X, p),\; \SD_n(X)) = \frac{2\big(\erfinv(p)\big)^2}{\pi p(1-p) \exp\Big(2\big(\erfinv(p) \big)^2 \Big)}. \]

To adapt this equation for \(\SQAD\), we should put \(p = \Phi(1) - \Phi(-1)\).

Since \(\Phi(1) = (1 + \erf(1/\sqrt{2})) / 2\), we can write

\[\erf(1/\sqrt{2}) = 2\Phi(1) - 1 = \Phi(1) - \Phi(-1), \]

which is the same as

\[2\Big(\erfinv(\Phi(1) - \Phi(-1)) \Big)^2 = 1. \]

This gives us a simple expression for the Gaussian efficiency of \(\SQAD_n(X)\):

\[\lim_{n \to \infty} e(\SQAD_n(X),\; \SD_n(X)) = \Big( \pi e \cdot (\Phi(1)-\Phi(-1)) \cdot (1-\Phi(1)+\Phi(-1)) \Big)^{-1} \approx 0.540565. \]
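This value can be checked numerically with the standard library alone. In the sketch below (function names are mine), `erfinv` is recovered from `statistics.NormalDist`, and the efficiency expression includes the \(2(\erfinv(p))^2\) numerator, which also reproduces the \(\approx 37\%\) Gaussian efficiency of \(\MAD\) at \(p = 0.5\):

```python
import math
from statistics import NormalDist

def erfinv(x):
    # Phi(z) = (1 + erf(z / sqrt(2))) / 2  =>  erfinv(x) = Phi^{-1}((1 + x) / 2) / sqrt(2)
    return NormalDist().inv_cdf((1 + x) / 2) / math.sqrt(2)

def qad_efficiency(p):
    """Asymptotic Gaussian efficiency of QAD(X, p) against the unbiased SD."""
    t = 2 * erfinv(p) ** 2
    return t / (math.pi * p * (1 - p) * math.exp(t))

p_sqad = math.erf(1 / math.sqrt(2))      # 2*Phi(1) - 1, so t = 1 here
print(round(qad_efficiency(p_sqad), 6))  # 0.540565
print(round(qad_efficiency(0.5), 4))     # 0.3675 (the MAD case)
```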


Summary

In this post, we discussed a new measure of statistical dispersion called the standard quantile absolute deviation (\(\SQAD\)) which is consistent with the standard deviation under normality. Its breakdown point is \(\approx 31.73\%\). When such robustness is enough, \(\SQAD\) can be a decent replacement for the median absolute deviation because it has higher statistical efficiency (\(\approx 54\%\)).

References