Calculating gamma effect size for samples with zero median absolute deviation



In previous posts, I discussed the gamma effect size, which is a nonparametric and robust measure of the effect size that is consistent with Cohen’s d. I also discussed various ways to customize this metric and adjust it to different kinds of business requirements. In this post, I want to briefly cover one more corner case that requires special adjustments: the situation when the median absolute deviation is zero.

Recall

First of all, recall the general equation for the gamma effect size for the \(p^\textrm{th}\) quantile:

\[\gamma_p = \frac{Q_p(y) - Q_p(x)}{\operatorname{PMAD}_{xy}} \]

where \(Q_p\) is a quantile estimator of the \(p^\textrm{th}\) quantile, \(\operatorname{PMAD}_{xy}\) is the pooled median absolute deviation:

\[\operatorname{PMAD}_{xy} = \sqrt{\frac{(n_x - 1) \operatorname{MAD}^2_x + (n_y - 1) \operatorname{MAD}^2_y}{n_x + n_y - 2}}, \]

\(\operatorname{MAD}_x\) and \(\operatorname{MAD}_y\) are the median absolute deviations of \(x\) and \(y\):

\[\operatorname{MAD}_x = C_{n_x} \cdot Q_{0.5}(|x_i - Q_{0.5}(x)|), \quad \operatorname{MAD}_y = C_{n_y} \cdot Q_{0.5}(|y_i - Q_{0.5}(y)|), \]

\(C_{n_x}\) and \(C_{n_y}\) are consistency constants that make \(\operatorname{MAD}\) a consistent estimator of the standard deviation.
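The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `quantile` helper uses plain linear interpolation (Hyndman-Fan type 7) as one possible choice for \(Q_p\), and the constant `C = 1.4826` is the asymptotic value for the normal distribution, used here as a stand-in for the sample-size-dependent \(C_n\). All function names are hypothetical.

```python
import math

# Assumption: asymptotic consistency constant (normal distribution),
# used instead of a finite-sample C_n.
C = 1.4826

def quantile(xs, p):
    """Sample quantile with linear interpolation (one possible choice for Q_p)."""
    s = sorted(xs)
    h = (len(s) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])

def mad(xs):
    """Median absolute deviation around the median, scaled by C."""
    m = quantile(xs, 0.5)
    return C * quantile([abs(v - m) for v in xs], 0.5)

def pmad(x, y):
    """Pooled median absolute deviation of two samples."""
    nx, ny = len(x), len(y)
    return math.sqrt(((nx - 1) * mad(x) ** 2 + (ny - 1) * mad(y) ** 2) / (nx + ny - 2))

def gamma_effect_size(x, y, p):
    """Gamma effect size for the p-th quantile."""
    return (quantile(y, p) - quantile(x, p)) / pmad(x, y)

# Toy samples (hypothetical data): y is x shifted by 2.
x = [1, 2, 3, 4, 5]
y = [3, 4, 5, 6, 7]
print(gamma_effect_size(x, y, 0.5))
```

For these toy samples both MAD values equal `C`, so the result is simply the shift of 2 divided by `C`.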

The problem

Here is a real-life dataset from my previous post:


And here is the corresponding histogram:


It’s hard to work with such a histogram because of the scale, so here is its raw data:

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
1996   92   45   23   20   21   14    5   11    5    8    8    3    3    9    5 
  16   17   18   19   20   21   22   23   25   26   27   31   33   34   35   36 
   3    5    1    1    3    5    4    1    4    1    1    1    1    1    1    2 
  37   41   46   49   56   62   63   65   71   72   73   85   91   94   95   97 
   1    2    1    2    1    1    2    1    1    1    1    2    1    1    1    1 
  98  100  102  103  107  109  114  117  119  124  125  126  132  136  138  140 
   1    1    1    1    1    1    1    1    1    2    2    1    1    1    1    2 
 143  146  147  148  152  153  158  160  161  162  163  164  165  166  167  168 
   1    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
 172  173  175  177  178  179  183  184  185  186  187  188  189  190  196  199 
   1    1    1    1    1    1    1    1    1    2    1    1    1    1    2    1 
 201  203  204  206  209  211  215  217  218  223  224  231  238  242  243  246 
   1    1    1    1    1    2    1    1    1    1    1    1    1    1    1    1 
 260  261  262  263  264  265  273  288  289  295  297  298  303  309  313  320 
   1    1    1    1    1    1    1    1    1    1    2    1    2    1    1    1 
 347  350 
   1    1 

These numbers mean the following: we observed 0ms 1996 times, 1ms 92 times, 2ms 45 times, and so on. And here is the same set of numbers scaled to percentages:

    0     1     2     3     4     5     6     7     8     9    10    11    12 
82.68  3.81  1.86  0.95  0.83  0.87  0.58  0.21  0.46  0.21  0.33  0.33  0.12 
   13    14    15    16    17    18    19    20    21    22    23    25    26 
 0.12  0.37  0.21  0.12  0.21  0.04  0.04  0.12  0.21  0.17  0.04  0.17  0.04 
   27    31    33    34    35    36    37    41    46    49    56    62    63 
 0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.08  0.04  0.08  0.04  0.04  0.08 
   65    71    72    73    85    91    94    95    97    98   100   102   103 
 0.04  0.04  0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  107   109   114   117   119   124   125   126   132   136   138   140   143 
 0.04  0.04  0.04  0.04  0.04  0.08  0.08  0.04  0.04  0.04  0.04  0.08  0.04 
  146   147   148   152   153   158   160   161   162   163   164   165   166 
 0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  167   168   172   173   175   177   178   179   183   184   185   186   187 
 0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.08  0.04 
  188   189   190   196   199   201   203   204   206   209   211   215   217 
 0.04  0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.04 
  218   223   224   231   238   242   243   246   260   261   262   263   264 
 0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  265   273   288   289   295   297   298   303   309   313   320   347   350 
 0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.08  0.04  0.04  0.04  0.04  0.04

Thus, we observed 0ms in 82.68% of cases, 1ms in 3.81% of cases, 2ms in 1.86% of cases, and so on. Since more than half of the observations are equal to zero, both the median and the median absolute deviation are zero in this data set. Here the observed data could be perfectly described using the discrete Weibull distribution.

If we try to compare samples from similar distributions using the gamma effect size, we get a problem: the pooled median absolute deviation in the denominator is zero, so \(\gamma_p\) is undefined.
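Here is a minimal Python sketch of this failure. The samples are synthetic (they are not the dataset above) and are chosen so that 83% of the values are zero. The constant `C = 1.4826` is the asymptotic normal consistency constant, assumed here in place of \(C_n\).

```python
import math
from statistics import median

C = 1.4826  # assumption: asymptotic constant instead of a finite-sample C_n

def mad(xs):
    """Median absolute deviation around the median, scaled by C."""
    m = median(xs)
    return C * median([abs(v - m) for v in xs])

def pmad(x, y):
    """Pooled median absolute deviation of two samples."""
    nx, ny = len(x), len(y)
    return math.sqrt(((nx - 1) * mad(x) ** 2 + (ny - 1) * mad(y) ** 2) / (nx + ny - 2))

# Synthetic zero-inflated samples (hypothetical data): 83 zeros out of 100 values.
x = [0] * 83 + [1] * 4 + [2] * 2 + [5, 9, 14, 21, 36, 62, 97, 140, 186, 243, 350]
y = [2 * v for v in x]  # same shape with a stretched tail

print(median(x), mad(x))  # median and MAD of x are both zero
print(median(y), mad(y))  # median and MAD of y are both zero

try:
    gamma = (median(y) - median(x)) / pmad(x, y)
except ZeroDivisionError:
    gamma = None  # the effect size is undefined for these samples
print(gamma)
```

Note that the two samples clearly differ in their tails, yet the median-based comparison cannot see any difference and cannot even produce a number.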

QAD to the rescue

In the above scenario, it’s meaningless to compare median values. We can have different distributions with equal median values (and zero median absolute deviations). In such cases, it makes sense to compare higher quantiles instead of the median. However, this alone doesn’t solve the zero denominator problem.

The problem can be solved using the Quantile Absolute Deviation (QAD) around the given quantile:

\[\operatorname{QAD}_x(p, q) = C_n \cdot Q_q(|x_i - Q_p(x)|) \]

It’s easy to see that the \(\operatorname{MAD}\) is just a special case of \(\operatorname{QAD}\):

\[\operatorname{MAD}_x = \operatorname{QAD}_x(0.5, 0.5). \]

By analogy with \(\operatorname{MAD}\), we can define the pooled quantile absolute deviation \(\operatorname{PQAD}_{xy}\):

\[\operatorname{PQAD}_{xy}(p, q) = \sqrt{\frac{ (n_x - 1) \operatorname{QAD}^2_x(p, q) + (n_y - 1) \operatorname{QAD}^2_y(p, q)}{n_x + n_y - 2}}. \]

When we estimate the gamma effect size of the \(p^\textrm{th}\) quantile \(\gamma_p\), it makes perfect sense to evaluate the quantile absolute deviation around the same quantile, using \(\operatorname{PQAD}_{xy}(p, q)\) as the denominator instead of \(\operatorname{PMAD}_{xy}\). Although I don’t have specific recommendations for the value of \(q\), we can use \(q=0.5\) as a starting point and adjust it if necessary.
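A minimal Python sketch of this adjustment, under stated assumptions: the `quantile` helper uses linear interpolation as one possible choice for \(Q_p\), `C = 1.4826` is the asymptotic normal constant used as a stand-in for \(C_n\), and the samples are synthetic zero-inflated data rather than the dataset above. The function names (`qad`, `pqad`, `gamma_qad`) are hypothetical.

```python
import math

C = 1.4826  # assumption: asymptotic constant instead of a finite-sample C_n

def quantile(xs, p):
    """Sample quantile with linear interpolation (one possible choice for Q_p)."""
    s = sorted(xs)
    h = (len(s) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])

def qad(xs, p, q):
    """Quantile absolute deviation: q-th quantile of |x_i - Q_p(x)|, scaled by C."""
    center = quantile(xs, p)
    return C * quantile([abs(v - center) for v in xs], q)

def pqad(x, y, p, q):
    """Pooled quantile absolute deviation of two samples."""
    nx, ny = len(x), len(y)
    pooled = (nx - 1) * qad(x, p, q) ** 2 + (ny - 1) * qad(y, p, q) ** 2
    return math.sqrt(pooled / (nx + ny - 2))

def gamma_qad(x, y, p, q=0.5):
    """Gamma effect size for the p-th quantile with PQAD(p, q) as the denominator."""
    return (quantile(y, p) - quantile(x, p)) / pqad(x, y, p, q)

# Synthetic zero-inflated samples (hypothetical data): 83 zeros out of 100 values.
x = [0] * 83 + [1] * 4 + [2] * 2 + [5, 9, 14, 21, 36, 62, 97, 140, 186, 243, 350]
y = [2 * v for v in x]  # same shape with a stretched tail

print(qad(x, 0.5, 0.5))      # the classic MAD: zero for this sample
print(qad(x, 0.95, 0.5))     # deviation around the 95th percentile: non-zero
print(gamma_qad(x, y, 0.95)) # finite and positive: y has the heavier tail
```

In this sketch, the deviation is non-zero because it is measured around a quantile that lies in the tail, away from the large cluster of zeros. If the chosen \(p\) still falls inside that cluster, the deviation can remain zero, so \(p\) and \(q\) should be checked against the actual data.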

Conclusion

We should always remember that real-life data contains tons of corner cases that may become a problem when we want to analyze this data. It’s better to think about these corner cases in advance and come up with proper solutions. In this post, we patched the gamma effect size for distributions with zero median absolute deviation using the quantile absolute deviation. This trick allows comparing higher quantiles for such distributions. A typical real-life example is the discrete Weibull distribution.