Customization of the nonparametric Cohen's d-consistent effect size

Andrey Akinshin · 2021-06-08

One year ago, I published a post called Nonparametric Cohen's d-consistent effect size. During this year, I got a lot of internal and external feedback from my own statistical experiments and from people who tried the suggested approach. It seems that the nonparametric version of Cohen’s d works much better with real-life, not-so-normal data. While the classic Cohen’s d, based on the non-robust arithmetic mean and the non-robust standard deviation, can be easily corrupted by a single outlier, my approach is much more resistant to unexpected extreme values. It also allows exploring the difference between specific quantiles of the considered samples, which can be useful in the non-parametric case.

However, I wasn’t satisfied with the results of all of my experiments. While I still like the basic idea (replace the mean with the median; replace the standard deviation with the median absolute deviation), it turned out that the final results heavily depend on the chosen quantile estimator. To be more specific, the original Harrell-Davis quantile estimator is not always optimal; in most cases, it’s better to replace it with its trimmed modification. However, the particular choice of quantile estimator depends on the situation. Also, the consistency constant for the median absolute deviation should be adjusted according to the current sample size and the used quantile estimator. Of course, the median absolute deviation itself can also be replaced by other dispersion estimators that can be used as consistent estimators of the standard deviation.

In this post, I want to give a brief overview of possible customizations of the suggested metrics.

The generic equations

Let’s say we have two samples $x = \{ x_1, x_2, \ldots, x_{n_x} \}$ and $y = \{ y_1, y_2, \ldots, y_{n_y} \}$. The “classic” Cohen’s d can be defined as follows:

$$ d = \frac{\overline{y}-\overline{x}}{s} $$

where $s$ is the pooled standard deviation:

$$ s = \sqrt{\frac{(n_x - 1) s^2_x + (n_y - 1) s^2_y}{n_x + n_y - 2}}. $$
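To make the equations concrete, here is a minimal sketch of the classic Cohen’s d with the pooled standard deviation. The function name `cohen_d` is illustrative, and only numpy is assumed:

```python
import numpy as np

def cohen_d(x, y):
    # Classic Cohen's d: difference of the means divided by the pooled standard deviation
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Unbiased sample variances (ddof=1), pooled with (n - 1) weights
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) /
                        (nx + ny - 2))
    return (y.mean() - x.mean()) / pooled_sd
```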

And here is the quantile-specific effect size suggested in the previous post:

$$ \gamma_p = \frac{Q_p(y) - Q_p(x)}{\operatorname{PMAD}_{xy}} $$

where $Q_p$ is an estimator of the $p^\textrm{th}$ quantile, and $\operatorname{PMAD}_{xy}$ is the pooled median absolute deviation:

$$ \operatorname{PMAD}_{xy} = \sqrt{\frac{(n_x - 1) \operatorname{MAD}^2_x + (n_y - 1) \operatorname{MAD}^2_y}{n_x + n_y - 2}}, $$

$\operatorname{MAD}_x$ and $\operatorname{MAD}_y$ are the median absolute deviations of $x$ and $y$:

$$ \operatorname{MAD}_x = C_{n_x} \cdot Q_{0.5}(|x_i - Q_{0.5}(x)|), \quad \operatorname{MAD}_y = C_{n_y} \cdot Q_{0.5}(|y_i - Q_{0.5}(y)|), $$

$C_{n_x}$ and $C_{n_y}$ are consistency constants that make $\operatorname{MAD}$ a consistent estimator of the standard deviation.
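The following sketch shows how $\gamma_p$ can be computed under simplifying assumptions: the traditional sample quantile estimator (`np.quantile`) stands in for $Q_p$, and the asymptotic consistency constant $\approx 1.4826$ replaces the sample-size-specific constants $C_n$. Both choices are placeholders for illustration, not the recommendations of this post:

```python
import numpy as np

def mad(x, consistency_constant=1.4826):
    # Median absolute deviation around the median, scaled to be an
    # (approximately) consistent estimator of the standard deviation;
    # the asymptotic constant 1.4826 is a simplification (see the text)
    m = np.median(x)
    return consistency_constant * np.median(np.abs(x - m))

def gamma(x, y, p):
    # Quantile-specific effect size: quantile difference divided by the pooled MAD
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    pooled_mad = np.sqrt(((nx - 1) * mad(x) ** 2 + (ny - 1) * mad(y) ** 2) /
                         (nx + ny - 2))
    return (np.quantile(y, p) - np.quantile(x, p)) / pooled_mad
```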

For the normal distribution, Cohen’s d is approximately equal to $\gamma_{0.5}$:

$$ d = \frac{\overline{y}-\overline{x}}{s} \approx \frac{Q_{0.5}(y) - Q_{0.5}(x)}{\operatorname{PMAD}_{xy}} = \gamma_{0.5}. $$

Thus, $\gamma_{0.5}$ can be used as a robust alternative to the original Cohen’s d.
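A quick numerical sanity check, reusing the sketches above, illustrates this correspondence: for two large normal samples shifted by half a standard deviation, both estimates should land near the same value.

```python
rng = np.random.default_rng(1729)  # fixed seed, chosen arbitrarily
x = rng.normal(loc=0.0, scale=1.0, size=10_000)
y = rng.normal(loc=0.5, scale=1.0, size=10_000)
print(cohen_d(x, y))     # expected to be close to 0.5
print(gamma(x, y, 0.5))  # expected to be in the same ballpark
```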

Customization

There are several things that we could customize in the above equations.

Summary

There are three main ways to adopt the nonparametric Cohen’s d-consistent effect size:

  • An easy way
    If you want the simplest solution, just use the traditional quantile estimator (if $n$ is odd, the median is the middle element of the sorted sample; if $n$ is even, the median is the arithmetic average of the two middle elements of the sorted sample). The $\operatorname{MAD}$ consistency constant should be taken from the main table of this post.
  • A relatively easy way
    If you want a relatively simple but more efficient solution, use the trimmed modification of the Harrell-Davis quantile estimator and the $\operatorname{MAD}$ consistency constant from the main table of this post (a sketch of the underlying Harrell-Davis estimator follows this list).
  • A hard way
    If you want to get the most efficient solution, you should spend some time on research. First of all, you should explore all available options (you can find some of them by following the links below). Next, you should think about the properties of your data sets (what kind of distributions you have and what your typical sample sizes are). Finally, you should try different approaches with your data and check which one provides the most reliable results.
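As a starting point for the second option, here is a sketch of the classic (untrimmed) Harrell-Davis quantile estimator, which can be plugged in as $Q_p$ in the above equations; the trimmed modification and the matching consistency constants are covered in the linked posts. Only numpy and scipy are assumed:

```python
import numpy as np
from scipy.stats import beta

def harrell_davis(x, p):
    # Harrell-Davis estimate of the p-th quantile: a weighted sum of all order
    # statistics, with weights given by increments of the Beta((n+1)p, (n+1)(1-p)) CDF
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = beta.cdf(np.arange(n + 1) / n, a, b)  # CDF evaluated at 0, 1/n, ..., 1
    return np.dot(np.diff(cdf), x)
```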

Further reading