Customization of the nonparametric Cohen's d-consistent effect size
One year ago, I published a post called “Nonparametric Cohen's d-consistent effect size” (by Andrey Akinshin, 2020-06-25).
During this year, I got a lot of internal and external feedback from my own statistical experiments and people who tried to use the suggested approach. It seems that the nonparametric version of Cohen’s d works much better with real-life not-so-normal data. While the classic Cohen’s d based on the non-robust arithmetic mean and the non-robust standard deviation can be easily corrupted by a single outlier, my approach is much more resistant to unexpected extreme values. Also, it allows exploring the difference between specific quantiles of considered samples, which can be useful in the non-parametric case.
However, I wasn’t satisfied with the results of all of my experiments. While I still like the basic idea (replace the mean with the median; replace the standard deviation with the median absolute deviation), it turned out that the final results heavily depend on the chosen quantile estimator. To be more specific, the original Harrell-Davis quantile estimator is not always optimal; in most cases, it’s better to replace it with its trimmed modification. However, the particular choice of the quantile estimator depends on the situation. Also, the consistency constant for the median absolute deviation should be adjusted according to the current sample size and the chosen quantile estimator. Of course, the median absolute deviation can also be replaced by other dispersion estimators that are consistent estimators of the standard deviation.
In this post, I want to give a brief overview of possible customizations of the suggested metrics.
The generic equations
Let’s say we have two samples $x = \{ x_1, x_2, \ldots, x_{n_x} \}$ and $y = \{ y_1, y_2, \ldots, y_{n_y} \}$. The “classic” Cohen’s d can be defined as follows:
$$ d = \frac{\overline{y}-\overline{x}}{s} $$
where $s$ is the pooled standard deviation:
$$ s = \sqrt{\frac{(n_x - 1) s^2_x + (n_y - 1) s^2_y}{n_x + n_y - 2}}. $$
And here is the quantile-specific effect size suggested in the previous post:
$$ \gamma_p = \frac{Q_p(y) - Q_p(x)}{\operatorname{PMAD}_{xy}} $$
where $Q_p$ is an estimator of the $p^\textrm{th}$ quantile, and $\operatorname{PMAD}_{xy}$ is the pooled median absolute deviation:
$$ \operatorname{PMAD}_{xy} = \sqrt{\frac{(n_x - 1) \operatorname{MAD}^2_x + (n_y - 1) \operatorname{MAD}^2_y}{n_x + n_y - 2}}, $$
$\operatorname{MAD}_x$ and $\operatorname{MAD}_y$ are the median absolute deviations of $x$ and $y$:
$$ \operatorname{MAD}_x = C_{n_x} \cdot Q_{0.5}(|x_i - Q_{0.5}(x)|), \quad \operatorname{MAD}_y = C_{n_y} \cdot Q_{0.5}(|y_i - Q_{0.5}(y)|), $$
$C_{n_x}$ and $C_{n_y}$ are consistency constants that make $\operatorname{MAD}$ a consistent estimator of the standard deviation.
For the normal distribution, Cohen’s d is approximately equal to $\gamma_{0.5}$:
$$ d = \frac{\overline{y}-\overline{x}}{s} \approx \frac{Q_{0.5}(y) - Q_{0.5}(x)}{\operatorname{PMAD}_{xy}} = \gamma_{0.5}. $$
Thus, $\gamma_{0.5}$ can be used as a robust alternative to the original Cohen’s d.
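The equations above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function names (`quantile`, `mad`, `pooled_mad`, `gamma_effect_size`) are hypothetical, the "traditional" quantile estimator is assumed to be linear interpolation between order statistics, and the asymptotic constant $C_n = 1.4826$ is used for every sample size.

```python
import math
import statistics


def quantile(sample, p):
    """Assumed 'traditional' estimator: linear interpolation between order statistics."""
    xs = sorted(sample)
    h = (len(xs) - 1) * p              # fractional index of the p-th quantile
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def mad(sample, c=1.4826):
    """Median absolute deviation scaled by the consistency constant c."""
    m = statistics.median(sample)
    return c * statistics.median(abs(v - m) for v in sample)


def pooled_mad(x, y, c=1.4826):
    """PMAD_xy: pooled MAD, built the same way as the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * mad(x, c) ** 2 + (ny - 1) * mad(y, c) ** 2) / (nx + ny - 2)
    return math.sqrt(pooled)


def gamma_effect_size(x, y, p=0.5, c=1.4826):
    """gamma_p = (Q_p(y) - Q_p(x)) / PMAD_xy."""
    return (quantile(y, p) - quantile(x, p)) / pooled_mad(x, y, c)


x = [1, 2, 3, 4, 5, 6, 7]
y = [v + 2 for v in x]
x_outlier = [1, 2, 3, 4, 5, 6, 1000]   # same sample with one extreme value

# The single outlier changes neither the median nor the MAD of x,
# so the two printed values should be identical.
print(gamma_effect_size(x, y))
print(gamma_effect_size(x_outlier, y))
```

The sketch does not guard against a zero pooled MAD (for example, when more than half of each sample consists of identical values); real code would need to handle that case.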
Customization
There are several things that we could customize in the above equations.
Quantile estimator
The first thing we should define is the quantile estimator $Q_p$. In many cases (especially in the case of light-tailed distributions), the traditional quantile estimator doesn’t provide optimal statistical efficiency. The Harrell-Davis quantile estimator provides great statistical efficiency in many light-tailed cases. Unfortunately, it could be inefficient in the case of heavy-tailed distributions. To fix this problem, we can consider trimmed and winsorized modifications of the Harrell-Davis quantile estimator. These modifications are more robust, and they have higher efficiency in the heavy-tailed case. For medium-size samples, we can use a trimming/winsorizing strategy based on the highest density interval of the Beta distribution. However, in some corner cases, advanced strategies may be required. It is worth noting that in all of my experiments, the trimming modification works better than the winsorizing modification. For some situations, we can also consider using the Sfakianakis-Verginis quantile estimator or the Navruz-Özdemir quantile estimator. However, if you don’t have prior knowledge about the distribution form and the sample size, the Harrell-Davis quantile estimator still seems to be a good default option.
Consistency constant
When we define the median absolute deviation, we should also define the consistency constant $C_n$. In the previous post, I suggested using $C_n = 1.4826$, but this value works only for large $n$. If we want to get an unbiased standard deviation estimator based on $\operatorname{MAD}_n$ for small $n$, we should adjust the consistency constant. The adjustment depends on the chosen quantile estimator. I wrote a few blog posts that show how to choose the consistency constant for the traditional quantile estimator and how to choose the consistency constant for the Harrell-Davis quantile estimator.
The median absolute deviation can also be replaced by other dispersion estimators that are consistent with the standard deviation, for example:
- $$ \operatorname{Shamos} = C_n \cdot \operatorname{median}_{i < j} (|x_i - x_j|); \quad C_{\infty} \approx 1.0484 $$
- $$ \operatorname{Rousseeuw-Croux} = C_n \cdot \operatorname{median}_{i} \Big( \operatorname{median}_{j} \big( |x_i-x_j| \big) \Big); \quad C_{\infty} \approx 1.1926 $$
I guess these estimators might provide better statistical efficiency in some cases, but I didn’t perform any experiments with them because straightforward implementations are computationally expensive: they process all pairwise differences, which takes at least $O(n^2)$ operations.
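For illustration, here is a naive Python sketch that follows the two formulas literally, with the asymptotic constants $C_\infty$ used in place of $C_n$. The function names are hypothetical. One assumption is worth flagging: the Rousseeuw-Croux formula does not say whether $j = i$ is included in the inner median; this sketch excludes it and uses ordinary medians throughout.

```python
import random
import statistics
from itertools import combinations


def shamos(sample, c=1.0484):
    """Scaled median of all pairwise absolute differences |x_i - x_j|, i < j."""
    return c * statistics.median(abs(a - b) for a, b in combinations(sample, 2))


def rousseeuw_croux(sample, c=1.1926):
    """Scaled median over i of the median over j != i of |x_i - x_j|.

    Excluding j == i is an assumption of this sketch.
    """
    inner = [
        statistics.median(abs(xi - xj) for j, xj in enumerate(sample) if j != i)
        for i, xi in enumerate(sample)
    ]
    return c * statistics.median(inner)


# On a standard normal sample, both estimates should land near
# the true standard deviation of 1.
rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(501)]
print(shamos(data), rousseeuw_croux(data), statistics.stdev(data))
```

Both functions build a quadratic number of pairwise differences, so this version is only practical for small and medium samples.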
Summary
There are three main ways to adopt the nonparametric Cohen’s d-consistent effect size:
- An easy way
  If you want to get the simplest solution, just use the traditional quantile estimator (if $n$ is odd, the median is the middle element of the sorted sample; if $n$ is even, the median is the arithmetic average of the two middle elements of the sorted sample). The $\operatorname{MAD}$ consistency constant should be taken from the main table of this post.
- A relatively easy way
  If you want to get a relatively simple but more efficient solution, use the trimmed modification of the Harrell-Davis quantile estimator and the $\operatorname{MAD}$ consistency constant from the main table of this post.
- A hard way
  If you want to get the most efficient solution, you should spend some time on research. First of all, you should explore all available options (you can find some by following the links below). Next, you should think about the properties of your data sets (what kind of distribution you have, and what your typical sample sizes are). Finally, you should try different approaches with your data and check which one provides the most reliable results.
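The first two options rely on tabulated finite-sample consistency constants from the referenced posts. As a rough illustration of where such numbers can come from, the sketch below estimates $C_n$ by Monte Carlo simulation for the traditional median: it draws standard normal samples (true standard deviation 1), averages the unscaled MAD, and takes the reciprocal. The function names, the iteration count, and the seed are assumptions made for this example; this is not necessarily the procedure used to build the tables in those posts.

```python
import random
import statistics


def raw_mad(sample):
    """Unscaled median absolute deviation (consistency constant of 1)."""
    m = statistics.median(sample)
    return statistics.median(abs(v - m) for v in sample)


def estimate_consistency_constant(n, iterations=5000, seed=1):
    """Estimate C_n such that E[C_n * raw_mad] = 1 for standard normal samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += raw_mad(sample)
    # The mean raw MAD is total / iterations, so C_n is its reciprocal.
    return iterations / total


# For small n the estimates should come out above the asymptotic 1.4826
# and move toward it as n grows.
for n in (5, 10, 50):
    print(n, estimate_consistency_constant(n))
```

The same simulation idea can be reused for the "hard way": swap in a different quantile or dispersion estimator, or a different distribution, and compare how the resulting estimates behave at your typical sample sizes.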
Further reading
- Effect sizes
  - Nonparametric Cohen's d-consistent effect size (Andrey Akinshin, 2020-06-25).
  - A single outlier could completely distort your Cohen's d value (Andrey Akinshin, 2021-01-26). Comparison of classic Cohen's d and its non-parametric alternative on distributions with high outliers.
  - Comparing distribution quantiles using gamma effect size (Andrey Akinshin, 2021-02-02). Two case studies which show how to compare distributions using the gamma effect size.
- Quantile estimators
  - Navruz-Özdemir quantile estimator.
  - Sfakianakis-Verginis Quantile Estimator.
  - Winsorized modification of the Harrell-Davis quantile estimator (Andrey Akinshin, 2021-03-02). A modified version of the Harrell-Davis quantile estimator with better robustness.
  - Trimmed modification of the Harrell-Davis quantile estimator (Andrey Akinshin, 2021-03-30). A modified version of the Harrell-Davis quantile estimator with better robustness.
  - Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategies (Andrey Akinshin, 2021-05-25).
- Statistical efficiency
  - Efficiency of the Harrell-Davis quantile estimator (Andrey Akinshin, 2021-03-23). A set of simulation studies that calculate the efficiency of the Harrell-Davis quantile estimator for different distributions.
  - Efficiency of the winsorized and trimmed Harrell-Davis quantile estimators (Andrey Akinshin, 2021-04-06). A set of simulation studies that calculate the efficiency of the winsorized and trimmed Harrell-Davis quantile estimator for different distributions.
  - Robust alternative to statistical efficiency (Andrey Akinshin, 2021-06-01).
- Dispersion
  - Quantile absolute deviation: estimating statistical dispersion around quantiles (Andrey Akinshin, 2020-12-01). Quantile absolute deviation allows estimating statistical dispersion around the given quantile.
  - Unbiased median absolute deviation (Andrey Akinshin, 2021-02-09). The finite-sample bias-correction factors for the median absolute deviation which make it a consistent estimator for the standard deviation.
  - Unbiased median absolute deviation based on the Harrell-Davis quantile estimator (Andrey Akinshin, 2021-02-16). The finite-sample bias-correction factors for the median absolute deviation which make it a consistent estimator for the standard deviation (improved version based on the Harrell-Davis quantile estimator).