The median absolute deviation value of the Gumbel distribution
The Gumbel distribution is not only a useful model in the extreme value theory, but it’s also a nice example of a slightly right-skewed distribution (skewness $\approx 1.14$). Here is its density plot:
In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests.
I also prefer the median absolute deviation (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of non-parametric distributions.
Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution.
I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself.
In this post, you will find a short derivation and the result (spoiler: the exact value is 0.767049251325708 * β
).
The general approach of the MAD derivation is common for most distributions, so it can be easily reused.
General approach
The median absolute deviation of a sample has a simple definition:
$$ \mathcal{MAD}_0 = \textrm{Median}(|x_i - \textrm{Median}(x_i)|). $$The exact formula for the distribution looks similar:
$$ \mathcal{MAD}_0 = \textrm{Median}(|X - \textrm{Median}(X)|). $$It’s convenient to use a scaled version of MAD in many practical cases to make it a consistent estimator for the standard deviation estimation. It’s often denoted as $\mathcal{MAD}$ without any clarification. To avoid misinterpretation and highlight that we are working with a non-scaled version of MAD, we denote it as $\mathcal{MAD}_0$ instead of just $\mathcal{MAD}$.
We also denote the median distribution value as $M = \textrm{Median}(X)$. Thus, the $\mathcal{MAD}_0$ definition can be rewritten as
$$ \mathcal{MAD}_0 = \textrm{Median}(|X - M|). $$Since $\mathcal{MAD}_0$ is the median of $|X - M|$, we can conclude that the range $[M - \mathcal{MAD}_0; M + \mathcal{MAD}_0]$ contains $50\%$ of the distribution:
This can be expressed using the distribution CDF (let’s call it $F$):
$$ \begin{equation} \label{eq:main}\tag{1} F(M + \mathcal{MAD}_0) - F(M - \mathcal{MAD}_0) = 0.5. \end{equation} $$Equation ($\ref{eq:main}$) is an important property of the median absolute deviation, which we will use to calculate it’s exact value.
Gumbel distribution
The Gumbel distribution is parametrized by $\mu$ (location) and $\beta$ (scale). For these parameters, the median value and the CDF are well-known (see [HoSM], 1.3.6.6.16):
$$ M = \mu - \beta \cdot \ln(\ln(2)), \quad F(x) = e^{-e^{-(x-\mu)/\beta}}. $$The $\mathcal{MAD}_0$ value is directly proportional to the scale parameter $\beta$, so let’s introduce an auxiliary variable $p$ for the “descaled version” of $\mathcal{MAD}_0$:
$$ p = \mathcal{MAD}_0 / \beta. $$Next, let’s express $F(M + \mathcal{MAD}_0)$ via $p$:
$$ \begin{split} F(M + \mathcal{MAD}_0) & = F(\mu - \beta \cdot \ln(\ln(2)) + p \beta) = \\ & = e^{-e^{-(\mu - \beta \cdot \ln(\ln(2)) + p \beta - \mu)/\beta}} = \\ & = e^{-e^{\ln(\ln(2)) - p}} = e^{-\ln(2) e^{-p}} = 0.5^{e^{-p}}, \end{split} $$In the same way, we can show that
$$ F(M - \mathcal{MAD}_0) = 0.5^{e^{p}}. $$Now, equation ($\ref{eq:main}$) can be transformed to
$$ 0.5^{e^{-p}} - 0.5^{e^{p}} = 0.5. $$This equation can be solved numerically. Here is a short R script which calculates the result:
library(rootSolve)
options(digits = 15)
f <- function(p) 0.5^exp(-p) - 0.5^exp(p) - 0.5
uniroot.all(f, c(-10, 10), tol = 1e-15)
The numerical solution is $p = 0.767049251325708 $. Since $p = \mathcal{MAD}_0 / \beta$, the exact solution looks as follows:
$$ \mathcal{MAD}_0 = 0.767049251325708 \beta. $$Now we can use it in experiments (copy-pastable version: 0.767049251325708 * β
).
Numerical simulation
Let’s double-check that our calculations are correct. Below you can see an R script that generates 1000 random samples from the Gumbel distribution with $\mu = 0,\; \beta = 1$ (1000 elements in each sample). Next, it calculates the MAD estimation for each sample using the Harrell-Davis quantile estimator for the median estimations (see harrell1982). After that, it calculates the median of all MAD estimations and prints the result.
library(evd)
library(Hmisc)
# Setting seed for rgumbel to achieve deterministic result
set.seed(38)
# Harrell-Davis-powered median
hdmedian <- function(x) as.numeric(hdquantile(x, 0.5))
# MAD value which is based on the Harrell-Davis-powered median
hdmad <- function(x) hdmedian(abs(x - hdmedian(x)))
# 1000 trails of MAD estimating for a sample from the Gubmel distribution
mads <- sapply(1:1000, function(i) hdmad(rgumbel(1000)))
# Print median of all trials
hdmedian(mads)
This script’s output is 0.7670284
which is close enough to the exact value 0.767049251325708
.
Conclusion
The Gumbel distribution can be used as a good model of a slightly right-skewed distribution. If we are playing with an algorithm that involves MAD estimations (e.g., MAD-based outlier detector), we can check the precision of our calculations using the exact MAD value which is $\mathcal{MAD}_0 = 0.767049251325708 \beta$.
The general approach can be reused for any other distributions with known median and CDF. You can get the exact $\mathcal{MAD}_0$ value by solving the equation $F(M + \mathcal{MAD}_0) - F(M - \mathcal{MAD}_0) = 0.5$ which is correct for the distribution of any form.
References
- [HoSM]
NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, October 6, 2020. https://doi.org/10.18434/M32189 - [Harrell1982]
Harrell, F.E. and Davis, C.E., 1982. A new distribution-free quantile estimator. Biometrika, 69(3), pp.635-640.
https://pdfs.semanticscholar.org/1a48/9bb74293753023c5bb6bff8e41e8fe68060f.pdf