## Misleading standard deviation

The standard deviation may be an extremely misleading metric. Even minor deviations from normality could make it completely unreliable and deceiving. Let me demonstrate this problem using an example.

Below you can see three density plots of some distributions. Could you guess their standard deviations?

The correct answers are \(1.0, 3.0, 11.0\). And here is a more challenging problem: could you match these values with the corresponding distributions?

Read more

## Unbiased median absolute deviation based on the Harrell-Davis quantile estimator

The median absolute deviation (\(\textrm{MAD}\)) is a robust measure of scale. In the previous post, I showed how to use the unbiased version of the \(\textrm{MAD}\) estimator as a robust alternative to the standard deviation. “Unbiasedness” means that such estimator’s expected value equals the true value of the standard deviation. Unfortunately, there is such thing as the bias–variance tradeoff: when we remove the bias of the \(\textrm{MAD}\) estimator, we increase its variance and mean squared error (\(\textrm{MSE}\)).

In this post, I want to suggest a more efficient unbiased \(\textrm{MAD}\) estimator. It’s also a consistent estimator for the standard deviation, but it has smaller \(\textrm{MSE}\). To build this estimator, we should replace the classic “straightforward” median estimator with the Harrell-Davis quantile estimator and adjust bias-correction factors. Let’s discuss this approach in detail.

Read more

## Unbiased median absolute deviation

The median absolute deviation (\(\textrm{MAD}\)) is a robust measure of scale. For distribution \(X\), it can be calculated as follows:

\[\textrm{MAD} = C \cdot \textrm{median}(|X - \textrm{median}(X)|) \]

where \(C\) is a constant scale factor. This metric can be used as a robust alternative to the standard deviation. If we want to use the \(\textrm{MAD}\) as a consistent estimator for the standard deviation under the normal distribution, we should set

\[C = C_{\infty} = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. \]

where \(\Phi^{-1}\) is the quantile function of the standard normal distribution (or the inverse of the cumulative distribution function). If \(X\) is the normal distribution, we get \(\textrm{MAD} = \sigma\) where \(\sigma\) is the standard deviation.

Not let’s consider a sample \(x = \{ x_1, x_2, \ldots x_n \}\). Let’s denote the median absolute deviation for a sample of size \(n\) as \(\textrm{MAD}_n\). The corresponding equation looks similar to the definition of \(\textrm{MAD}\) for a distribution:

\[\textrm{MAD}_n = C_n \cdot \textrm{median}(|x - \textrm{median}(x)|). \]

Let’s assume that \(\textrm{median}\) is the straightforward definition of the median (if \(n\) is odd, the median is the middle element of the sorted sample, if \(n\) is even, the median is the arithmetic average of the two middle elements of the sorted sample). We still can use \(C_n = C_{\infty}\) for extremely large sample sizes. However, for small \(n\), \(\textrm{MAD}_n\) becomes a biased estimator. If we want to get an unbiased version, we should adjust the value of \(C_n\).

In this post, we look at the possible approaches and learn the way to get the exact value of \(C_n\) that makes \(\textrm{MAD}_n\) unbiased estimator of the median absolute deviation for any \(n\).

Read more

## Comparing distribution quantiles using gamma effect size

There are several ways to describe the difference between two distributions. Here are a few examples:

- Effect sizes based on differences between means (e.g., Cohen’s d, Glass’ Δ, Hedges’ g)
- The shift and ration functions that estimate differences between matched quantiles.

In one of the previous post, I described the gamma effect size which is defined not for the mean but for quantiles. In this post, I want to share a few case studies that demonstrate how the suggested metric combines the advantages of the above approaches.

Read more

## A single outlier could completely distort your Cohen's d value

Cohen’s d is a popular way to estimate the effect size between two samples. It works excellent for perfectly normal distributions. Usually, people think that slight deviations from normality shouldn’t produce a noticeable impact on the result. Unfortunately, it’s not always true. In fact, a single outlier value can completely distort the result even in large samples.

In this post, I will present some illustrations for this problem and will show how to fix it.

Read more

## Better moving quantile estimations using the partitioning heaps

In one of the previous posts, I have discussed the Hardle-Steiger method.
This algorithm allows estimating the moving median
using \(O(L)\) memory and \(O(log(L))\) element processing complexity (where \(L\) is the window size).
Also, I have shown how to adapt this approach to estimate *any* moving quantile.

In this post, I’m going to present further improvements. The Hardle-Steiger method always returns the order statistics which is the \(k\textrm{th}\) smallest element from the sample. It means that the estimated quantile value always equals one of the last \(L\) observed numbers. However, many of the classic quantile estimators use two elements. For example, if we want to estimate the median for \(x = \{4, 5, 6, 7\}\), some estimators return \(5.5\) (which is the arithmetical mean of \(5\) and \(6\)) instead of \(5\) or \(6\) (which are order statistics).

Let’s learn how to implement a moving version of such estimators using the partitioning heaps from the Hardle-Steiger method.

Read more

## MP² quantile estimator: estimating the moving median without storing values

In one of the previous posts, I described the P² quantile estimator. It allows estimating quantiles on a stream of numbers without storing them. Such sequential (streaming/online) quantile estimators are useful in software telemetry because they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.

After the publication, I got a lot of questions about *moving* sequential quantile estimators.
Such estimators return quantile values not for the whole stream of numbers,
but only for the recent values.
So, I wrote another post about
a quantile estimator based on a partitioning heaps (inspired by the Hardle-Steiger method).
This algorithm gives you the exact value of any order statistics for the last \(L\) numbers
(\(L\) is known as the window size).
However, it requires \(O(L)\) memory, and it takes \(O(log(L))\) time to process each element.
This may be acceptable in some cases.
Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large \(L\).

In this post, I’m going to present a moving modification of the P² quantile estimator. Let’s call it MP² (moving P²). It requires \(O(1)\) memory, it takes \(O(1)\) to process each element, and it supports windows of any size. Of course, we have a trade-off with the estimation accuracy: it returns a quantile approximation instead of the exact order statistics. However, in most cases, the MP² estimations are pretty accurate from the practical point of view.

Let’s discuss MP² in detail!

Read more

## Case study: Accuracy of the MAD estimation using the Harrell-Davis quantile estimator (Gumbel distribution)

In some of my previous posts, I used the median absolute deviation (MAD) to describe the distribution dispersion:

- DoubleMAD outlier detector based on the Harrell-Davis quantile estimator
- Nonparametric Cohen’s d-consistent effect size
- Quantile absolute deviation: estimating statistical dispersion around quantiles

The MAD estimation depends on the chosen median estimator: we may get different MAD values with different median estimators. To get better accuracy, I always encourage readers to use the Harrell-Davis quantile estimator instead of the classic Type 7 quantile estimator.

In this case study, I decided to compare these two quantile estimators using the Gumbel distribution (it’s a good model for slightly right-skewed distributions). According to the performed Monte Carlo simulation, the Harrell-Davis quantile estimator always has better accuracy:

Read more

## Fast implementation of the moving quantile based on the partitioning heaps

Imagine you have a time series. Let’s say, after each new observation, you want to know an “average” value across the last \(L\) observations. Such a metric is known as a moving average (or rolling/running average).

The most popular moving average example is the moving mean. It’s easy to efficiently implement this metric. However, it has a major drawback: it’s not robust. Outliers can easily spoil the moving mean and transform it into a meaningless and untrustable metric.

Fortunately, we have a good alternative: the moving median. Typically, it generates a stable and smooth series of values. In the below figure, you can see the difference between the moving mean and the moving median on noisy data.

The moving median also has a drawback: it’s not easy to efficiently implement it.
Today we going to discuss the Hardle-Steiger method to estimate the median
(memory: \(O(L)\), element processing complexity: \(O(log(L))\), median estimating complexity: \(O(1)\)).
Also, we will learn how to calculate *the moving quantiles* based on this method.

In this post, you will find the following:

- An overview of the Hardle-Steiger method
- A simple way to implement the Hardle-Steiger method
- Moving quantiles inspired by the Hardle-Steiger method
- How to process initial elements
- Reference C# implementation

Read more

## Coverage of quantile confidence intervals

There is a common misunderstanding
that a 95% confidence interval is an interval that covers the true parameter value with 95% probability.
Meanwhile, the correct definition assumes that
the true parameter value will be covered by 95% of 95% confidence intervals *in the long run*.
These two statements sound similar, but there is a huge difference between them.
95% in this context is not a property of a single confidence interval.
Once you get a calculated interval, it may cover the true value (100% probability) or
it may don’t cover it (0% probability).
In fact, 95% is a *prediction* about the percentage of *future* confidence intervals
that cover the true value *in the long run*.

However, even if you know the correct definition, you still may experience some troubles. The first thing people usually forgot is the “long run” part. For example, if we collected 100 samples and calculated a 95% confidence interval of a parameter for each of them, we shouldn’t expect that 95 of these intervals cover the true parameter value. In fact, we can observe a situation when none of these intervals covers the true value. Of course, this is an unlikely event, but if you automatically perform thousands of different experiments, you will definitely get some extreme situations.

The second thing that may create trouble is the “prediction” part. If weather forecasters predicted that it will rain tomorrow, this does not mean that it will rain tomorrow. The same works for statistical predictions. The actual prediction reliability may depend on many factors. If you estimate confidence intervals around the mean for the normal distribution, you are most likely safe. However, if you estimate confidence intervals around quantiles for non-parametric distributions, you should care about the following things:

- The used approach to estimate confidence intervals
- The underlying distribution
- The sample size
- The position of the target quantile

I have already showed how to estimate the confidence interval around the given quantile using the Maritz-Jarrett method. It’s time to verify the reliability of this approach. In this post, I’m going to show some Monte-Carlo simulations that evaluate the coverage percentage in different situations.

Read more