Middle non-zero quantile absolute deviation

Median absolute deviation (\(\operatorname{MAD}\)) around the median is a popular robust measure of statistical dispersion. Unfortunately, if we work with discrete distributions, we could get zero \(\operatorname{MAD}\) values. It could bring some problems if we use \(\operatorname{MAD}\) as a denominator. Such a problem is also relevant to some other quantile-based measures of dispersion like interquartile range (\(\operatorname{IQR}\)).

This problem could be solved using the quantile absolute deviation around the median. However, it’s not always clear how to choose the right quantile to estimate. In this post, I’m going to suggest a choosing approach that is consistent with the classic \(\operatorname{MAD}\) under continuous distributions (and samples without tied values).

Read more

Unbiased median absolute deviation based on the trimmed Harrell-Davis quantile estimator

The median absolute deviation (\(\operatorname{MAD}\)) is a robust measure of scale. For a sample \(x = \{ x_1, x_2, \ldots, x_n \}\), it’s defined as follows:

\[\operatorname{MAD}_n = C_n \cdot \operatorname{median}(|x - \operatorname{median}(x)|) \]

where \(\operatorname{median}\) is a median estimator, \(C_n\) is a scale factor. Using the right scale factor, we can use \(\operatorname{MAD}\) as a consistent estimator for the estimation of the standard deviation under the normal distribution. For huge samples, we can use the asymptotic value of \(C_n\) which is

\[C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. \]

For small samples, we should use adjusted values \(C_n\) which depend on the sample size. However, \(C_n\) depends not only on the sample size but also on the median estimator. I have already covered how to obtain this values for the traditional median estimator and the Harrell-Davis median estimator. It’s time to get the \(C_n\) values for the trimmed Harrell-Davis median estimator.

Read more

Median absolute deviation vs. Shamos estimator

There are multiple ways to estimate statistical dispersion. The standard deviation is the most popular one, but it’s not robust: a single outlier could heavily corrupt the results. Fortunately, we have robust measures of dispersions like the median absolute deviation and the Shamos estimator. In this post, we perform numerical simulations and compare these two estimators on different distributions and sample sizes.

Read more

Moving extended P² quantile estimator

In the previous posts, I discussed the P² quantile estimator (a sequential estimator which takes \(O(1)\) memory and estimates a single predefined quantile), the moving P² quantile estimator (a moving modification of P² which estimates quantiles within the moving window), and the extended P² quantile estimator (a sequential estimator which takes \(O(m)\) memory and estimates \(m\) predefined quantiles).

Now it’s time to build the moving modification of the extended P² quantile estimator which estimates \(m\) predefined quantiles using \(O(m)\) memory within the moving window.

Read more

Extended P² quantile estimator

I already covered the P² quantile estimator and its possible implementation improvements in several blog posts. This sequential estimator uses \(O(1)\) memory and allows estimating a single predefined quantile. Now it’s time to discuss the extended P² quantile estimator that allows estimating multiple predefined quantiles. This extended version was suggested in the paper “Simultaneous estimation of several percentiles”. In this post, we briefly discuss the approach from this paper and how we can improve its implementation.

Read more

P² quantile estimator marker adjusting order

I have already written a few blog posts about the P² quantile estimator (which is a sequential estimator that uses \(O(1)\) memory):

In this post, we continue improving the P² implementation so that it gives better estimations for streams with a small number of elements.

Read more

P² quantile estimator initialization strategy

Update: the estimator accuracy could be improved using a bunch of patches.

The P² quantile estimator is a sequential estimator that uses \(O(1)\) memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I have already written a few blog posts about it:

I tried this estimator in various contexts, and it shows pretty decent results. However, recently I stumbled on a corner case: if we want to estimate extreme quantile (\(p < 0.1\) or \(p > 0.9\)), this estimator provides inaccurate results on small number streams (\(n < 10\)). While it looks like a minor issue, it would be nice to fix it. In this post, we briefly discuss choosing a better initialization strategy to workaround this problem.

Read more

Misleading geometric mean

There are multiple ways to compute the “average” value of an array of numbers. One of such ways is the geometric mean. For a sample \(x = \{ x_1, x_2, \ldots, x_n \}\), the geometric means is defined as follows:

\[\operatorname{GM}(x) = \sqrt[n]{x_1 x_2 \ldots x_n} \]

This approach is widely recommended for some specific tasks. Let’s say we want to compare the performance of two machines \(M_x\) and \(M_y\). In order to do this, we design a set of benchmarks \(b = \{b_1, b_2, \ldots, b_n \}\) and obtain two sets of measurements \(x = \{ x_1, x_2, \ldots, x_n \}\) and \(y = \{ y_1, y_2, \ldots, y_n \}\). Once we have these two samples, we may have a desire to express the difference between two machines as a single number and get a conclusion like “Machine \(M_y\) works two times faster than \(M_x\).” I think that this approach is flawed because such a difference couldn’t be expressed as a single number: the result heavily depends on the workloads that we analyze. For example, imagine that \(M_x\) is a machine with HDD and fast CPU, \(M_y\) is a machine with SSD and slow CPU. In this case, \(M_x\) could be faster on CPU-bound workloads while \(M_y\) could be faster on disk-bound workloads. I really like this summary from “Notes on Calculating Computer Performance” by Bruce Jacob and Trevor Mudge (in the same paper, the authors criticize the approach with the geometric mean):

Performance is therefore not a single number, but really a collection of implications. It is nothing more or less than the measure of how much time we save running our tests on the machines in question. If someone else has similar needs to ours, our performance numbers will be useful to them. However, two people with different sets of criteria will likely walk away with two completely different performance numbers for the same machine.

However, some other authors (e.g., “How not to lie with statistics: the correct way to summarize benchmark results”) actually recommend using the geometric mean to get such a number that describes the performance ratio of \(M_x\) and \(M_y\). I have to admit that the geometric mean could provide a reasonable result in some simple cases. Indeed, on normalized numbers, it works much better than the arithmetic mean (that provides meaningless result) because of its nice property: \(\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)\). However, it doesn’t work properly in the general case. Firstly, the desire to express the difference between two machines is vicious: the result heavily depends on the chosen workloads. Secondly, the performance of a single benchmark \(b_i\) couldn’t be described as a single number \(x_i\): we should consider the whole performance distributions. In order to describe the difference between two distributions, we could consider the shift and ration functions (that work much better than the shift and ratio distributions).

Even if you consider a pretty homogenous set of benchmarks and all the distributions are pretty narrow, the geometric mean has severe drawbacks that you should keep in mind. In this post, I briefly cover some of these drawbacks and highlight problems that you may have if you use this metric.

Read more

Matching quantile sets using likelihood based on the binomial coefficients

Let’s say we have a distribution \(X\) that is given by its \(s\)-quantile values:

\[q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1}) \]

where \(Q_X\) is the quantile function of \(X\), \(p_j = j / s\).

We also have a sample \(y = \{y_1, y_2, \ldots, y_n \}\) that is given by its \(s\)-quantile estimations:

\[q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}), \]

where \(Q_y\) is the quantile estimation function for sample \(y\). We also assume that \(q_{y_0} = \min(y)\), \(q_{y_s} = \max(y)\).

We want to know the likelihood of “\(y\) is drawn from \(X\)”. In this post, I want to suggest a nice way to do this using the binomial coefficients.

Read more

Ratio function vs. ratio distribution

Let’s say we have two distributions \(X\) and \(Y\). In the previous post, we discussed how to express the “absolute difference” between them using the shift function and the shift distribution. Now let’s discuss how to express the “relative difference” between them. This abstract term also could be expressed in various ways. My favorite approach is to build the ratio function. In order to do this, for each quantile \(p\), we should calculate \(Q_Y(p)/Q_X(p)\) where \(Q\) is the quantile function. However, some people prefer using the ratio distribution \(Y/X\). While both approaches may provide similar results for narrow positive non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

Read more