There are multiple ways to compute the “average” value of an array of numbers. One of such ways is the geometric mean. For a sample $$x = \{ x_1, x_2, \ldots, x_n \}$$, the geometric means is defined as follows:

$\operatorname{GM}(x) = \sqrt[n]{x_1 x_2 \ldots x_n}$

This approach is widely recommended for some specific tasks. Let’s say we want to compare the performance of two machines $$M_x$$ and $$M_y$$. In order to do this, we design a set of benchmarks $$b = \{b_1, b_2, \ldots, b_n \}$$ and obtain two sets of measurements $$x = \{ x_1, x_2, \ldots, x_n \}$$ and $$y = \{ y_1, y_2, \ldots, y_n \}$$. Once we have these two samples, we may have a desire to express the difference between two machines as a single number and get a conclusion like “Machine $$M_y$$ works two times faster than $$M_x$$.” I think that this approach is flawed because such a difference couldn’t be expressed as a single number: the result heavily depends on the workloads that we analyze. For example, imagine that $$M_x$$ is a machine with HDD and fast CPU, $$M_y$$ is a machine with SSD and slow CPU. In this case, $$M_x$$ could be faster on CPU-bound workloads while $$M_y$$ could be faster on disk-bound workloads. I really like this summary from “Notes on Calculating Computer Performance” by Bruce Jacob and Trevor Mudge (in the same paper, the authors criticize the approach with the geometric mean):

Performance is therefore not a single number, but really a collection of implications. It is nothing more or less than the measure of how much time we save running our tests on the machines in question. If someone else has similar needs to ours, our performance numbers will be useful to them. However, two people with different sets of criteria will likely walk away with two completely different performance numbers for the same machine.

However, some other authors (e.g., “How not to lie with statistics: the correct way to summarize benchmark results”) actually recommend using the geometric mean to get such a number that describes the performance ratio of $$M_x$$ and $$M_y$$. I have to admit that the geometric mean could provide a reasonable result in some simple cases. Indeed, on normalized numbers, it works much better than the arithmetic mean (that provides meaningless result) because of its nice property: $$\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)$$. However, it doesn’t work properly in the general case. Firstly, the desire to express the difference between two machines is vicious: the result heavily depends on the chosen workloads. Secondly, the performance of a single benchmark $$b_i$$ couldn’t be described as a single number $$x_i$$: we should consider the whole performance distributions. In order to describe the difference between two distributions, we could consider the shift and ration functions (that work much better than the shift and ratio distributions).

Even if you consider a pretty homogenous set of benchmarks and all the distributions are pretty narrow, the geometric mean has severe drawbacks that you should keep in mind. In this post, I briefly cover some of these drawbacks and highlight problems that you may have if you use this metric.

## Matching quantile sets using likelihood based on the binomial coefficients

Let’s say we have a distribution $$X$$ that is given by its $$s$$-quantile values:

$q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1})$

where $$Q_X$$ is the quantile function of $$X$$, $$p_j = j / s$$.

We also have a sample $$y = \{y_1, y_2, \ldots, y_n \}$$ that is given by its $$s$$-quantile estimations:

$q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}),$

where $$Q_y$$ is the quantile estimation function for sample $$y$$. We also assume that $$q_{y_0} = \min(y)$$, $$q_{y_s} = \max(y)$$.

We want to know the likelihood of “$$y$$ is drawn from $$X$$”. In this post, I want to suggest a nice way to do this using the binomial coefficients.

## Ratio function vs. ratio distribution

Let’s say we have two distributions $$X$$ and $$Y$$. In the previous post, we discussed how to express the “absolute difference” between them using the shift function and the shift distribution. Now let’s discuss how to express the “relative difference” between them. This abstract term also could be expressed in various ways. My favorite approach is to build the ratio function. In order to do this, for each quantile $$p$$, we should calculate $$Q_Y(p)/Q_X(p)$$ where $$Q$$ is the quantile function. However, some people prefer using the ratio distribution $$Y/X$$. While both approaches may provide similar results for narrow positive non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

## Shift function vs. shift distribution

Let’s say we have two distributions $$X$$ and $$Y$$, and we want to express the “absolute difference” between them. This abstract term could be expressed in various ways. My favorite approach is to build the Doksum’s shift function. In order to do this, for each quantile $$p$$, we should calculate $$Q_Y(p)-Q_X(p)$$ where $$Q$$ is the quantile function. However, some people prefer using the shift distribution $$Y-X$$. While both approaches may provide similar results for narrow non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

## Preprint announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'

Update: the final paper was published in Communications in Statistics - Simulation and Computation (DOI: 10.1080/03610918.2022.2050396).

Since the beginning of this year, I have been working on building a quantile estimator that provides an optimal trade-off between statistical efficiency and robustness. Finally, I have built such an estimator. A paper preprint is available on arXiv: arXiv:2111.11776 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-thdqe. You can cite it as follows:

• Andrey Akinshin (2021) Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width, arXiv:2111.11776

## Non-normal median sampling distribution

Let’s consider the classic sample median. If a sample is sorted and the number of sample elements is odd, the median is the middle element. In the case of an even number of sample elements, the median is an arithmetic average of the two middle elements.

Now let’s say we randomly take many samples from the same distribution and calculate the median for each of them. Next, we build a sampling distribution based on these median values. There is a well-known fact that this distribution is asymptotically normal with mean $$M$$ and variance $$1/(4nf^2(M))$$, where $$n$$ is the number of elements in samples, $$f$$ is the probability density function of the original distribution, and $$M$$ is the true median of the original distribution.

Unfortunately, if we try to build such sampling distributions in practice, we may see that they are not always normal. There are some corner cases that prevent us from using the normal model in general. If you implement general routines that analyze the median behavior, you should keep such cases in mind. In this post, we briefly talk about some of these cases.

I already discussed misleadingness of such metrics like standard deviation and skewness. It’s time to discuss misleadingness of the measure of tailedness: kurtosis (which, sometimes, could be incorrectly interpreted as a measure of peakedness). Typically, the concept of kurtosis is explained with the help of images like this:

Unfortunately, the raw kurtosis value may provide wrong insights about distribution properties. In this post, we briefly discuss the sources of its misleadingness:

• There are multiple definitions of kurtosis. The most significant confusion arises between “kurtosis” and “excess kurtosis,” but there are other definitions of this measure.
• Kurtosis may work fine for unimodal distributions, but it performs not so clear for multimodal distributions.
• The classic definition of kurtosis is not robust: it could be easily spoiled by extreme outliers.

Skewness is a commonly used measure of the asymmetry of the probability distributions. A typical skewness interpretation comes down to an image like this:

It looks extremely simple: using the skewness sign, we get an idea of the distribution form and the arrangement of the mean and the median. Unfortunately, it doesn’t always work as expected. Skewness estimation could be a highly misleading metric (even more misleading than the standard deviation). In this post, I discuss four sources of its misleadingness:

• “Skewness” is a generic term; it has multiple definitions. When a skewness value is presented, you can’t always guess the underlying equation without additional details.
• Skewness is “designed” for unimodal distributions; it’s meaningless in the case of multimodality.
• Most default skewness definitions are not robust: a single outlier could completely distort the skewness value.
• We can’t make conclusions about the locations of the mean and the median based on the skewness sign.

## Greenwald-Khanna quantile estimator

The Greenwald-Khanna quantile estimator is a classic sequential quantile estimator which has the following features:

• It allows estimating quantiles with respect to the given precision $$\epsilon$$.
• It requires $$O(\frac{1}{\epsilon} log(\epsilon N))$$ memory in the worst case.
• It doesn’t require knowledge of the total number of elements in the sequence and the positions of the requested quantiles.

In this post, I briefly explain the basic idea of the underlying data structure, and share a copy-pastable C# implementation. At the end of the post, I discuss some important implementation decisions that are unclear from the original paper, but heavily affect the estimator accuracy.

The P² quantile estimator is a sequential estimator that uses $$O(1)$$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I already wrote a blog post about this approach and added its implementation in perfolizer. Recently, I got a bug report that revealed a flaw of the original paper. In this post, I’m going to briefly discuss this issue and the corresponding fix.