## Extended P² quantile estimator

I already covered the P² quantile estimator and its possible implementation improvements in several blog posts. This sequential estimator uses $$O(1)$$ memory and allows estimating a single predefined quantile. Now it’s time to discuss the extended P² quantile estimator that allows estimating multiple predefined quantiles. This extended version was suggested in the paper “Simultaneous estimation of several percentiles”. In this post, we briefly discuss the approach from this paper and how we can improve its implementation.

## P² quantile estimator marker adjusting order

I have already written a few blog posts about the P² quantile estimator (which is a sequential estimator that uses $$O(1)$$ memory):

In this post, we continue improving the P² implementation so that it gives better estimations for streams with a small number of elements.

## P² quantile estimator initialization strategy

Update: the estimator accuracy could be improved using a bunch of patches.

The P² quantile estimator is a sequential estimator that uses $$O(1)$$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I have already written a few blog posts about it:

I tried this estimator in various contexts, and it shows pretty decent results. However, recently I stumbled on a corner case: if we want to estimate extreme quantile ($$p < 0.1$$ or $$p > 0.9$$), this estimator provides inaccurate results on small number streams ($$n < 10$$). While it looks like a minor issue, it would be nice to fix it. In this post, we briefly discuss choosing a better initialization strategy to workaround this problem.

There are multiple ways to compute the “average” value of an array of numbers. One of such ways is the geometric mean. For a sample $$x = \{ x_1, x_2, \ldots, x_n \}$$, the geometric means is defined as follows:

$\operatorname{GM}(x) = \sqrt[n]{x_1 x_2 \ldots x_n}$

This approach is widely recommended for some specific tasks. Let’s say we want to compare the performance of two machines $$M_x$$ and $$M_y$$. In order to do this, we design a set of benchmarks $$b = \{b_1, b_2, \ldots, b_n \}$$ and obtain two sets of measurements $$x = \{ x_1, x_2, \ldots, x_n \}$$ and $$y = \{ y_1, y_2, \ldots, y_n \}$$. Once we have these two samples, we may have a desire to express the difference between two machines as a single number and get a conclusion like “Machine $$M_y$$ works two times faster than $$M_x$$.” I think that this approach is flawed because such a difference couldn’t be expressed as a single number: the result heavily depends on the workloads that we analyze. For example, imagine that $$M_x$$ is a machine with HDD and fast CPU, $$M_y$$ is a machine with SSD and slow CPU. In this case, $$M_x$$ could be faster on CPU-bound workloads while $$M_y$$ could be faster on disk-bound workloads. I really like this summary from “Notes on Calculating Computer Performance” by Bruce Jacob and Trevor Mudge (in the same paper, the authors criticize the approach with the geometric mean):

Performance is therefore not a single number, but really a collection of implications. It is nothing more or less than the measure of how much time we save running our tests on the machines in question. If someone else has similar needs to ours, our performance numbers will be useful to them. However, two people with different sets of criteria will likely walk away with two completely different performance numbers for the same machine.

However, some other authors (e.g., “How not to lie with statistics: the correct way to summarize benchmark results”) actually recommend using the geometric mean to get such a number that describes the performance ratio of $$M_x$$ and $$M_y$$. I have to admit that the geometric mean could provide a reasonable result in some simple cases. Indeed, on normalized numbers, it works much better than the arithmetic mean (that provides meaningless result) because of its nice property: $$\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)$$. However, it doesn’t work properly in the general case. Firstly, the desire to express the difference between two machines is vicious: the result heavily depends on the chosen workloads. Secondly, the performance of a single benchmark $$b_i$$ couldn’t be described as a single number $$x_i$$: we should consider the whole performance distributions. In order to describe the difference between two distributions, we could consider the shift and ration functions (that work much better than the shift and ratio distributions).

Even if you consider a pretty homogenous set of benchmarks and all the distributions are pretty narrow, the geometric mean has severe drawbacks that you should keep in mind. In this post, I briefly cover some of these drawbacks and highlight problems that you may have if you use this metric.

## Matching quantile sets using likelihood based on the binomial coefficients

Let’s say we have a distribution $$X$$ that is given by its $$s$$-quantile values:

$q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1})$

where $$Q_X$$ is the quantile function of $$X$$, $$p_j = j / s$$.

We also have a sample $$y = \{y_1, y_2, \ldots, y_n \}$$ that is given by its $$s$$-quantile estimations:

$q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}),$

where $$Q_y$$ is the quantile estimation function for sample $$y$$. We also assume that $$q_{y_0} = \min(y)$$, $$q_{y_s} = \max(y)$$.

We want to know the likelihood of “$$y$$ is drawn from $$X$$”. In this post, I want to suggest a nice way to do this using the binomial coefficients.

## Ratio function vs. ratio distribution

Let’s say we have two distributions $$X$$ and $$Y$$. In the previous post, we discussed how to express the “absolute difference” between them using the shift function and the shift distribution. Now let’s discuss how to express the “relative difference” between them. This abstract term also could be expressed in various ways. My favorite approach is to build the ratio function. In order to do this, for each quantile $$p$$, we should calculate $$Q_Y(p)/Q_X(p)$$ where $$Q$$ is the quantile function. However, some people prefer using the ratio distribution $$Y/X$$. While both approaches may provide similar results for narrow positive non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

## Shift function vs. shift distribution

Let’s say we have two distributions $$X$$ and $$Y$$, and we want to express the “absolute difference” between them. This abstract term could be expressed in various ways. My favorite approach is to build the Doksum’s shift function. In order to do this, for each quantile $$p$$, we should calculate $$Q_Y(p)-Q_X(p)$$ where $$Q$$ is the quantile function. However, some people prefer using the shift distribution $$Y-X$$. While both approaches may provide similar results for narrow non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

## Preprint announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'

Since the beginning of this year, I have been working on building a quantile estimator that provides an optimal trade-off between statistical efficiency and robustness. Finally, I have built such an estimator. A paper preprint is available on arXiv: arXiv:2111.11776 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-thdqe. You can cite it as follows:

• Andrey Akinshin (2021) Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width, arXiv:2111.11776

## Non-normal median sampling distribution

Let’s consider the classic sample median. If a sample is sorted and the number of sample elements is odd, the median is the middle element. In the case of an even number of sample elements, the median is an arithmetic average of the two middle elements.

Now let’s say we randomly take many samples from the same distribution and calculate the median for each of them. Next, we build a sampling distribution based on these median values. There is a well-known fact that this distribution is asymptotically normal with mean $$M$$ and variance $$1/(4nf^2(M))$$, where $$n$$ is the number of elements in samples, $$f$$ is the probability density function of the original distribution, and $$M$$ is the true median of the original distribution.

Unfortunately, if we try to build such sampling distributions in practice, we may see that they are not always normal. There are some corner cases that prevent us from using the normal model in general. If you implement general routines that analyze the median behavior, you should keep such cases in mind. In this post, we briefly talk about some of these cases.