P² Quantile Estimator
Imagine that you are implementing performance telemetry in your application. There is an operation that is executed millions of times, and you want to know its “average” duration. Using the arithmetic mean is not a good idea because the obtained value can be easily spoiled by outliers. It’s much better to use the median, which is one of the most robust ways to describe the average.
The straightforward median estimation approach requires storing all the values. In our case, it’s a bad idea to keep all the values because it will significantly increase the memory footprint. Such telemetry is harmful because it may become a new bottleneck instead of monitoring the actual performance.
Another way to get the median value is to use a sequential quantile estimator (also known as an online quantile estimator or a streaming quantile estimator). This is an algorithm that allows calculating the median value (or any other quantile value) using a fixed amount of memory. Of course, it provides only an approximation of the real median value, but it’s usually enough for typical telemetry use cases.
In this post, I will show one of the simplest sequential quantile estimators that is called the P² quantile estimator (or the Piecewise-Parabolic quantile estimator).
The P² quantile estimator
This algorithm was initially suggested in jain1985. Below you can find a short overview of this approach, notes about typos in the original paper, numerical simulation, and a C# implementation.
The main idea
Let’s say we have a stream of observations $\{ x_0, x_1, x_2, x_3, x_4, \ldots \}$ and we want to estimate the p-quantile. The suggested approach introduces five markers that correspond to the estimations of
- $q_0$: The minimum
- $q_1$: The (p/2)-quantile
- $q_2$: The p-quantile
- $q_3$: The ((1+p)/2)-quantile
- $q_4$: The maximum
The $q_i$ values are known as the marker heights.
Also, we have to maintain the marker positions $\{ n_0, n_1, n_2, n_3, n_4 \}$. These integer values are the current indexes of the markers among the observations obtained so far.
Next, we have to define the marker desired positions $\{ n'_0, n'_1, n'_2, n'_3, n'_4 \}$. For the first $n$ observations, these real values are defined as follows:
- $n'_0 = 0$
- $n'_1 = (n - 1) p / 2$
- $n'_2 = (n - 1) p$
- $n'_3 = (n - 1) (1 + p) / 2$
- $n'_4 = (n - 1)$
In order to speed up the algorithm, we can precalculate increments of the desired positions which should be added to the current values after each new observation:
- $dn'_0 = 0$
- $dn'_1 = p / 2$
- $dn'_2 = p$
- $dn'_3 = (1 + p) / 2$
- $dn'_4 = 1$
Note that in the original paper, the authors use one-based indexing. I decided to adapt it to the zero-based indexing which is more convenient from the implementation point of view.
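These definitions are easy to sanity-check with a few lines of code. Here is an illustrative Python sketch (the function names are mine):

```python
def desired_positions(n, p):
    """Desired marker positions n'_i after n observations (zero-based)."""
    return [0.0,
            (n - 1) * p / 2,
            (n - 1) * p,
            (n - 1) * (1 + p) / 2,
            float(n - 1)]

def increments(p):
    """Per-observation increments dn'_i of the desired positions."""
    return [0.0, p / 2, p, (1 + p) / 2, 1.0]
```

For the median ($p = 0.5$) and $n = 5$, the desired positions are exactly $[0, 1, 2, 3, 4]$ and coincide with the initial marker positions.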
Initialization
Once we have collected the first five elements, we should perform the initialization logic:
$$ \left\{ \begin{array}{llll} q_0 = x_{(0)}, & n_0 = 0, & n'_0 = 0, & dn'_0 = 0,\\ q_1 = x_{(1)}, & n_1 = 1, & n'_1 = 2p, & dn'_1 = p/2,\\ q_2 = x_{(2)}, & n_2 = 2, & n'_2 = 4p, & dn'_2 = p,\\ q_3 = x_{(3)}, & n_3 = 3, & n'_3 = 2 + 2p, & dn'_3 = (1+p)/2,\\ q_4 = x_{(4)}, & n_4 = 4, & n'_4 = 4, & dn'_4 = 1. \end{array} \right. $$

Marker invalidation
For each $x_j$ with $j \geq 5$, we should invalidate our markers.
Firstly, we should adjust extreme marker heights (if $x_j < q_0$, we should update $q_0$; if $x_j > q_4$, we should update $q_4$) and find $k$ such that $q_k \leq x_j < q_{k+1}$ (or $q_k \leq x_j \leq q_{k+1}$ for $k=3$):
| Condition | $q_i$ update | $k$ |
|---|---|---|
| $x_j < q_0$ | $q_0 = x_j$ | 0 |
| $q_0 \leq x_j < q_1$ | | 0 |
| $q_1 \leq x_j < q_2$ | | 1 |
| $q_2 \leq x_j < q_3$ | | 2 |
| $q_3 \leq x_j < q_4$ | | 3 |
| $q_4 \leq x_j$ | $q_4 = x_j$ | 3 |
Secondly, we should update the marker positions and the marker desired positions:
$$ \begin{array}{lcl} n_i = n_i + 1 & \textrm{for} & i = k + 1, \ldots, 4; \\ n'_i = n'_i + dn'_i & \textrm{for} & i = 0, \ldots, 4. \\ \end{array} $$

Finally, we should adjust non-extreme marker heights ($q_i$) and positions ($n_i$) for $i \in \{ 1, 2, 3 \}$ in the following way:
for (i = 1; i <= 3; i++)
{
d = nꞌ[i] - n[i]
if (d >= 1 && n[i + 1] - n[i] > 1 ||
d <= -1 && n[i - 1] - n[i] < -1)
{
d = sign(d)
qꞌ = Parabolic(i, d)
if (!(q[i - 1] < qꞌ && qꞌ < q[i + 1]))
qꞌ = Linear(i, d)
q[i] = qꞌ
n[i] += d
}
}
The core equation of the algorithm is a piecewise-parabolic prediction (P²) formula that adjusts marker heights for each observation:
$$ q'_i = q_i + \dfrac{d}{n_{i+1}-n_{i-1}} \cdot \Bigg( (n_i-n_{i-1}+d)\dfrac{q_{i+1}-q_i}{n_{i+1}-n_i} + (n_{i+1}-n_i-d)\dfrac{q_i-q_{i-1}}{n_i-n_{i-1}} \Bigg). $$

Once we have calculated $q'_i$, we should check that $q_{i-1} < q'_i < q_{i+1}$. If this condition does not hold, we should ignore the parabolic prediction and use the linear prediction instead:
$$ q'_i = q_i + d \dfrac{q_{i+d}-q_i}{n_{i+d}-n_{i}}. $$

The result
Whenever the quantile estimation is requested, we just take the current value of $q_2$.
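To make the steps above concrete, here is a compact Python sketch of the whole algorithm (the class and method names are mine; a production-ready C# implementation is given at the end of the post):

```python
class P2Quantile:
    """Illustrative sketch of the P-square estimator (zero-based indexing)."""
    def __init__(self, p):
        self.p = p
        self.q = []                                    # marker heights
        self.n = [0, 1, 2, 3, 4]                       # marker positions
        self.ns = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]  # desired positions
        self.dns = [0.0, p / 2, p, (1 + p) / 2, 1.0]   # their increments
        self.count = 0

    def add(self, x):
        if self.count < 5:
            self.q.append(x)
            self.count += 1
            if self.count == 5:
                self.q.sort()                          # initialization
            return
        # Adjust extreme markers and find k such that q[k] <= x < q[k+1]
        if x < self.q[0]:
            self.q[0], k = x, 0
        elif x >= self.q[4]:
            self.q[4], k = x, 3
        else:
            k = next(i for i in range(4) if x < self.q[i + 1])
        for i in range(k + 1, 5):                      # update positions
            self.n[i] += 1
        for i in range(5):                             # update desired positions
            self.ns[i] += self.dns[i]
        for i in range(1, 4):                          # adjust inner markers
            d = self.ns[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                d = 1 if d > 0 else -1
                qp = self._parabolic(i, d)
                if not self.q[i - 1] < qp < self.q[i + 1]:
                    qp = self._linear(i, d)            # fall back to linear
                self.q[i] = qp
                self.n[i] += d
        self.count += 1

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
            (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def _linear(self, i, d):
        q, n = self.q, self.n
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    def quantile(self):
        if self.count <= 5:                            # not enough data yet
            qs = sorted(self.q)
            return qs[round((self.count - 1) * self.p)]
        return self.q[2]
```

On a monotonically increasing stream of 0..99 with $p = 0.5$, the estimate should stay close to 49.5, since the parabolic prediction is exact on linear data.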
Typos in the original paper
I found a few typos in the original paper that may confuse readers who want to implement the algorithm from scratch:
- Page 1079, Box 1, B2:
$i = k, \ldots, 5$
should be replaced by $i = k + 1, \ldots, 5$
- Page 1079, Box 1, B3:
$\textbf{THEN}\; q_i \leftarrow q_i$
should be replaced by $\textbf{THEN}\; q_i \leftarrow q'_i$
Numerical simulation
It’s time to check how it works. I decided to visualize sequential estimations of the following quantile estimators:
- The P² quantile estimator: the sequential estimator described above.
- The Type 7 quantile estimator: the most popular quantile estimator, used by default in R, Julia, NumPy, Excel (PERCENTILE, PERCENTILE.INC), and Python (the inclusive method). We call it “Type 7” according to the notation from hyndman1996, where Rob J. Hyndman and Yanan Fan described nine quantile algorithms used in statistical computer packages.
- The Harrell-Davis quantile estimator: my favorite option in real life for non-sequential cases because it’s more efficient than classic quantile estimators based on linear interpolation, and it provides more reliable estimations on small samples. This quantile estimator is described in harrell1982.
- Actual: the true median value, taken from the underlying distribution.
Below, you can find several plots for the following distributions:
- Normal distribution $\mathcal{N}(0, 1)$
- Gumbel distribution for $\mu = 0, \beta = 1$
- Beta distribution $\textrm{Beta}(10, 2)$
- Uniform distribution $\mathcal{U}(0, 1)$
- Bimodal distribution (mixture of $\mathcal{N}(10, 1)$ and $\mathcal{N}(20, 1)$)
Here are the results:
As you can see, the P² quantile estimator produces reasonable median estimates. I also checked how it works on a considerable number of real data sets, and I’m pretty satisfied with the results. A discussion about accuracy and the equation for the mean squared error can be found in the original paper.
Reference implementation
Below you can find a C# implementation of the discussed algorithm. Also, you can use it via the latest nightly version (0.3.0-nightly.64+) of perfolizer.
Update: an updated implementation is available here.
public class P2QuantileEstimator
{
private readonly double p;
private readonly int[] n = new int[5]; // marker positions
private readonly double[] ns = new double[5]; // desired marker positions
private readonly double[] dns = new double[5];
private readonly double[] q = new double[5]; // marker heights
private int count;
public P2QuantileEstimator(double probability)
{
p = probability;
}
public void AddValue(double x)
{
if (count < 5)
{
q[count++] = x;
if (count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
dns[0] = 0;
dns[1] = p / 2;
dns[2] = p;
dns[3] = (1 + p) / 2;
dns[4] = 1;
}
return;
}
int k;
if (x < q[0])
{
q[0] = x;
k = 0;
}
else if (x < q[1])
k = 0;
else if (x < q[2])
k = 1;
else if (x < q[3])
k = 2;
else if (x < q[4])
k = 3;
else
{
q[4] = x;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
for (int i = 0; i < 5; i++)
ns[i] += dns[i];
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (count <= 5)
{
Array.Sort(q, 0, count);
int index = (int) Math.Round((count - 1) * p);
return q[index];
}
return q[2];
}
}
Conclusion
The P² quantile estimator allows estimating quantile values on a stream of numbers without storing individual values. There are many other sequential quantile estimators, but the P² quantile estimator is one of the simplest from the implementation point of view. Meanwhile, it provides good accuracy and low overhead. The algorithm can be used to introduce performance telemetry in your application without a noticeable performance or memory footprint penalty.
Update: the estimator accuracy could be improved using a bunch of patches.
The P² quantile estimator is a sequential estimator that uses $O(1)$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I already wrote a blog post about this approach and added its implementation in perfolizer. Recently, I got a bug report that revealed a flaw of the original paper. In this post, I’m going to briefly discuss this issue and the corresponding fix.
Introduction
I already described the P² quantile estimator algorithm in one of the previous blog posts. This approach includes maintaining five desired marker positions $\{ n'_0, n'_1, n'_2, n'_3, n'_4 \}$. If the considered sequence currently contains exactly $n$ elements, the marker values are defined as follows:
- $n'_0 = 0$
- $n'_1 = (n - 1) p / 2$
- $n'_2 = (n - 1) p$
- $n'_3 = (n - 1) (1 + p) / 2$
- $n'_4 = (n - 1)$
After adding another element in the sequence, the marker values should be invalidated. The original paper suggests using the marker increments to “reduce CPU overhead”:
- $dn'_0 = 0$
- $dn'_1 = p / 2$
- $dn'_2 = p$
- $dn'_3 = (1 + p) / 2$
- $dn'_4 = 1$
@AnthonyLloyd has reported that such an approach has rounding issues. Let’s try to fix this problem and see how it affects the approach.
Experiments
Let’s implement the classic approach with increments (P2QuantileEstimatorOriginal) and the new one without increments (P2QuantileEstimatorPatched):
public class P2QuantileEstimatorOriginal
{
private readonly double p;
private readonly int[] n = new int[5]; // marker positions
private readonly double[] ns = new double[5]; // desired marker positions
private readonly double[] dns = new double[5];
private readonly double[] q = new double[5]; // marker heights
private int count;
public P2QuantileEstimatorOriginal(double probability)
{
p = probability;
}
public void AddValue(double x)
{
if (count < 5)
{
q[count++] = x;
if (count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
dns[0] = 0;
dns[1] = p / 2;
dns[2] = p;
dns[3] = (1 + p) / 2;
dns[4] = 1;
}
return;
}
int k;
if (x < q[0])
{
q[0] = x;
k = 0;
}
else if (x < q[1])
k = 0;
else if (x < q[2])
k = 1;
else if (x < q[3])
k = 2;
else if (x < q[4])
k = 3;
else
{
q[4] = x;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
for (int i = 0; i < 5; i++)
ns[i] += dns[i];
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (count <= 5)
{
Array.Sort(q, 0, count);
int index = (int)Math.Round((count - 1) * p);
return q[index];
}
return q[2];
}
}
public class P2QuantileEstimatorPatched
{
private readonly double p;
private readonly int[] n = new int[5]; // marker positions
private readonly double[] ns = new double[5]; // desired marker positions
private readonly double[] q = new double[5]; // marker heights
private int count;
public P2QuantileEstimatorPatched(double probability)
{
p = probability;
}
public void AddValue(double x)
{
if (count < 5)
{
q[count++] = x;
if (count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
}
return;
}
int k;
if (x < q[0])
{
q[0] = x;
k = 0;
}
else if (x < q[1])
k = 0;
else if (x < q[2])
k = 1;
else if (x < q[3])
k = 2;
else if (x < q[4])
k = 3;
else
{
q[4] = x;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
ns[1] = count * p / 2;
ns[2] = count * p;
ns[3] = count * (1 + p) / 2;
ns[4] = count;
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (count <= 5)
{
Array.Sort(q, 0, count);
int index = (int)Math.Round((count - 1) * p);
return q[index];
}
return q[2];
}
}
The main change is replacing
for (int i = 0; i < 5; i++)
ns[i] += dns[i];
by
ns[1] = count * p / 2;
ns[2] = count * p;
ns[3] = count * (1 + p) / 2;
ns[4] = count;
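The effect of this change is easy to reproduce in isolation: repeatedly accumulating $dn'_i$ drifts away from the directly computed desired position because of floating-point rounding. Here is a standalone Python illustration (not the estimator code itself):

```python
p = 0.6
incremental = 0.0
for count in range(1, 1_000_001):
    incremental += p / 2      # the original approach: ns[1] += dns[1]
direct = 1_000_000 * p / 2    # the patched approach: ns[1] = count * p / 2

# After a million observations the two values are no longer equal
print(incremental == direct)
print(abs(incremental - direct))
```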
Now let’s benchmark it using BenchmarkDotNet:
[LongRunJob]
public class P2Benchmarks
{
[Benchmark]
public void Original()
{
var estimator = new P2QuantileEstimatorOriginal(0.5);
for (int i = 0; i < 1_000_000; i++)
estimator.AddValue(i);
}
[Benchmark]
public void Patched()
{
var estimator = new P2QuantileEstimatorPatched(0.5);
for (int i = 0; i < 1_000_000; i++)
estimator.AddValue(i);
}
}
class Program
{
static void Main()
{
BenchmarkRunner.Run<P2Benchmarks>();
}
}
Here are the benchmark results:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1288 (20H2/October2020Update)
Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.300
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
LongRun : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
| Method | Mean | Error | StdDev |
|--------- |---------:|---------:|---------:|
| Original | 24.47 ms | 0.028 ms | 0.138 ms |
| Patched | 21.72 ms | 0.026 ms | 0.133 ms |
As we can see, the Patched version works even faster than the Original one. The difference in performance can be explained by the manual loop unrolling and the reduced number of operations (we do not update $n'_0$ in Patched since $dn'_0 = 0$). Further optimizations are possible, but the current implementation doesn’t aim to be the fastest one; it aims to be the readable one. The most important thing here is that the patch doesn’t lead to a performance regression. Meanwhile, it also reduces the memory overhead: we no longer have to keep the dns array with five double elements.
Now it’s time to check the impact on the estimator result. Let’s consider the following code snippet:
var random = new Random(1729);
var original = new P2QuantileEstimatorOriginal(0.6);
var patched = new P2QuantileEstimatorPatched(0.6);
for (int i = 0; i < 100; i++)
{
var x = random.NextDouble();
original.AddValue(x);
patched.AddValue(x);
}
Console.WriteLine("Original : " + original.GetQuantile());
Console.WriteLine("Patched : " + patched.GetQuantile());
Here is the corresponding output:
Original : 0.6094896389457989
Patched : 0.6053711159656534
As we can see, the difference is noticeable.
Conclusion
The suggested change:
- Improves performance keeping the same readability level
- Reduces memory overhead
- Fixes the internal calculations
Thus, I have decided to update the corresponding implementation in perfolizer. The fix is available in perfolizer v0.3.0-nightly.106+. Here is the updated copy-pastable version of the reference implementation:
public class P2QuantileEstimator
{
private readonly double p;
private readonly int[] n = new int[5]; // marker positions
private readonly double[] ns = new double[5]; // desired marker positions
private readonly double[] q = new double[5]; // marker heights
private int count;
public P2QuantileEstimator(double probability)
{
p = probability;
}
public void AddValue(double x)
{
if (count < 5)
{
q[count++] = x;
if (count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
}
return;
}
int k;
if (x < q[0])
{
q[0] = x;
k = 0;
}
else if (x < q[1])
k = 0;
else if (x < q[2])
k = 1;
else if (x < q[3])
k = 2;
else if (x < q[4])
k = 3;
else
{
q[4] = x;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
ns[1] = count * p / 2;
ns[2] = count * p;
ns[3] = count * (1 + p) / 2;
ns[4] = count;
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (count <= 5)
{
Array.Sort(q, 0, count);
int index = (int)Math.Round((count - 1) * p);
return q[index];
}
return q[2];
}
}
Update: the estimator accuracy could be improved using a bunch of patches.
The P² quantile estimator is a sequential estimator that uses $O(1)$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I have already written a few blog posts about it:
I tried this estimator in various contexts, and it shows pretty decent results. However, I recently stumbled on a corner case: if we want to estimate an extreme quantile ($p < 0.1$ or $p > 0.9$), this estimator provides inaccurate results on streams with a small number of elements ($n < 10$). While it looks like a minor issue, it would be nice to fix it. In this post, we briefly discuss choosing a better initialization strategy to work around this problem.
The problem
I assume that you have already read the original post and understand the approach behind P² quantile estimator. Once we observed the first five elements, the original paper jain1985 suggests using the following initial values:
$$ \left\{ \begin{array}{lll} n'_0 = 0, & n_0 = 0, & q_0 = x_{(0)},\\ n'_1 = 2p, & n_1 = 1, & q_1 = x_{(1)},\\ n'_2 = 4p, & n_2 = 2, & q_2 = x_{(2)},\\ n'_3 = 2 + 2p, & n_3 = 3, & q_3 = x_{(3)},\\ n'_4 = 4, & n_4 = 4, & q_4 = x_{(4)}. \end{array} \right. $$

Thus, the initial value of $q_2$ (which holds our current estimation of the target quantile) is the sample median of the first five elements. It’s a good estimation for $p=0.5$. Unfortunately, in the extreme cases ($p < 0.1$ or $p > 0.9$), such an estimation is not accurate. This inaccuracy is not noticeable after processing a huge number of stream elements (e.g., $n > 100$), but it leads to confusing results on small samples ($n < 10$).
The solution
Since we know the desired marker positions $n'_i$, let’s choose our initial marker parameters according to these positions:
$$ \left\{ \begin{array}{lll} n'_0 = 0, & n_0 = \lfloor n'_0 \rceil, & q_0 = x_{(n_0)},\\ n'_1 = 2p, & n_1 = \lfloor n'_1 \rceil, & q_1 = x_{(n_1)},\\ n'_2 = 4p, & n_2 = \lfloor n'_2 \rceil, & q_2 = x_{(n_2)},\\ n'_3 = 2 + 2p, & n_3 = \lfloor n'_3 \rceil, & q_3 = x_{(n_3)},\\ n'_4 = 4, & n_4 = \lfloor n'_4 \rceil, & q_4 = x_{(n_4)}, \end{array} \right. $$

where $\lfloor n'_i \rceil$ is the rounding operator.
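This strategy can be sketched in Python as follows (the function name is mine; note that both Python’s round and C#’s Math.Round default to round-half-to-even):

```python
def adaptive_init(first_five, p):
    """Initial markers chosen by the rounded desired positions (a sketch)."""
    xs = sorted(first_five)
    ns = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]  # desired positions for n = 5
    n = [round(v) for v in ns]                # n_i = round(n'_i)
    q = [xs[i] for i in n]                    # q_i = x_(n_i)
    return n, q
```

For $p = 0.95$, this gives $n = [0, 2, 4, 4, 4]$, so $q_2$ starts at the sample maximum instead of the sample median.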
New Reference Implementation
public class P2QuantileEstimator
{
private readonly double p;
private readonly InitializationStrategy strategy;
private readonly int[] n = new int[5];
private readonly double[] ns = new double[5];
private readonly double[] q = new double[5];
public int Count { get; private set; }
public enum InitializationStrategy
{
Classic,
Adaptive
}
public P2QuantileEstimator(double probability, InitializationStrategy strategy = InitializationStrategy.Classic)
{
p = probability;
this.strategy = strategy;
}
public void Add(double value)
{
if (Count < 5)
{
q[Count++] = value;
if (Count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
if (strategy == InitializationStrategy.Adaptive)
{
Array.Copy(q, ns, 5);
n[1] = (int)Math.Round(2 * p);
n[2] = (int)Math.Round(4 * p);
n[3] = (int)Math.Round(2 + 2 * p);
q[1] = ns[n[1]];
q[2] = ns[n[2]];
q[3] = ns[n[3]];
}
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
}
return;
}
int k;
if (value < q[0])
{
q[0] = value;
k = 0;
}
else if (value < q[1])
k = 0;
else if (value < q[2])
k = 1;
else if (value < q[3])
k = 2;
else if (value < q[4])
k = 3;
else
{
q[4] = value;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
ns[1] = Count * p / 2;
ns[2] = Count * p;
ns[3] = Count * (1 + p) / 2;
ns[4] = Count;
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
Count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (Count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (Count <= 5)
{
Array.Sort(q, 0, Count);
int index = (int)Math.Round((Count - 1) * p);
return q[index];
}
return q[2];
}
public void Clear()
{
Count = 0;
}
}
Simulation study
Now let’s perform the following experiment. We enumerate different distributions (the standard uniform, the standard normal, the Gumbel), different sample sizes (6, 7, 8), and different quantile probabilities (0.05, 0.1, 0.2, 0.8, 0.9, 0.95). For each combination of the input parameters, we perform 10000 simulations of the sample experiment: generate a random sample of the given size from the given distribution and estimate the given quantile using the classic initialization strategy and the new adaptive initialization strategy. As a baseline, we use the traditional Hyndman-Fan Type 7 quantile estimator. The initialization strategy that gives a better estimation (compared to the baseline) is the “winner” of the corresponding experiment. For each combination of the input parameters, we calculate the percentage of wins for each strategy. Here is the source code of this simulation:
var random = new Random(1729);
var distributions = new IContinuousDistribution[]
{
new UniformDistribution(0, 1),
new NormalDistribution(0, 1),
new GumbelDistribution()
};
Console.WriteLine(" Classic Adaptive");
foreach (int n in new[] { 6, 7, 8 })
foreach (var probability in new Probability[] { 0.05, 0.1, 0.2, 0.8, 0.9, 0.95 })
foreach (var distribution in distributions)
{
var randomGenerator = distribution.Random(random);
const int totalIterations = 10_000;
int classicIsWinner = 0;
for (int iteration = 0; iteration < totalIterations; iteration++)
{
var p2ClassicEstimator = new P2QuantileEstimator(probability, P2QuantileEstimator.InitializationStrategy.Classic);
var p2AdaptiveEstimator = new P2QuantileEstimator(probability, P2QuantileEstimator.InitializationStrategy.Adaptive);
var values = new List<double>();
for (int i = 0; i < n; i++)
{
double x = randomGenerator.Next();
values.Add(x);
p2ClassicEstimator.Add(x);
p2AdaptiveEstimator.Add(x);
}
double simpleEstimation = SimpleQuantileEstimator.Instance.GetQuantile(values, probability);
double p2ClassicEstimation = p2ClassicEstimator.GetQuantile();
double p2AdaptiveEstimation = p2AdaptiveEstimator.GetQuantile();
if (Math.Abs(p2ClassicEstimation - simpleEstimation) < Math.Abs(p2AdaptiveEstimation - simpleEstimation))
classicIsWinner++;
}
int adaptiveIsWinner = totalIterations - classicIsWinner;
string title = distribution.GetType().Name.Replace("Distribution", "").PadRight(7) + " " +
"P" + (probability * 100).ToString().PadRight(2) + " " +
"N" + n;
Console.WriteLine($"{title,-15}: {classicIsWinner / 100.0,6:N2}% {(adaptiveIsWinner / 100.0),6:N2}%");
}
And here are the results:
Classic Adaptive
Uniform P5 N6 : 1.31% 98.69%
Normal P5 N6 : 1.86% 98.14%
Gumbel P5 N6 : 1.63% 98.37%
Uniform P10 N6 : 2.61% 97.39%
Normal P10 N6 : 3.60% 96.40%
Gumbel P10 N6 : 3.27% 96.73%
Uniform P20 N6 : 12.05% 87.95%
Normal P20 N6 : 14.20% 85.80%
Gumbel P20 N6 : 13.16% 86.84%
Uniform P80 N6 : 0.96% 99.04%
Normal P80 N6 : 1.45% 98.55%
Gumbel P80 N6 : 2.58% 97.42%
Uniform P90 N6 : 0.00% 100.00%
Normal P90 N6 : 0.00% 100.00%
Gumbel P90 N6 : 0.00% 100.00%
Uniform P95 N6 : 0.00% 100.00%
Normal P95 N6 : 0.00% 100.00%
Gumbel P95 N6 : 0.00% 100.00%
Uniform P5 N7 : 4.38% 95.62%
Normal P5 N7 : 5.48% 94.52%
Gumbel P5 N7 : 5.34% 94.66%
Uniform P10 N7 : 16.04% 83.96%
Normal P10 N7 : 21.63% 78.37%
Gumbel P10 N7 : 18.08% 81.92%
Uniform P20 N7 : 24.15% 75.85%
Normal P20 N7 : 27.10% 72.90%
Gumbel P20 N7 : 26.10% 73.90%
Uniform P80 N7 : 10.73% 89.27%
Normal P80 N7 : 13.36% 86.64%
Gumbel P80 N7 : 13.17% 86.83%
Uniform P90 N7 : 10.07% 89.93%
Normal P90 N7 : 14.74% 85.26%
Gumbel P90 N7 : 17.34% 82.66%
Uniform P95 N7 : 1.26% 98.74%
Normal P95 N7 : 1.58% 98.42%
Gumbel P95 N7 : 1.89% 98.11%
Uniform P5 N8 : 7.54% 92.46%
Normal P5 N8 : 9.58% 90.42%
Gumbel P5 N8 : 8.44% 91.56%
Uniform P10 N8 : 23.83% 76.17%
Normal P10 N8 : 32.72% 67.28%
Gumbel P10 N8 : 28.00% 72.00%
Uniform P20 N8 : 35.11% 64.89%
Normal P20 N8 : 39.02% 60.98%
Gumbel P20 N8 : 37.02% 62.98%
Uniform P80 N8 : 24.83% 75.17%
Normal P80 N8 : 28.79% 71.21%
Gumbel P80 N8 : 29.60% 70.40%
Uniform P90 N8 : 18.42% 81.58%
Normal P90 N8 : 24.63% 75.37%
Gumbel P90 N8 : 28.92% 71.08%
Uniform P95 N8 : 4.25% 95.75%
Normal P95 N8 : 4.98% 95.02%
Gumbel P95 N8 : 5.51% 94.49%
As we can see, the new “adaptive” strategy shows much better results than the classic one.
I have already written a few blog posts about the P² quantile estimator (which is a sequential estimator that uses $O(1)$ memory):
- P² Quantile Estimator
- P² quantile estimator rounding issue
- P² quantile estimator initialization strategy
In this post, we continue improving the P² implementation so that it gives better estimations for streams with a small number of elements.
The problem
In the previous post, I performed the following experiment:
We enumerate different distributions (the standard uniform, the standard normal), different sample sizes (6, 7, 8), and different quantile probabilities (0.05, 0.1, 0.2, 0.8, 0.9, 0.95). For each combination of the input parameters, we perform 10000 simulations of the sample experiment: generate a random sample of the given size from the given distribution and estimate the given quantile using the classic initialization strategy and the new adaptive initialization strategy. As a baseline, we use the traditional Hyndman-Fan Type 7 quantile estimator. The initialization strategy that gives a better estimation (compared to the baseline) is the “winner” of the corresponding experiment. For each combination of the input parameters, we calculate the percentage of wins for each strategy.
Here is the source code of this simulation:
var random = new Random(1729);
var distributions = new IContinuousDistribution[]
{
new UniformDistribution(0, 1),
new NormalDistribution(0, 1)
};
Console.WriteLine(" Classic Adaptive");
foreach (var distribution in distributions)
foreach (int n in new[] { 6, 7, 8 })
foreach (var probability in new Probability[] { 0.05, 0.1, 0.2, 0.8, 0.9, 0.95 })
{
var randomGenerator = distribution.Random(random);
const int totalIterations = 10_000;
int classicIsWinner = 0;
for (int iteration = 0; iteration < totalIterations; iteration++)
{
var p2ClassicEstimator = new P2QuantileEstimator(probability, P2QuantileEstimator.InitializationStrategy.Classic);
var p2AdaptiveEstimator = new P2QuantileEstimator(probability, P2QuantileEstimator.InitializationStrategy.Adaptive);
var values = new List<double>();
for (int i = 0; i < n; i++)
{
double x = randomGenerator.Next();
values.Add(x);
p2ClassicEstimator.Add(x);
p2AdaptiveEstimator.Add(x);
}
double simpleEstimation = SimpleQuantileEstimator.Instance.GetQuantile(values, probability);
double p2ClassicEstimation = p2ClassicEstimator.GetQuantile();
double p2AdaptiveEstimation = p2AdaptiveEstimator.GetQuantile();
if (Math.Abs(p2ClassicEstimation - simpleEstimation) < Math.Abs(p2AdaptiveEstimation - simpleEstimation))
classicIsWinner++;
}
int adaptiveIsWinner = totalIterations - classicIsWinner;
string title = distribution.GetType().Name.Replace("Distribution", "").PadRight(7) + " " +
"P" + (probability * 100).ToString().PadRight(2) + " " +
"N" + n;
Console.WriteLine($"{title,-15}: {classicIsWinner / 100.0,6:N2}% {(adaptiveIsWinner / 100.0),6:N2}%");
}
I got the following results for the Uniform and the Normal distributions:
Classic Adaptive
Uniform P5 N6 : 1.31% 98.69%
Uniform P10 N6 : 2.62% 97.38%
Uniform P20 N6 : 11.44% 88.56%
Uniform P80 N6 : 0.97% 99.03%
Uniform P90 N6 : 0.00% 100.00%
Uniform P95 N6 : 0.00% 100.00%
Uniform P5 N7 : 4.00% 96.00%
Uniform P10 N7 : 14.68% 85.32%
Uniform P20 N7 : 24.52% 75.48%
Uniform P80 N7 : 10.41% 89.59%
Uniform P90 N7 : 10.31% 89.69%
Uniform P95 N7 : 1.14% 98.86%
Uniform P5 N8 : 7.50% 92.50%
Uniform P10 N8 : 22.87% 77.13%
Uniform P20 N8 : 35.12% 64.88%
Uniform P80 N8 : 24.98% 75.02%
Uniform P90 N8 : 17.81% 82.19%
Uniform P95 N8 : 3.94% 96.06%
Normal P5 N6 : 1.73% 98.27%
Normal P10 N6 : 3.80% 96.20%
Normal P20 N6 : 13.93% 86.07%
Normal P80 N6 : 1.55% 98.45%
Normal P90 N6 : 0.00% 100.00%
Normal P95 N6 : 0.00% 100.00%
Normal P5 N7 : 5.86% 94.14%
Normal P10 N7 : 21.15% 78.85%
Normal P20 N7 : 27.72% 72.28%
Normal P80 N7 : 12.34% 87.66%
Normal P90 N7 : 14.50% 85.50%
Normal P95 N7 : 1.54% 98.46%
Normal P5 N8 : 9.18% 90.82%
Normal P10 N8 : 32.63% 67.37%
Normal P20 N8 : 38.88% 61.12%
Normal P80 N8 : 28.81% 71.19%
Normal P90 N8 : 25.45% 74.55%
Normal P95 N8 : 4.92% 95.08%
Since both the Normal and the Uniform distributions are symmetric, we could expect symmetric results. However, the result table is asymmetric: the obtained numbers for P5/P10/P20 don’t match the corresponding numbers for P80/P90/P95.
The solution
The final stage of the P² quantile estimator suggests adjusting non-extreme marker heights ($q_i$) and positions ($n_i$) for $i \in \{ 1, 2, 3\} $ (see the algorithm description and the original paper jain1985 for details):
for (i = 1; i <= 3; i++)
{
d = nꞌ[i] - n[i]
if (d >= 1 && n[i + 1] - n[i] > 1 ||
d <= -1 && n[i - 1] - n[i] < -1)
{
d = sign(d)
qꞌ = Parabolic(i, d)
if (!(q[i - 1] < qꞌ && qꞌ < q[i + 1]))
qꞌ = Linear(i, d)
q[i] = qꞌ
n[i] += d
}
}
The core equation of the algorithm is a piecewise-parabolic prediction (P²) formula that adjusts marker heights for each observation:
$$ q'_i = q_i + \dfrac{d}{n_{i+1}-n_{i-1}} \cdot \Bigg( (n_i-n_{i-1}+d)\dfrac{q_{i+1}-q_i}{n_{i+1}-n_i} + (n_{i+1}-n_i-d)\dfrac{q_i-q_{i-1}}{n_i-n_{i-1}} \Bigg). $$

Once we have calculated $q'_i$, we should check that $q_{i-1} < q'_i < q_{i+1}$. If this condition doesn't hold, we should discard the parabolic prediction and use the linear prediction instead:
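To make both predictions concrete, here is a small Python sketch (Python rather than the post's C#, purely for illustration; the function names are mine). The arrays `q` and `n` and the index `i` follow the notation above:

```python
def parabolic(q, n, i, d):
    """Piecewise-parabolic (P^2) prediction for marker i shifted by d (+1 or -1)."""
    return q[i] + d / (n[i + 1] - n[i - 1]) * (
        (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
        + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
    )

def linear(q, n, i, d):
    """Fallback linear prediction, used when the parabolic one leaves (q[i-1], q[i+1])."""
    return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

# Three markers lying on the parabola y = x^2 / 4:
q = [0.0, 1.0, 4.0]
n = [0, 2, 4]
print(parabolic(q, n, 1, 1))  # 2.25: the value of that parabola at x = 3
print(linear(q, n, 1, 1))     # 2.5:  the value of the chord at x = 3
```

The example shows why the formula is called piecewise-parabolic: for markers that already lie on a parabola, the prediction lands exactly on it, while the linear fallback follows the chord between the two neighboring markers.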
$$ q'_i = q_i + d \dfrac{q_{i+d}-q_i}{n_{i+d}-n_{i}}. $$

The problem arises when the number of elements in the stream is small and we use the adaptive initialization strategy. Since we could have collisions across $\{ n_i \}$, the order of adjusting is important. Currently, it's optimal only for higher quantiles ($p > 0.5$), but not for lower quantiles ($p < 0.5$). Let's extract the adjusting logic from the above snippet into a named method:
for (i = 1; i <= 3; i++)
Adjust(i);
Now it’s easy to introduce an adaptive adjusting order depending on the value of $p$:
if (p >= 0.5)
{
for (int i = 1; i <= 3; i++)
Adjust(i);
}
else
{
for (int i = 3; i >= 1; i--)
Adjust(i);
}
If we rerun the simulation from the beginning of the post, we get the following results:
Classic Adaptive
Uniform P5 N6 : 0.00% 100.00%
Uniform P10 N6 : 0.00% 100.00%
Uniform P20 N6 : 0.84% 99.16%
Uniform P80 N6 : 0.97% 99.03%
Uniform P90 N6 : 0.00% 100.00%
Uniform P95 N6 : 0.00% 100.00%
Uniform P5 N7 : 1.19% 98.81%
Uniform P10 N7 : 9.47% 90.53%
Uniform P20 N7 : 10.77% 89.23%
Uniform P80 N7 : 10.41% 89.59%
Uniform P90 N7 : 10.31% 89.69%
Uniform P95 N7 : 1.14% 98.86%
Uniform P5 N8 : 3.91% 96.09%
Uniform P10 N8 : 17.48% 82.52%
Uniform P20 N8 : 25.13% 74.87%
Uniform P80 N8 : 24.98% 75.02%
Uniform P90 N8 : 17.81% 82.19%
Uniform P95 N8 : 3.94% 96.06%
Normal P5 N6 : 0.00% 100.00%
Normal P10 N6 : 0.00% 100.00%
Normal P20 N6 : 1.63% 98.37%
Normal P80 N6 : 1.55% 98.45%
Normal P90 N6 : 0.00% 100.00%
Normal P95 N6 : 0.00% 100.00%
Normal P5 N7 : 1.81% 98.19%
Normal P10 N7 : 13.87% 86.13%
Normal P20 N7 : 12.85% 87.15%
Normal P80 N7 : 12.34% 87.66%
Normal P90 N7 : 14.50% 85.50%
Normal P95 N7 : 1.54% 98.46%
Normal P5 N8 : 4.84% 95.16%
Normal P10 N8 : 25.05% 74.95%
Normal P20 N8 : 28.43% 71.57%
Normal P80 N8 : 28.81% 71.19%
Normal P90 N8 : 25.45% 74.55%
Normal P95 N8 : 4.92% 95.08%
Now it looks quite symmetric. The problem is solved!
The updated reference implementation
public class P2QuantileEstimator
{
private readonly double p;
private readonly InitializationStrategy strategy;
private readonly int[] n = new int[5];
private readonly double[] ns = new double[5];
private readonly double[] q = new double[5];
public int Count { get; private set; }
public enum InitializationStrategy
{
Classic,
Adaptive
}
public P2QuantileEstimator(double probability,
InitializationStrategy strategy = InitializationStrategy.Classic)
{
p = probability;
this.strategy = strategy;
}
public void Add(double value)
{
if (Count < 5)
{
q[Count++] = value;
if (Count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
if (strategy == InitializationStrategy.Adaptive)
{
Array.Copy(q, ns, 5);
n[1] = (int)Math.Round(2 * p);
n[2] = (int)Math.Round(4 * p);
n[3] = (int)Math.Round(2 + 2 * p);
q[1] = ns[n[1]];
q[2] = ns[n[2]];
q[3] = ns[n[3]];
}
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
}
return;
}
int k;
if (value < q[0])
{
q[0] = value;
k = 0;
}
else if (value < q[1])
k = 0;
else if (value < q[2])
k = 1;
else if (value < q[3])
k = 2;
else if (value < q[4])
k = 3;
else
{
q[4] = value;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
ns[1] = Count * p / 2;
ns[2] = Count * p;
ns[3] = Count * (1 + p) / 2;
ns[4] = Count;
if (p >= 0.5)
{
for (int i = 1; i <= 3; i++)
Adjust(i);
}
else
{
for (int i = 3; i >= 1; i--)
Adjust(i);
}
Count++;
}
private void Adjust(int i)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (Count == 0)
throw new InvalidOperationException("Sequence contains no elements");
if (Count <= 5)
{
Array.Sort(q, 0, Count);
int index = (int)Math.Round((Count - 1) * p);
return q[index];
}
return q[2];
}
public void Clear()
{
Count = 0;
}
}
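For readers who want to experiment without a C# toolchain, below is a Python sketch of the same estimator. It is a hand-written port, not the reference implementation: it keeps the classic five-element initialization and the adaptive adjusting order, but omits the adaptive initialization strategy to stay short. The class and method names are mine:

```python
class P2QuantileEstimator:
    """P^2 sequential p-quantile estimator (classic init, adaptive adjusting order)."""

    def __init__(self, probability):
        self.p = probability
        self.n = [0] * 5       # actual marker positions (integers)
        self.ns = [0.0] * 5    # desired marker positions
        self.q = [0.0] * 5     # marker heights
        self.count = 0

    def add(self, value):
        p = self.p
        if self.count < 5:
            self.q[self.count] = value
            self.count += 1
            if self.count == 5:
                self.q.sort()
                self.n = [0, 1, 2, 3, 4]
                self.ns = [0.0, 2 * p, 4 * p, 2 + 2 * p, 4.0]
            return
        # Find the cell k containing the new value, updating extreme markers.
        if value < self.q[0]:
            self.q[0] = value; k = 0
        elif value < self.q[1]:
            k = 0
        elif value < self.q[2]:
            k = 1
        elif value < self.q[3]:
            k = 2
        elif value < self.q[4]:
            k = 3
        else:
            self.q[4] = value; k = 3
        for i in range(k + 1, 5):
            self.n[i] += 1
        self.ns[1] = self.count * p / 2
        self.ns[2] = self.count * p
        self.ns[3] = self.count * (1 + p) / 2
        self.ns[4] = self.count
        # Adaptive adjusting order: ascending for p >= 0.5, descending otherwise.
        order = range(1, 4) if p >= 0.5 else range(3, 0, -1)
        for i in order:
            self._adjust(i)
        self.count += 1

    def _adjust(self, i):
        q, n, ns = self.q, self.n, self.ns
        d = ns[i] - n[i]
        if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
            s = 1 if d > 0 else -1
            qs = q[i] + s / (n[i + 1] - n[i - 1]) * (
                (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
                + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))
            if not (q[i - 1] < qs < q[i + 1]):  # fall back to linear prediction
                qs = q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])
            q[i] = qs
            n[i] += s

    def quantile(self):
        if self.count == 0:
            raise ValueError("Sequence contains no elements")
        if self.count <= 5:
            qs = sorted(self.q[:self.count])
            return qs[round((self.count - 1) * self.p)]
        return self.q[2]


import random
random.seed(42)
est = P2QuantileEstimator(0.5)
for _ in range(10_000):
    est.add(random.random())
print(est.quantile())  # close to the true median 0.5
```

The port mirrors the C# code line by line (the `k` search, the desired-position updates, the `Adjust` condition), so it should reproduce the same estimates up to floating-point noise on the same input stream.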
See also
The P² algorithm for dynamic calculation of quantiles and histograms without storing observations · 1985 · Raj Jain, Imrich Chlamtac
Simultaneous estimation of several percentiles · 1987 · Kimmo E. E. Raatikainen