In one of the previous posts, I described the P² quantile estimator. It allows estimating quantiles on a stream of numbers without storing them. Such sequential (streaming/online) quantile estimators are useful in software telemetry because they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.
After the publication, I got a lot of questions about moving sequential quantile estimators. Such estimators return quantile values not for the whole stream of numbers, but only for the recent values. So, I wrote another post about a quantile estimator based on a partitioning heaps (inspired by the Hardle-Steiger method). This algorithm gives you the exact value of any order statistics for the last \(L\) numbers (\(L\) is known as the window size). However, it requires \(O(L)\) memory, and it takes \(O(log(L))\) time to process each element. This may be acceptable in some cases. Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large \(L\).
In this post, I’m going to present a moving modification of the P² quantile estimator. Let’s call it MP² (moving P²). It requires \(O(1)\) memory, it takes \(O(1)\) to process each element, and it supports windows of any size. Of course, we have a trade-off with the estimation accuracy: it returns a quantile approximation instead of the exact order statistics. However, in most cases, the MP² estimations are pretty accurate from the practical point of view.
Let’s discuss MP² in detail!
The idea
The approach behind the P² quantile estimator is elegant, so I recommend reading it first. However, it’s not required: we are going to use it as a black box. We are going to consider an approach that can be applied for “movification” of any sequential quantile estimator.
Let me show the main idea on a simple example with window size \(L = 100\). Imagine that we processed the first 230 elements of our stream \(x\):
We want to estimate the quantile value for the last \(100\) elements. It gives us the following range (assuming one-based indexing):
- \(x_{131}..x_{230}\) (the “target” window).
Let’s build a quantile estimator that gives an estimation \(E_0\) for the “target” window. Instead of direct quantile estimating inside this window, we are going to work with two non-overlapped fixed-offset windows:
- \(x_{101}..x_{200}\) (the “previous” window).
- \(x_{201}..x_{300}\) (the “current” window).
We assume that the quantile estimation for the previous window is known. For the current window, we maintain an independent P² quantile estimator that accumulate the observed values. Let’s introduce the following notation:
- \(E_1\): the known quantile estimation for the previous window.
- \(E_2\): the current quantile estimation for the current window based on the P² quantile estimator.
- \(k\): the number of processed elements inside the current window.
Now we can approximate the quantile value for the target window using the following equation:
\[E_0 = \dfrac{(L-k) \cdot E_1 + k \cdot E_2}{L}. \]
The target estimation \(E_0\) is a weighted sum of two existing estimates \(E_1\) and \(E_2\):
- The weight of \(E_1\) is \((L-k)/L\) because the previous window covers \(L-k\) elements of the target window.
- The weight of \(E_2\) is \(k/L\) because the current window covers \(k\) elements of the target window.
For the above example, \(k=30\) (because we have processed only \(x_{201}..x_{230}\) from the current window), \(L=100\). Thus, \(E_0 = ((100-30) E_1 + 30E_2)/100 = 0.7E_1 + 0.3E_2\).
I hope you got the main idea. Now it’s time to formalize the algorithm. Let’s say, we process the \(n^{\textrm{th}}\) element. Depending on \(n\), we should do the following (assuming one-based indexing):
- For \(1 \leq n \leq L\):
- Process \(x_n\) by the internal P² quantile estimator
- Use \(E_0 = E_2\)
- For \(n \geq L,\; n \bmod L \neq 1\):
- Process \(x_n\) by the internal P² quantile estimator
- Assign \(k = \big( (n + L - 1) \bmod L \big) + 1\)
- Use \(E_0 = \big( (L-k) \cdot E_1 + k \cdot E_2 \big) / L\)
- For \(n \geq L,\; n \bmod L = 1\):
- Assign \(E_1 = E_2\)
- Restore the internal P² quantile estimator to the initial state
- Process \(x_n\) by the internal P² quantile estimator
- Use \(E_0 = \big( (L-k) \cdot E_1 + k \cdot E_2 \big) / L\)
Numerical simulations
One of my favorite data sets for testing moving median estimators is a noisy sine wave pattern with high outliers.1 Let’s try to use the MP² quantile estimator with such a data set (the source code is here):
Here we can do the following observations:
- The moving median correctly shows the primary sine wave trend; it’s not affected by the outliers.
- The estimations of the MP² quantile estimator (the green “MP²” line in the bottom chart) are pretty close to the true values of the moving median (the orange “True” line in the bottom chart).
- On the borders between considered fixed-offset windows, the MP² estimations and the true values are almost synced. These points correspond to the native accuracy of the P² quantile estimator. We have some deviations between these points, but they are usually minor.
Currently, I don’t have the accuracy estimations for the suggested estimator. However, it shows decent results in multiple numerical simulations on synthetic and real data sets.
Reference implementation
If you use C#, you can take an implementation from
the latest nightly version (0.3.0-nightly.82+) of Perfolizer
(you need MovingP2QuantileEstimator
).
Below you can find a standalone copy-pastable implementation of the suggested estimator.
public class P2QuantileEstimator
{
private readonly double p;
private readonly int[] n = new int[5];
private readonly double[] ns = new double[5];
private readonly double[] dns = new double[5];
private readonly double[] q = new double[5];
private int count;
public P2QuantileEstimator(double probability)
{
p = probability;
}
public void Add(double value)
{
if (count < 5)
{
q[count++] = value;
if (count == 5)
{
Array.Sort(q);
for (int i = 0; i < 5; i++)
n[i] = i;
ns[0] = 0;
ns[1] = 2 * p;
ns[2] = 4 * p;
ns[3] = 2 + 2 * p;
ns[4] = 4;
dns[0] = 0;
dns[1] = p / 2;
dns[2] = p;
dns[3] = (1 + p) / 2;
dns[4] = 1;
}
return;
}
int k;
if (value < q[0])
{
q[0] = value;
k = 0;
}
else if (value < q[1])
k = 0;
else if (value < q[2])
k = 1;
else if (value < q[3])
k = 2;
else if (value < q[4])
k = 3;
else
{
q[4] = value;
k = 3;
}
for (int i = k + 1; i < 5; i++)
n[i]++;
for (int i = 0; i < 5; i++)
ns[i] += dns[i];
for (int i = 1; i <= 3; i++)
{
double d = ns[i] - n[i];
if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
{
int dInt = Math.Sign(d);
double qs = Parabolic(i, dInt);
if (q[i - 1] < qs && qs < q[i + 1])
q[i] = qs;
else
q[i] = Linear(i, dInt);
n[i] += dInt;
}
}
count++;
}
private double Parabolic(int i, double d)
{
return q[i] + d / (n[i + 1] - n[i - 1]) * (
(n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
(n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
);
}
private double Linear(int i, int d)
{
return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
}
public double GetQuantile()
{
if (count == 0)
throw new IndexOutOfRangeException("There are no any values");
if (count <= 5)
{
Array.Sort(q, 0, count);
int index = (int) Math.Round((count - 1) * p);
return q[index];
}
return q[2];
}
public void Clear()
{
count = 0;
}
}
public class MovingP2QuantileEstimator
{
private readonly P2QuantileEstimator estimator;
private readonly int windowSize;
private int n;
private double previousWindowEstimation;
public MovingP2QuantileEstimator(double probability, int windowSize)
{
this.windowSize = windowSize;
estimator = new P2QuantileEstimator(probability);
}
public void Add(double value)
{
n++;
if (n % windowSize == 0)
{
previousWindowEstimation = estimator.GetQuantile();
estimator.Clear();
}
estimator.Add(value);
}
public double GetQuantile()
{
if (n == 0)
throw new IndexOutOfRangeException("There are no values");
if (n < windowSize)
return estimator.GetQuantile();
double estimation1 = previousWindowEstimation;
double estimation2 = estimator.GetQuantile();
double w2 = (n % windowSize + 1) * 1.0 / windowSize;
double w1 = 1.0 - w2;
return w1 * estimation1 + w2 * estimation2;
}
}
Conclusion
Here are the main characteristics of the presented MP² quantile estimator:
- Memory: \(O(1)\).
It requires only three additional variables in comparison with the P² quantile estimator (the window size \(L\), the number of processed elements \(n\), the quantile estimation for the previous window \(E_1\)). - Element processing complexity: \(O(1)\).
The algorithm involves a small number of simple arithmetic operations that don’t depend on the window size. - Quantile estimation complexity: \(O(1)\).
Since we always have ready \(E_1\) and \(E_2\) estimations, we have to just evaluate a simple formula.
The accuracy is not perfect, but it should be acceptable in most real-life scenarios.
The MP² quantile estimator may be useful in software performance telemetry. It provides a fast approach to estimate the median (and other quantiles) of collected performance measurements without noticeable memory overhead.
There are to reasons to choose a noisy sine wave pattern with high outliers as a data set for simulations:
- One of the best advantages of the median against other measures of average is its robustness. This means that noise and outliers shouldn’t have a major impact on the median value. Thus, it makes sense to add noise/outliers in the sample.
- The goal of using the moving median is to quickly adapt for changes in a time series. Thus, it makes sense to use a mix of ascent and descent fragments in the sample. The sine wave is one of the simplest function which has this property.