MP² quantile estimator: estimating the moving median without storing values

Andrey Akinshin · 2021-01-12

In one of the previous posts, I described the P² quantile estimator. It allows estimating quantiles on a stream of numbers without storing them. Such sequential (streaming/online) quantile estimators are useful in software telemetry because they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.

After the publication, I got a lot of questions about moving sequential quantile estimators. Such estimators return quantile values not for the whole stream of numbers, but only for the recent values. So, I wrote another post about a quantile estimator based on a partitioning heaps (inspired by the Hardle-Steiger method). This algorithm gives you the exact value of any order statistics for the last $L$ numbers ($L$ is known as the window size). However, it requires $O(L)$ memory, and it takes $O(log(L))$ time to process each element. This may be acceptable in some cases. Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large $L$.

In this post, I’m going to present a moving modification of the P² quantile estimator. Let’s call it MP² (moving P²). It requires $O(1)$ memory, it takes $O(1)$ to process each element, and it supports windows of any size. Of course, we have a trade-off with the estimation accuracy: it returns a quantile approximation instead of the exact order statistics. However, in most cases, the MP² estimations are pretty accurate from the practical point of view.

Let’s discuss MP² in detail!

The idea

The approach behind the P² quantile estimator is elegant, so I recommend reading it first. However, it’s not required: we are going to use it as a black box. We are going to consider an approach that can be applied for “movification” of any sequential quantile estimator.

Let me show the main idea on a simple example with window size $L = 100$. Imagine that we processed the first 230 elements of our stream $x$:

We want to estimate the quantile value for the last $100$ elements. It gives us the following range (assuming one-based indexing):

$x_{131}..x_{230}$ (the “target” window).

Let’s build a quantile estimator that gives an estimation $E_0$ for the “target” window. Instead of direct quantile estimating inside this window, we are going to work with two non-overlapped fixed-offset windows:

$x_{101}..x_{200}$ (the “previous” window).
$x_{201}..x_{300}$ (the “current” window).

We assume that the quantile estimation for the previous window is known. For the current window, we maintain an independent P² quantile estimator that accumulate the observed values. Let’s introduce the following notation:

$E_1$: the known quantile estimation for the previous window.
$E_2$: the current quantile estimation for the current window based on the P² quantile estimator.
$k$: the number of processed elements inside the current window.

Now we can approximate the quantile value for the target window using the following equation:

$$ E_0 = \dfrac{(L-k) \cdot E_1 + k \cdot E_2}{L}. $$

The target estimation $E_0$ is a weighted sum of two existing estimates $E_1$ and $E_2$:

The weight of $E_1$ is $(L-k)/L$ because the previous window covers $L-k$ elements of the target window.
The weight of $E_2$ is $k/L$ because the current window covers $k$ elements of the target window.

For the above example, $k=30$ (because we have processed only $x_{201}..x_{230}$ from the current window), $L=100$. Thus, $E_0 = ((100-30) E_1 + 30E_2)/100 = 0.7E_1 + 0.3E_2$.

I hope you got the main idea. Now it’s time to formalize the algorithm. Let’s say, we process the $n^{\textrm{th}}$ element. Depending on $n$, we should do the following (assuming one-based indexing):

For $1 \leq n \leq L$:
- Process $x_n$ by the internal P² quantile estimator
- Use $E_0 = E_2$
For $n \geq L,\; n \bmod L \neq 1$:
- Process $x_n$ by the internal P² quantile estimator
- Assign $k = \big( (n + L - 1) \bmod L \big) + 1$
- Use $E_0 = \big( (L-k) \cdot E_1 + k \cdot E_2 \big) / L$
For $n \geq L,\; n \bmod L = 1$:
- Assign $E_1 = E_2$
- Restore the internal P² quantile estimator to the initial state
- Process $x_n$ by the internal P² quantile estimator
- Use $E_0 = \big( (L-k) \cdot E_1 + k \cdot E_2 \big) / L$

Numerical simulations

One of my favorite data sets for testing moving median estimators is a noisy sine wave pattern with high outliers.¹ Let’s try to use the MP² quantile estimator with such a data set (the source code is here):

Here we can do the following observations:

The moving median correctly shows the primary sine wave trend; it’s not affected by the outliers.
The estimations of the MP² quantile estimator (the green “MP²” line in the bottom chart) are pretty close to the true values of the moving median (the orange “True” line in the bottom chart).
On the borders between considered fixed-offset windows, the MP² estimations and the true values are almost synced. These points correspond to the native accuracy of the P² quantile estimator. We have some deviations between these points, but they are usually minor.

Currently, I don’t have the accuracy estimations for the suggested estimator. However, it shows decent results in multiple numerical simulations on synthetic and real data sets.

Reference implementation

If you use C#, you can take an implementation from the latest nightly version (0.3.0-nightly.82+) of Perfolizer (you need MovingP2QuantileEstimator).

Below you can find a standalone copy-pastable implementation of the suggested estimator.

public class P2QuantileEstimator
{
    private readonly double p;
    private readonly int[] n = new int[5];
    private readonly double[] ns = new double[5];
    private readonly double[] dns = new double[5];
    private readonly double[] q = new double[5];
    private int count;

    public P2QuantileEstimator(double probability)
    {
        p = probability;
    }

    public void Add(double value)
    {
        if (count < 5)
        {
            q[count++] = value;
            if (count == 5)
            {
                Array.Sort(q);

                for (int i = 0; i < 5; i++)
                    n[i] = i;

                ns[0] = 0;
                ns[1] = 2 * p;
                ns[2] = 4 * p;
                ns[3] = 2 + 2 * p;
                ns[4] = 4;

                dns[0] = 0;
                dns[1] = p / 2;
                dns[2] = p;
                dns[3] = (1 + p) / 2;
                dns[4] = 1;
            }

            return;
        }

        int k;
        if (value < q[0])
        {
            q[0] = value;
            k = 0;
        }
        else if (value < q[1])
            k = 0;
        else if (value < q[2])
            k = 1;
        else if (value < q[3])
            k = 2;
        else if (value < q[4])
            k = 3;
        else
        {
            q[4] = value;
            k = 3;
        }

        for (int i = k + 1; i < 5; i++)
            n[i]++;
        for (int i = 0; i < 5; i++)
            ns[i] += dns[i];

        for (int i = 1; i <= 3; i++)
        {
            double d = ns[i] - n[i];
            if (d >= 1 && n[i + 1] - n[i] > 1 || d <= -1 && n[i - 1] - n[i] < -1)
            {
                int dInt = Math.Sign(d);
                double qs = Parabolic(i, dInt);
                if (q[i - 1] < qs && qs < q[i + 1])
                    q[i] = qs;
                else
                    q[i] = Linear(i, dInt);
                n[i] += dInt;
            }
        }

        count++;
    }

    private double Parabolic(int i, double d)
    {
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
            (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        );
    }

    private double Linear(int i, int d)
    {
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i]);
    }

    public double GetQuantile()
    {
        if (count == 0)
            throw new IndexOutOfRangeException("There are no any values");
        if (count <= 5)
        {
            Array.Sort(q, 0, count);
            int index = (int) Math.Round((count - 1) * p);
            return q[index];
        }

        return q[2];
    }

    public void Clear()
    {
        count = 0;
    }
}

public class MovingP2QuantileEstimator
{
    private readonly P2QuantileEstimator estimator;
    private readonly int windowSize;
    private int n;
    private double previousWindowEstimation;

    public MovingP2QuantileEstimator(double probability, int windowSize)
    {
        this.windowSize = windowSize;
        estimator = new P2QuantileEstimator(probability);
    }

    public void Add(double value)
    {
        n++;
        if (n % windowSize == 0)
        {
            previousWindowEstimation = estimator.GetQuantile();
            estimator.Clear();
        }
        estimator.Add(value);
    }

    public double GetQuantile()
    {
        if (n == 0)
            throw new IndexOutOfRangeException("There are no values");
        if (n < windowSize)
            return estimator.GetQuantile();
        
        double estimation1 = previousWindowEstimation;
        double estimation2 = estimator.GetQuantile();
        double w2 = (n % windowSize + 1) * 1.0 / windowSize;
        double w1 = 1.0 - w2;
        return w1 * estimation1 + w2 * estimation2;
    }
}

Conclusion

Here are the main characteristics of the presented MP² quantile estimator:

Memory: $O(1)$.
It requires only three additional variables in comparison with the P² quantile estimator (the window size $L$, the number of processed elements $n$, the quantile estimation for the previous window $E_1$).
Element processing complexity: $O(1)$.
The algorithm involves a small number of simple arithmetic operations that don’t depend on the window size.
Quantile estimation complexity: $O(1)$.
Since we always have ready $E_1$ and $E_2$ estimations, we have to just evaluate a simple formula.

The accuracy is not perfect, but it should be acceptable in most real-life scenarios.

The MP² quantile estimator may be useful in software performance telemetry. It provides a fast approach to estimate the median (and other quantiles) of collected performance measurements without noticeable memory overhead.

There are to reasons to choose a noisy sine wave pattern with high outliers as a data set for simulations:
- One of the best advantages of the median against other measures of average is its robustness. This means that noise and outliers shouldn’t have a major impact on the median value. Thus, it makes sense to add noise/outliers in the sample.
- The goal of using the moving median is to quickly adapt for changes in a time series. Thus, it makes sense to use a mix of ascent and descent fragments in the sample. The sine wave is one of the simplest function which has this property.
↩︎