A Single Outlier Could Completely Distort Your Cohen's D Value

Cohen’s d is a popular way to estimate the effect size between two samples. It works excellent for perfectly normal distributions. Usually, people think that slight deviations from normality shouldn’t produce a noticeable impact on the result. Unfortunately, it’s not always true. In fact, a single outlier value can completely distort the result even in large samples.

In this post, I will present some illustrations for this problem and will show how to fix it.

Read more

Better Moving Quantile Estimations Using the Partitioning Heaps

In one of the previous posts, I have discussed the Hardle-Steiger method. This algorithm allows estimating the moving median using $O(L)$ memory and $O(log(L))$ element processing complexity (where $L$ is the window size). Also, I have shown how to adapt this approach to estimate any moving quantile.

In this post, I’m going to present further improvements. The Hardle-Steiger method always returns the order statistics which is the $k\textrm{th}$ smallest element from the sample. It means that the estimated quantile value always equals one of the last $L$ observed numbers. However, many of the classic quantile estimators use two elements. For example, if we want to estimate the median for $x = \{4, 5, 6, 7\}$, some estimators return $5.5$ (which is the arithmetical mean of $5$ and $6$) instead of $5$ or $6$ (which are order statistics).

Let’s learn how to implement a moving version of such estimators using the partitioning heaps from the Hardle-Steiger method.

Read more

MP² Quantile Estimator: Estimating the Moving Median without Storing Values

In one of the previous posts, I described the P² quantile estimator. It allows estimating quantiles on a stream of numbers without storing them. Such sequential (streaming/online) quantile estimators are useful in software telemetry because they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.

After the publication, I got a lot of questions about moving sequential quantile estimators. Such estimators return quantile values not for the whole stream of numbers, but only for the recent values. So, I wrote another post about a quantile estimator based on a partitioning heaps (inspired by the Hardle-Steiger method). This algorithm gives you the exact value of any order statistics for the last $L$ numbers ($L$ is known as the window size). However, it requires $O(L)$ memory, and it takes $O(log(L))$ time to process each element. This may be acceptable in some cases. Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large $L$.

In this post, I’m going to present a moving modification of the P² quantile estimator. Let’s call it MP² (moving P²). It requires $O(1)$ memory, it takes $O(1)$ to process each element, and it supports windows of any size. Of course, we have a trade-off with the estimation accuracy: it returns a quantile approximation instead of the exact order statistics. However, in most cases, the MP² estimations are pretty accurate from the practical point of view.

Let’s discuss MP² in detail!

Read more

Case Study: Accuracy of the MAD Estimation Using the Harrell-Davis Quantile Estimator (Gumbel Distribution)

In some of my previous posts, I used the median absolute deviation (MAD) to describe the distribution dispersion:

The MAD estimation depends on the chosen median estimator: we may get different MAD values with different median estimators. To get better accuracy, I always encourage readers to use the Harrell-Davis quantile estimator instead of the classic Type 7 quantile estimator.

In this case study, I decided to compare these two quantile estimators using the Gumbel distribution (it’s a good model for slightly right-skewed distributions). According to the performed Monte Carlo simulation, the Harrell-Davis quantile estimator always has better accuracy:

Read more

Fast Implementation of the Moving Quantile Based on the Partitioning Heaps

Imagine you have a time series. Let’s say, after each new observation, you want to know an “average” value across the last $L$ observations. Such a metric is known as a moving average (or rolling/running average).

The most popular moving average example is the moving mean. It’s easy to efficiently implement this metric. However, it has a major drawback: it’s not robust. Outliers can easily spoil the moving mean and transform it into a meaningless and untrustable metric.

Fortunately, we have a good alternative: the moving median. Typically, it generates a stable and smooth series of values. In the below figure, you can see the difference between the moving mean and the moving median on noisy data.

The moving median also has a drawback: it’s not easy to efficiently implement it. Today we going to discuss the Hardle-Steiger method to estimate the median (memory: $O(L)$, element processing complexity: $O(log(L))$, median estimating complexity: $O(1)$). Also, we will learn how to calculate the moving quantiles based on this method.

In this post, you will find the following:

  • An overview of the Hardle-Steiger method
  • A simple way to implement the Hardle-Steiger method
  • Moving quantiles inspired by the Hardle-Steiger method
  • How to process initial elements
  • Reference C# implementation
Read more

Coverage of Quantile Confidence Intervals

There is a common misunderstanding that a 95% confidence interval is an interval that covers the true parameter value with 95% probability. Meanwhile, the correct definition assumes that the true parameter value will be covered by 95% of 95% confidence intervals in the long run. These two statements sound similar, but there is a huge difference between them. 95% in this context is not a property of a single confidence interval. Once you get a calculated interval, it may cover the true value (100% probability) or it may don’t cover it (0% probability). In fact, 95% is a prediction about the percentage of future confidence intervals that cover the true value in the long run.

However, even if you know the correct definition, you still may experience some troubles. The first thing people usually forgot is the “long run” part. For example, if we collected 100 samples and calculated a 95% confidence interval of a parameter for each of them, we shouldn’t expect that 95 of these intervals cover the true parameter value. In fact, we can observe a situation when none of these intervals covers the true value. Of course, this is an unlikely event, but if you automatically perform thousands of different experiments, you will definitely get some extreme situations.

The second thing that may create trouble is the “prediction” part. If weather forecasters predicted that it will rain tomorrow, this does not mean that it will rain tomorrow. The same works for statistical predictions. The actual prediction reliability may depend on many factors. If you estimate confidence intervals around the mean for the normal distribution, you are most likely safe. However, if you estimate confidence intervals around quantiles for non-parametric distributions, you should care about the following things:

  • The used approach to estimate confidence intervals
  • The underlying distribution
  • The sample size
  • The position of the target quantile

I have already showed how to estimate the confidence interval around the given quantile using the Maritz-Jarrett method. It’s time to verify the reliability of this approach. In this post, I’m going to show some Monte-Carlo simulations that evaluate the coverage percentage in different situations.

Read more

Statistical Approaches for Performance Analysis

Date Tags

Software performance is a complex discipline that requires knowledge in different areas from benchmarking to the internals of modern runtimes, operating systems, and hardware. Surprisingly, the most difficult challenges in performance analysis are not about programming, they are about mathematical statistics!

Many software developers can drill into performance problems and implement excellent optimizations, but they are not always know how to correctly verify these optimizations. This may not look like a problem in the case of a single performance investigation. However, the situation became worse when developers try to set up an infrastructure that should automatically find performance problems or prevent degradations from merging. In order to make such an infrastructure reliable and useful, it’s crucial to achieve an extremely low false-positive rate (otherwise, it’s not trustable) and be able to detect most of the degradations (otherwise, it’s not so useful). It’s not easy if you don’t know which statistical approaches should be used. If you try to google it, you may find thousands of papers about statistics, but only a small portion of them really works in practice.

In this post, I want to share some approaches that I use for performance analysis in everyday life. I have been analyzing performance distributions for the last seven years, and I have found a lot of approaches, metrics, and tricks which nice to have in your statistical toolbox. I would not say that all of them are must have to know, but they can definitely help you to improve the reliability of your statistical checks in different problems of performance analysis. Consider the below list as a letter to a younger version of myself with a brief list of topics that are good to learn.

Read more

Quantile Confidence Intervals for Weighted Samples

Update 2021-07-06: the approach was updated using the Kish’s effective sample size.

When you work with non-parametric distributions, quantile estimations are essential to get the main distribution properties. Once you get the estimation values, you may be interested in measuring the accuracy of these estimations. Without it, it’s hard to understand how trustable the obtained values are. One of the most popular ways to evaluate accuracy is confidence interval estimation.

Now imagine that you collect some measurements every day. Each day you get a small sample of values that is not enough to get the accurate daily quantile estimations. However, the full time-series over the last several weeks has a decent size. You suspect that past measurements should be similar to today measurements, but you are not 100% sure about it. You feel a temptation to extend the up-to-date sample by the previously collected values, but it may spoil the estimation (e.g., in the case of recent change points or positive/negative trends).

One of the possible approaches in this situation is to use weighted samples. This assumes that we add past measurements to the “today sample,” but these values should have smaller weight. The older measurement we take, the smaller weight it gets. If you have consistent values across the last several days, this approach works like a charm. If you have any recent changes, you can detect such situations by huge confidence intervals due to the sample inconsistency.

So, how do we estimate confidence intervals around quantiles for the weighted samples? In one of the previous posts, I have already shown how to estimate quantiles on weighted samples. In this post, I will show how to estimate quantile confidence intervals for weighted samples.

Read more

Quantile Absolute Deviation: Estimating Statistical Dispersion around Quantiles

There are many different metrics for statistical dispersion. The most famous one is the standard deviation. The standard deviation is the most popular way to describe the spread around the mean when you work with normally distributed data. However, if you work with non-normal distributions, this metric may be misleading.

In the world of non-parametric distributions, the most common measure of central tendency is the median. For the median, you can describe dispersion using the median absolute deviation around the median (MAD). It works great if the median is the only summary statistic that you care about. However, if you work with multimodal distributions (they can be detected using the lowland multimodality detector), you may be interested in other quantiles as well. So, it makes sense to learn how to describe dispersion around the given quantile. Which metric should we choose?

Recently, I came up with a great solution to this problem. We can generalize the median absolute deviation into the quantile absolute deviation (QAD) around the given quantile based on the Harrell-Davis quantile estimator. I will show how to calculate it, how to interpret it, and how to get insights about distribution properties from images like this one:

Read more

P² Quantile Estimator: Estimating the Median without Storing Values

Update: the estimator accuracy could be improved using a bunch of patches.

Imagine that you are implementing performance telemetry in your application. There is an operation that is executed millions of times, and you want to get its “average” duration. It’s not a good idea to use the arithmetic mean because the obtained value can be easily spoiled by outliers. It’s much better to use the median which is one of the most robust ways to describe the average.

The straightforward median estimation approach requires storing all the values. In our case, it’s a bad idea to keep all the values because it will significantly increase the memory footprint. Such telemetry is harmful because it may become a new bottleneck instead of monitoring the actual performance.

Another way to get the median value is to use a sequential quantile estimator (also known as an online quantile estimator or a streaming quantile estimator). This is an algorithm that allows calculating the median value (or any other quantile value) using a fixed amount of memory. Of course, it provides only an approximation of the real median value, but it’s usually enough for typical telemetry use cases.

In this post, I will show one of the simplest sequential quantile estimators that is called the P² quantile estimator (or the Piecewise-Parabolic quantile estimator).

Read more