Coverage of quantile confidence intervals

There is a common misunderstanding that a 95% confidence interval is an interval that covers the true parameter value with 95% probability. Meanwhile, the correct definition assumes that the true parameter value will be covered by 95% of 95% confidence intervals in the long run. These two statements sound similar, but there is a huge difference between them. 95% in this context is not a property of a single confidence interval. Once you get a calculated interval, it may cover the true value (100% probability) or it may don’t cover it (0% probability). In fact, 95% is a prediction about the percentage of future confidence intervals that cover the true value in the long run.

However, even if you know the correct definition, you still may experience some troubles. The first thing people usually forgot is the “long run” part. For example, if we collected 100 samples and calculated a 95% confidence interval of a parameter for each of them, we shouldn’t expect that 95 of these intervals cover the true parameter value. In fact, we can observe a situation when none of these intervals covers the true value. Of course, this is an unlikely event, but if you automatically perform thousands of different experiments, you will definitely get some extreme situations.

The second thing that may create trouble is the “prediction” part. If weather forecasters predicted that it will rain tomorrow, this does not mean that it will rain tomorrow. The same works for statistical predictions. The actual prediction reliability may depend on many factors. If you estimate confidence intervals around the mean for the normal distribution, you are most likely safe. However, if you estimate confidence intervals around quantiles for non-parametric distributions, you should care about the following things:

  • The used approach to estimate confidence intervals
  • The underlying distribution
  • The sample size
  • The position of the target quantile

I have already showed how to estimate the confidence interval around the given quantile using the Maritz-Jarrett method. It’s time to verify the reliability of this approach. In this post, I’m going to show some Monte-Carlo simulations that evaluate the coverage percentage in different situations.

Read more

Statistical approaches for performance analysis

Software performance is a complex discipline that requires knowledge in different areas from benchmarking to the internals of modern runtimes, operating systems, and hardware. Surprisingly, the most difficult challenges in performance analysis are not about programming, they are about mathematical statistics!

Many software developers can drill into performance problems and implement excellent optimizations, but they are not always know how to correctly verify these optimizations. This may not look like a problem in the case of a single performance investigation. However, the situation became worse when developers try to set up an infrastructure that should automatically find performance problems or prevent degradations from merging. In order to make such an infrastructure reliable and useful, it’s crucial to achieve an extremely low false-positive rate (otherwise, it’s not trustable) and be able to detect most of the degradations (otherwise, it’s not so useful). It’s not easy if you don’t know which statistical approaches should be used. If you try to google it, you may find thousands of papers about statistics, but only a small portion of them really works in practice.

In this post, I want to share some approaches that I use for performance analysis in everyday life. I have been analyzing performance distributions for the last seven years, and I have found a lot of approaches, metrics, and tricks which nice to have in your statistical toolbox. I would not say that all of them are must have to know, but they can definitely help you to improve the reliability of your statistical checks in different problems of performance analysis. Consider the below list as a letter to a younger version of myself with a brief list of topics that are good to learn.

Read more

Quantile confidence intervals for weighted samples

When you work with non-parametric distributions, quantile estimations are essential to get the main distribution properties. Once you get the estimation values, you may be interested in measuring the accuracy of these estimations. Without it, it’s hard to understand how trustable the obtained values are. One of the most popular ways to evaluate accuracy is confidence interval estimation.

Now imagine that you collect some measurements every day. Each day you get a small sample of values that is not enough to get the accurate daily quantile estimations. However, the full time-series over the last several weeks has a decent size. You suspect that past measurements should be similar to today measurements, but you are not 100% sure about it. You feel a temptation to extend the up-to-date sample by the previously collected values, but it may spoil the estimation (e.g., in the case of recent change points or positive/negative trends).

One of the possible approaches in this situation is to use weighted samples. This assumes that we add past measurements to the “today sample,” but these values should have smaller weight. The older measurement we take, the smaller weight it gets. If you have consistent values across the last several days, this approach works like a charm. If you have any recent changes, you can detect such situations by huge confidence intervals due to the sample inconsistency.

So, how do we estimate confidence intervals around quantiles for the weighted samples? In one of the previous posts, I have already shown how to estimate quantiles on weighted samples. In this post, I will show how to estimate quantile confidence intervals for weighted samples.

Read more

Quantile absolute deviation: estimating statistical dispersion around quantiles

There are many different metrics for statistical dispersion. The most famous one is the standard deviation. The standard deviation is the most popular way to describe the spread around the mean when you work with normally distributed data. However, if you work with non-normal distributions, this metric may be misleading.

In the world of non-parametric distributions, the most common measure of central tendency is the median. For the median, you can describe dispersion using the median absolute deviation around the median (MAD). It works great if the median is the only summary statistic that you care about. However, if you work with multimodal distributions (they can be detected using the lowland multimodality detector), you may be interested in other quantiles as well. So, it makes sense to learn how to describe dispersion around the given quantile. Which metric should we choose?

Recently, I came up with a great solution to this problem. We can generalize the median absolute deviation into the quantile absolute deviation (QAD) around the given quantile based on the Harrell-Davis quantile estimator. I will show how to calculate it, how to interpret it, and how to get insights about distribution properties from images like this one:

Read more

P² quantile estimator: estimating the median without storing values

Imagine that you are implementing performance telemetry in your application. There is an operation that is executed millions of times, and you want to get its “average” duration. It’s not a good idea to use the arithmetic mean because the obtained value can be easily spoiled by outliers. It’s much better to use the median which is one of the most robust ways to describe the average.

The straightforward median estimation approach requires storing all the values. In our case, it’s a bad idea to keep all the values because it will significantly increase the memory footprint. Such telemetry is harmful because it may become a new bottleneck instead of monitoring the actual performance.

Another way to get the median value is to use a sequential quantile estimator (also known as an online quantile estimator or a streaming quantile estimator). This is an algorithm that allows calculating the median value (or any other quantile value) using a fixed amount of memory. Of course, it provides only an approximation of the real median value, but it’s usually enough for typical telemetry use cases.

In this post, I will show one of the simplest sequential quantile estimators that is called the P² quantile estimator (or the Piecewise-Parabolic quantile estimator).

Read more

Plain-text summary notation for multimodal distributions

Let’s say you collected a lot of data and want to explore the underlying distributions of collected samples. If you have only a few distributions, the best way to do that is to look at the density plots (expressed via histograms, kernel density estimations, or quantile-respectful density estimations). However, it’s not always possible.

Suppose you have to process dozens, hundreds, or even thousands of distributions. In that case, it may be extremely time-consuming to manually check visualizations of each distribution. If you analyze distributions from the command line or send notifications about suspicious samples, it may be impossible to embed images in the reports. In these cases, there is a need to present a distribution using plain text.

One way to do that is plain text histograms. Unfortunately, this kind of visualization may occupy o lot of space. In complicated cases, you may need 20 or 30 lines per a single distribution.

Another way is to present classic summary statistics like mean or median, standard deviation or median absolute deviation, quantiles, skewness, and kurtosis. There is another problem here: without experience, it’s hard to reconstruct the true distribution shape based on these values. Even if you are an experienced researcher, the statistical metrics may become misleading in the case of multimodal distributions. Multimodality is one of the most severe challenges in distribution analysis because it distorts basic summary statistics. It’s important to not only find such distribution but also have a way to present brief information about multimodality effects.

So, how can we condense the underlying distribution shape of a given sample to a short text line? I didn’t manage to find an approach that works fine in my cases, so I came up with my own notation. Most of the interpretation problems in my experiments arise from multimodality and outliers, so I decided to focus on these two things and specifically highlight them. Let’s consider this plot:

I suggest describing it like this:

{1.00, 2.00} + [7.16; 13.12]_100 + {19.00} + [27.69; 32.34]_100 + {37.00..39.00}_3

Let me explain the suggested notation in detail.

Read more

Intermodal outliers

Outlier analysis is a typical step in distribution exploration. Usually, we work with the “lower outliers” (extremely low values) and the “upper outliers” (extremely high values). However, outliers are not always extreme values. In the general case, an outlier is a value that significantly differs from other values in the same sample. In the case of multimodal distribution, we can also consider outliers in the middle of the distribution. Let’s call such outliers that we found between modes the “intermodal outliers.”

Look at the above density plot. It’s a bimodal distribution that is formed as a combination of two unimodal distributions. Each of the unimodal distributions may have its own lower and upper outliers. When we merge them, the upper outliers of the first distribution and the lower outliers of the second distribution stop being lower or upper outliers. However, if these values don’t belong to the modes, they still are a subject of interest. In this post, I will show you how to detect such intermodal outliers and how they can be used to form a better distribution description.

Read more

Lowland multimodality detection

Multimodality is an essential feature of a distribution, which may create many troubles during automatic analysis. One of the best ways to work with such distributions is to detect all the modes in advance based on the given samples. Unfortunately, this problem is much harder than it looks like.

I tried many different approaches for multimodality detection, but none of them was good enough. During the past several years, my approach of choice was the mvalue-based modal test by Brendan Gregg. It works nicely in simple cases, but I was constantly stumbling over noisy samples where this algorithm doesn’t produce reliable results. Also, it has some limitations that make it unapplicable to some corner cases.

So, I needed a better approach. Here are my main requirements:

  • It should detect the exact mode locations and ranges
  • It should provide reliable results even on noisy samples
  • It should be able to detect multimodality even when some modes are extremely close to each other
  • It should work out of the box without tricky parameter tuning for each specific distribution

I failed to find such an algorithm anywhere, so I came up with my own! The current working title is “the lowland multimodality detector.” It takes an estimation of the probability density function (PDF) and tries to find “lowlands” (areas that are much lower than neighboring peaks). Next, it splits the plot by these lowlands and detects modes between them. For the PDF estimation, it uses the quantile-respectful density estimation based on the Harrell-Davis quantile estimator (QRDE-HD). Let me explain how it works in detail.

Read more

Quantile-respectful density estimation based on the Harrell-Davis quantile estimator

The idea of this post was born when I was working on a presentation for my recent DotNext talk. It had a slide with a density plot like this:

Here we can see a density plot based on a sample with highlighted decile locations that split the plot into 10 equal parts. Before the conference, I have been reviewed by @VladimirSitnikv. He raised a reasonable concern: it doesn’t look like all the density plot segments are equal and contain exactly 10% of the whole plot. And he was right!

However, I didn’t make any miscalculations. I generated a real sample with 61 elements. Next, I build a density plot with the kernel density estimation (KDE) using the Sheather & Jones method and the normal kernel. Next, I calculated decile values using the Harrell-Davis quantile estimator. Although both the density plot and the decile values are calculated correctly and consistent with the sample, they are not consistent with each other! Indeed, such a density plot is just an estimation of the underlying distribution. It has its own decile values, which are not equal to the sample decile values regardless of the used quantile estimator. This problem is common for different kinds of visualization that presents density and quantiles at the same time (e.g., violin plots)

It leads us to a question: how should we present the shape of our data together with quantile values without confusing inconsistency in the final image? Today I will present a good solution: we should use the quantile-respectful density estimation based on the Harrell-Davis quantile estimator! I know the title is a bit long, but it’s not so complicated as it sounds. In this post, I will show how to build such plots. Also I will compare them to the classic histograms and kernel density estimations. As a bonus, I will demonstrate how awesome these plots are for multimodality detection.

Read more

Misleading histograms

Below you see two histograms. What could you say about them?

Most likely, you say that the first histogram is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both histograms are based on the same sample:

20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13

Thus, the only difference between histograms is the offset!

Visualization is a simple way to understand the shape of your data. Unfortunately, this way may easily become a slippery slope. In the previous post, I have shown how density plots may deceive you when the bandwidth is poorly chosen. Today, we talk about histograms and why you can’t trust them in the general case.

Read more