Quantile-respectful density estimation based on the Harrell-Davis quantile estimator

The idea of this post was born when I was working on a presentation for my recent DotNext talk. It had a slide with a density plot like this:

Here we can see a density plot based on a sample with highlighted decile locations that split the plot into 10 equal parts. Before the conference, I have been reviewed by @VladimirSitnikv. He raised a reasonable concern: it doesn’t look like all the density plot segments are equal and contain exactly 10% of the whole plot. And he was right!

However, I didn’t make any miscalculations. I generated a real sample with 61 elements. Next, I build a density plot with the kernel density estimation (KDE) using the Sheather & Jones method and the normal kernel. Next, I calculated decile values using the Harrell-Davis quantile estimator. Although both the density plot and the decile values are calculated correctly and consistent with the sample, they are not consistent with each other! Indeed, such a density plot is just an estimation of the underlying distribution. It has its own decile values, which are not equal to the sample decile values regardless of the used quantile estimator. This problem is common for different kinds of visualization that presents density and quantiles at the same time (e.g., violin plots)

It leads us to a question: how should we present the shape of our data together with quantile values without confusing inconsistency in the final image? Today I will present a good solution: we should use the quantile-respectful density estimation based on the Harrell-Davis quantile estimator! I know the title is a bit long, but it’s not so complicated as it sounds. In this post, I will show how to build such plots. Also I will compare them to the classic histograms and kernel density estimations. As a bonus, I will demonstrate how awesome these plots are for multimodality detection.

Read more

Misleading histograms

Below you see two histograms. What could you say about them?

Most likely, you say that the first histogram is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both histograms are based on the same sample:

20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13

Thus, the only difference between histograms is the offset!

Visualization is a simple way to understand the shape of your data. Unfortunately, this way may easily become a slippery slope. In the previous post, I have shown how density plots may deceive you when the bandwidth is poorly chosen. Today, we talk about histograms and why you can’t trust them in the general case.

Read more

The importance of kernel density estimation bandwidth

Below see two kernel density estimations. What could you say about them?

Most likely, you say that the first plot is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both density plots are based on the same sample:

21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03

The only difference between plots is in bandwidth selection!

Bandwidth selection is crucial when you are trying to visualize your distributions. Unfortunately, most people just call a regular function to build a density plot and don’t think about how the bandwidth will be chosen. As a result, the plot may present data in the wrong way, which may lead to incorrect conclusions. Let’s discuss bandwidth selection in detail and figure out how to improve the correctness of your density plots. In this post, we will cover the following topics:

  • Kernel density estimation
  • How bandwidth selection affects plot smoothness
  • Which bandwidth selectors can we use
  • Which bandwidth selectors should we use
  • Insidious default bandwidth selectors in statistical packages

Read more

The median absolute deviation value of the Gumbel distribution

The Gumbel distribution is not only a useful model in the extreme value theory, but it’s also a nice example of a slightly right-skewed distribution (skewness \(\approx 1.14\)). Here is its density plot:

In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests. I also prefer the median absolute deviation (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of non-parametric distributions. Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution. I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself. In this post, you will find a short derivation and the result (spoiler: the exact value is 0.767049251325708 * β). The general approach of the MAD derivation is common for most distributions, so it can be easily reused.

Read more

Weighted quantile estimators

Update 2021-07-06: the approach was updated using the Kish’s effective sample size.

In this post, I will show how to calculate weighted quantile estimates and how to use them in practice.

Let’s start with a problem from real life. Imagine that you measure the total duration of a unit test executed daily on a CI server. Every day you get a single number that corresponds to the test duration from the latest revision for this day:

You collect a history of such measurements for 100 days. Now you want to describe the “actual” distribution of the performance measurements.

However, for the latest “actual” revision, you have only a single measurement, which is not enough to build a distribution. Also, you can’t build a distribution based on the last N measurements because they can contain change points that will spoil your results. So, what you really want to do is to use all the measurements, but older values should have a lower impact on the final distribution form.

Such a problem can be solved using the weighted quantiles! This powerful approach can be applied to any time series regardless of the domain area. In this post, we learn how to calculate and apply weighted quantiles.

Read more

Nonparametric Cohen's d-consistent effect size

Update: the second part of this post is available here.

The effect size is a common way to describe a difference between two distributions. When these distributions are normal, one of the most popular approaches to express the effect size is Cohen’s d. Unfortunately, it doesn’t work great for non-normal distributions.

In this post, I will show a robust Cohen’s d-consistent effect size formula for nonparametric distributions.

Read more

DoubleMAD outlier detector based on the Harrell-Davis quantile estimator

Outlier detection is an important step in data processing. Unfortunately, if the distribution is not normal (e.g., right-skewed and heavy-tailed), it’s hard to choose a robust outlier detection algorithm that will not be affected by tricky distribution properties. During the last several years, I tried many different approaches, but I was not satisfied with their results. Finally, I found an algorithm to which I have (almost) no complaints. It’s based on the double median absolute deviation and the Harrell-Davis quantile estimator. In this post, I will show how it works and why it’s better than some other approaches.

Read more

How ListSeparator Depends on Runtime and Operating System


This blog post was originally posted on JetBrains .NET blog.

In the two previous blog posts from this series, we discussed how socket errors and socket orders depend on the runtime and operating systems. For some, it may be obvious that some things are indeed specific to the operating system or the runtime, but often these issues come as a surprise and are only discovered when running our code on different systems. An interesting example that may bite us at runtime is using ListSeparator in our code. It should give us a common separator for list elements in a string. But is it really common? Let’s start our investigation by printing ListSeparator for the Russian language:

Console.WriteLine(new CultureInfo("ru-ru").TextInfo.ListSeparator);

On Windows, you will get the same result for .NET Framework, .NET Core, and Mono: the ListSeparator is ; (a semicolon). You will also get a semicolon on Mono+Unix. However, on .NET Core+Unix, you will get a non-breaking space.

Read more

How Sorting Order Depends on Runtime and Operating System


This blog post was originally posted on JetBrains .NET blog.

In Rider, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files: jquery-1.4.1.js, jquery-1.4.1.min.js, jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:


Read more

How Socket Error Codes Depend on Runtime and Operating System


This blog post was originally posted on JetBrains .NET blog.

Rider consists of several processes that send messages to each other via sockets. To ensure the reliability of the whole application, it’s important to properly handle all the socket errors. In our codebase, we had the following code which was adopted from Mono Debugger Libs and helps us communicate with debugger processes:

protected virtual bool ShouldRetryConnection (Exception ex, int attemptNumber)
    var sx = ex as SocketException;
    if (sx != null) {
        if (sx.ErrorCode == 10061) //connection refused
            return true;
    return false;

In the case of a failed connection because of a “ConnectionRefused” error, we are retrying the connection attempt. It works fine with .NET Framework and Mono. However, once we migrated to .NET Core, this method no longer correctly detects the “connection refused” situation on Linux and macOS. If we open the SocketException documentation, we will learn that this class has three different properties with error codes:

  • SocketError SocketErrorCode: Gets the error code that is associated with this exception.
  • int ErrorCode: Gets the error code that is associated with this exception.
  • int NativeErrorCode: Gets the Win32 error code associated with this exception.
What's the difference between these properties? Should we expect different values on different runtimes or different operating systems? Which one should we use in production? Why do we have problems with ShouldRetryConnection on .NET Core? Let's figure it all out!
Read more