Multimodality is an essential feature of a distribution, which may create many troubles during automatic analysis. One of the best ways to work with such distributions is to detect all the modes in advance based on the given samples. Unfortunately, this problem is much harder than it looks like.
I tried many different approaches for multimodality detection, but none of them was good enough. During the past several years, my approach of choice was the mvalue-based modal test by Brendan Gregg. It works nicely in simple cases, but I was constantly stumbling over noisy samples where this algorithm doesn’t produce reliable results. Also, it has some limitations that make it unapplicable to some corner cases.
So, I needed a better approach. Here are my main requirements:
- It should detect the exact mode locations and ranges
- It should provide reliable results even on noisy samples
- It should be able to detect multimodality even when some modes are extremely close to each other
- It should work out of the box without tricky parameter tuning for each specific distribution
I failed to find such an algorithm anywhere, so I came up with my own! The current working title is “the lowland multimodality detector.” It takes an estimation of the probability density function (PDF) and tries to find “lowlands” (areas that are much lower than neighboring peaks). Next, it splits the plot by these lowlands and detects modes between them. For the PDF estimation, it uses the quantile-respectful density estimation based on the Harrell-Davis quantile estimator (QRDE-HD). Let me explain how it works in detail.
Here we can see a density plot based on a sample with highlighted decile locations that split the plot into 10 equal parts. Before the conference, I have been reviewed by @VladimirSitnikv. He raised a reasonable concern: it doesn’t look like all the density plot segments are equal and contain exactly 10% of the whole plot. And he was right!
However, I didn’t make any miscalculations. I generated a real sample with 61 elements. Next, I build a density plot with the kernel density estimation (KDE) using the Sheather & Jones method and the normal kernel. Next, I calculated decile values using the Harrell-Davis quantile estimator. Although both the density plot and the decile values are calculated correctly and consistent with the sample, they are not consistent with each other! Indeed, such a density plot is just an estimation of the underlying distribution. It has its own decile values, which are not equal to the sample decile values regardless of the used quantile estimator. This problem is common for different kinds of visualization that presents density and quantiles at the same time (e.g., violin plots)
It leads us to a question: how should we present the shape of our data together with quantile values without confusing inconsistency in the final image? Today I will present a good solution: we should use the quantile-respectful density estimation based on the Harrell-Davis quantile estimator! I know the title is a bit long, but it’s not so complicated as it sounds. In this post, I will show how to build such plots. Also I will compare them to the classic histograms and kernel density estimations. As a bonus, I will demonstrate how awesome these plots are for multimodality detection.
Below you see two histograms. What could you say about them?
Most likely, you say that the first histogram is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both histograms are based on the same sample:
20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06, 29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07, 40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14, 49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13
Thus, the only difference between histograms is the offset!
Visualization is a simple way to understand the shape of your data. Unfortunately, this way may easily become a slippery slope. In the previous post, I have shown how density plots may deceive you when the bandwidth is poorly chosen. Today, we talk about histograms and why you can’t trust them in the general case.
Below see two kernel density estimations. What could you say about them?
Most likely, you say that the first plot is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both density plots are based on the same sample:
21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93, 31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32, 39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36, 50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03
The only difference between plots is in bandwidth selection!
Bandwidth selection is crucial when you are trying to visualize your distributions. Unfortunately, most people just call a regular function to build a density plot and don’t think about how the bandwidth will be chosen. As a result, the plot may present data in the wrong way, which may lead to incorrect conclusions. Let’s discuss bandwidth selection in detail and figure out how to improve the correctness of your density plots. In this post, we will cover the following topics:
- Kernel density estimation
- How bandwidth selection affects plot smoothness
- Which bandwidth selectors can we use
- Which bandwidth selectors should we use
- Insidious default bandwidth selectors in statistical packages
The Gumbel distribution is not only a useful model in the extreme value theory, but it’s also a nice example of a slightly right-skewed distribution (skewness \(\approx 1.14\)). Here is its density plot:
In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests.
I also prefer the median absolute deviation (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of non-parametric distributions.
Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution.
I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself.
In this post, you will find a short derivation and the result (spoiler: the exact value is
0.767049251325708 * β).
The general approach of the MAD derivation is common for most distributions, so it can be easily reused.
In this post, I will show how to calculate weighted quantile estimates and how to use them in practice.
Let’s start with a problem from real life. Imagine that you measure the total duration of a unit test executed daily on a CI server. Every day you get a single number that corresponds to the test duration from the latest revision for this day:
You collect a history of such measurements for 100 days. Now you want to describe the “actual” distribution of the performance measurements.
However, for the latest “actual” revision, you have only a single measurement, which is not enough to build a distribution. Also, you can’t build a distribution based on the last N measurements because they can contain change points that will spoil your results. So, what you really want to do is to use all the measurements, but older values should have a lower impact on the final distribution form.
Such a problem can be solved using the weighted quantiles! This powerful approach can be applied to any time series regardless of the domain area. In this post, we learn how to calculate and apply weighted quantiles.
The effect size is a common way to describe a difference between two distributions. When these distributions are normal, one of the most popular approaches to express the effect size is Cohen’s d. Unfortunately, it doesn’t work great for non-normal distributions.
In this post, I will show a robust Cohen’s d-consistent effect size formula for nonparametric distributions.
Outlier detection is an important step in data processing. Unfortunately, if the distribution is not normal (e.g., right-skewed and heavy-tailed), it’s hard to choose a robust outlier detection algorithm that will not be affected by tricky distribution properties. During the last several years, I tried many different approaches, but I was not satisfied with their results. Finally, I found an algorithm to which I have (almost) no complaints. It’s based on the double median absolute deviation and the Harrell-Davis quantile estimator. In this post, I will show how it works and why it’s better than some other approaches.
In the two previous blog posts from this series, we discussed how socket errors and socket orders depend on the runtime and operating systems. For some, it may be obvious that some things are indeed specific to the operating system or the runtime, but often these issues come as a surprise and are only discovered when running our code on different systems.
An interesting example that may bite us at runtime is using
ListSeparator in our code. It should give us a common separator for list elements in a string. But is it really common?
Let’s start our investigation by printing
ListSeparator for the Russian language:
On Windows, you will get the same result for .NET Framework, .NET Core, and Mono: the
; (a semicolon). You will also get a semicolon on Mono+Unix. However, on .NET Core+Unix, you will get a non-breaking space.
In Rider, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files:
jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:
jquery-1.4.1.js jquery-1.4.1.min.js jquery-1.4.1-vsdoc.js