Misleading histograms



Below you see two histograms. What could you say about them?


Most likely, you would say that the first histogram is based on a uniform distribution, and the second one on a multimodal distribution with four modes. Although this is not obvious from the plots, both histograms are based on the same sample:

20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13

Thus, the only difference between the histograms is the bin offset!
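To reproduce the effect, here is a minimal C# sketch (an illustration only, not the code behind the plots above): it builds histogram bin counts for a similar four-cluster sample using the same bin width but two different bin offsets. The bin width of 5 and the offsets 0 and 2.5 are assumptions chosen for this illustration.

using System;
using System.Linq;

class HistogramOffsetDemo
{
    static int[] BinCounts(double[] values, double width, double offset)
    {
        // Map each value to a bin index relative to the chosen offset.
        var indices = values.Select(v => (int)Math.Floor((v - offset) / width)).ToArray();
        int min = indices.Min();
        var counts = new int[indices.Max() - min + 1];
        foreach (var i in indices)
            counts[i - min]++;
        return counts;
    }

    static void Main()
    {
        // Four tight clusters around 20, 30, 40, 50, similar to the sample above.
        var random = new Random(42);
        var sample = Enumerable.Range(0, 64)
            .Select(i => 20 + 10 * (i / 16) + (random.NextDouble() - 0.5) * 0.4)
            .ToArray();

        // Same data, same bin width, different offsets => very different shapes.
        Console.WriteLine(string.Join(" ", BinCounts(sample, 5, 0.0))); // roughly flat counts
        Console.WriteLine(string.Join(" ", BinCounts(sample, 5, 2.5))); // four isolated peaks
    }
}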

Visualization is a simple way to understand the shape of your data. Unfortunately, it can easily become a slippery slope. In the previous post, I showed how density plots may deceive you when the bandwidth is poorly chosen. Today, we talk about histograms and why you can’t trust them in the general case.


Read more


The importance of kernel density estimation bandwidth



Below you see two kernel density estimations. What could you say about them?


Most likely, you would say that the first plot is based on a uniform distribution, and the second one on a multimodal distribution with four modes. Although this is not obvious from the plots, both density plots are based on the same sample:

21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03

The only difference between the plots is the bandwidth!

Bandwidth selection is crucial when you are trying to visualize your distributions. Unfortunately, most people just call a regular function to build a density plot and don’t think about how the bandwidth will be chosen. As a result, the plot may present the data in the wrong way, which may lead to incorrect conclusions. Let’s discuss bandwidth selection in detail and figure out how to improve the correctness of your density plots (a small code sketch of the estimator follows the topic list below). In this post, we will cover the following topics:

  • Kernel density estimation
  • How bandwidth selection affects plot smoothness
  • Which bandwidth selectors can we use
  • Which bandwidth selectors should we use
  • Insidious default bandwidth selectors in statistical packages
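
Here is the sketch mentioned above (an illustration only, not the code from the post): a Gaussian kernel density estimate of a four-cluster sample, evaluated with a narrow and a wide bandwidth. The bandwidth values 1 and 10 are arbitrary and chosen only to make the contrast visible.

using System;
using System.Linq;

class KdeBandwidthDemo
{
    // Gaussian KDE: the average of kernels centered at the sample points.
    static double Kde(double[] sample, double x, double bandwidth)
    {
        return sample.Average(xi =>
        {
            double u = (x - xi) / bandwidth;
            return Math.Exp(-u * u / 2) / Math.Sqrt(2 * Math.PI);
        }) / bandwidth;
    }

    static void Main()
    {
        // Four clusters around 20, 30, 40, 50.
        var random = new Random(42);
        var sample = Enumerable.Range(0, 40)
            .Select(i => 20 + 10 * (i / 10) + (random.NextDouble() - 0.5) * 3)
            .ToArray();

        // Narrow bandwidth: the density between modes (x = 25) is much lower than at a mode (x = 30).
        // Wide bandwidth: both values are almost equal, and the four modes are smeared into a plateau.
        Console.WriteLine($"h = 1:  f(25) = {Kde(sample, 25, 1):F4}, f(30) = {Kde(sample, 30, 1):F4}");
        Console.WriteLine($"h = 10: f(25) = {Kde(sample, 25, 10):F4}, f(30) = {Kde(sample, 30, 10):F4}");
    }
}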

Read more


The median absolute deviation value of the Gumbel distribution



The Gumbel distribution is not only a useful model in extreme value theory, but also a nice example of a slightly right-skewed distribution (skewness \(\approx 1.14\)). Here is its density plot:


In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests. I also prefer the median absolute deviation (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of nonparametric distributions. Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution. I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself. In this post, you will find a short derivation and the result (spoiler: the exact value is \(0.767049251325708 \cdot \beta\)). The general approach to deriving the MAD is the same for most distributions, so it can easily be reused.
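The core of such a derivation fits in one equation. For a continuous distribution with median \(m\), the MAD is the solution \(t\) of \(F(m + t) - F(m - t) = 1/2\). For the Gumbel distribution, \(F(x) = \exp(-\exp(-(x - \mu)/\beta))\) and \(m = \mu - \beta \ln \ln 2\), so the location cancels out and only a numeric constant remains to be found:

\[
\exp\left(-\ln 2 \cdot e^{-t/\beta}\right) - \exp\left(-\ln 2 \cdot e^{t/\beta}\right) = \frac{1}{2},
\qquad \mathrm{MAD} = t \approx 0.7670\,\beta.
\]

Solving this equation numerically gives the constant quoted above; the post walks through the details.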


Read more


Weighted quantile estimators



In this post, I will show how to calculate weighted quantile estimates and how to use them in practice.

Let’s start with a problem from real life. Imagine that you measure the total duration of a unit test executed daily on a CI server. Every day you get a single number that corresponds to the test duration from the latest revision for this day:


You collect a history of such measurements for 100 days. Now you want to describe the “actual” distribution of the performance measurements.

However, for the latest “actual” revision, you have only a single measurement, which is not enough to build a distribution. You also can’t build a distribution from the last N measurements because they may contain change points that will spoil your results. So, what you really want is to use all the measurements, but let older values have a lower impact on the shape of the final distribution.

Such a problem can be solved using weighted quantiles! This powerful approach can be applied to any time series regardless of the domain. In this post, we will learn how to calculate and apply weighted quantiles.
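As a warm-up, here is a minimal sketch of one simple weighted quantile definition (not necessarily the estimator discussed in the post): sort the values, accumulate normalized weights, and return the first value whose cumulative weight reaches the requested probability. The exponentially decaying weights with a half-life of 10 days are an arbitrary choice for this example.

using System;
using System.Linq;

class WeightedQuantileSketch
{
    // Returns the smallest value whose cumulative normalized weight reaches p.
    static double WeightedQuantile(double[] values, double[] weights, double p)
    {
        var sorted = values.Zip(weights, (v, w) => (Value: v, Weight: w)).OrderBy(t => t.Value).ToArray();
        double total = sorted.Sum(t => t.Weight);
        double cumulative = 0;
        foreach (var (value, weight) in sorted)
        {
            cumulative += weight / total;
            if (cumulative >= p)
                return value;
        }
        return sorted.Last().Value;
    }

    static void Main()
    {
        // 100 daily test durations with a change point after day 70.
        var random = new Random(42);
        var durations = Enumerable.Range(0, 100)
            .Select(day => (day < 70 ? 100 : 120) + random.NextDouble() * 5)
            .ToArray();

        // Newer measurements get exponentially larger weights (half-life of 10 days).
        var weights = Enumerable.Range(0, 100)
            .Select(day => Math.Pow(2, (day - 99) / 10.0))
            .ToArray();

        var equalWeights = Enumerable.Repeat(1.0, 100).ToArray();
        // The unweighted median lands among the old measurements, the weighted one among the new.
        Console.WriteLine($"Unweighted median: {WeightedQuantile(durations, equalWeights, 0.5):F1}");
        Console.WriteLine($"Weighted median:   {WeightedQuantile(durations, weights, 0.5):F1}");
    }
}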


Read more


Nonparametric Cohen's d-consistent effect size



The effect size is a common way to describe the difference between two distributions. When these distributions are normal, one of the most popular ways to express the effect size is Cohen’s d. Unfortunately, it doesn’t work well for non-normal distributions.

In this post, I will show a robust Cohen’s d-consistent effect size formula for nonparametric distributions.
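For reference, here is the classic definition next to one possible robust analogue (a sketch of the general idea, not necessarily the exact formula from the post): replace the means with medians and the pooled standard deviation with a pooled MAD scaled by the consistency constant \(1.4826 \approx 1/\Phi^{-1}(0.75)\), which makes the MAD estimate the standard deviation under normality:

\[
d = \frac{\bar{x}_B - \bar{x}_A}{s_{\mathrm{pooled}}},
\qquad
\gamma = \frac{\mathrm{median}(B) - \mathrm{median}(A)}{1.4826 \cdot \mathrm{MAD}_{\mathrm{pooled}}},
\]

where \(\mathrm{MAD}_{\mathrm{pooled}}\) is combined from the two samples in the same way as \(s_{\mathrm{pooled}}\).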



Read more


Yet another robust outlier detector



Outlier detection is an important step in data processing. Unfortunately, if the distribution is not normal (e.g., right-skewed and heavy-tailed), it’s hard to choose a robust outlier detection algorithm that is not affected by tricky distribution properties. During the last several years, I have tried many different approaches, but I was not satisfied with their results. Finally, I found an algorithm about which I have (almost) no complaints. It’s based on the double median absolute deviation and the Harrell-Davis quantile estimator. In this post, I will show how it works and why it’s better than some other approaches.
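To give a taste of the approach, here is a rough sketch of the double-MAD idea (simplified: it uses the plain sample median everywhere, while the post relies on the Harrell-Davis quantile estimator; the cutoff multiplier 3 is also just an assumption for this example):

using System;
using System.Linq;

class DoubleMadSketch
{
    static double Median(double[] values)
    {
        var sorted = values.OrderBy(x => x).ToArray();
        int n = sorted.Length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    static void Main()
    {
        // A right-skewed sample with a suspiciously large value at the end.
        double[] sample = { 1.1, 1.3, 1.4, 1.6, 1.8, 2.1, 2.5, 3.2, 4.8, 25.0 };

        double m = Median(sample);
        // Separate MADs for the lower and the upper halves handle the asymmetry.
        double lowerMad = Median(sample.Where(x => x <= m).Select(x => m - x).ToArray());
        double upperMad = Median(sample.Where(x => x >= m).Select(x => x - m).ToArray());

        const double k = 3;
        var outliers = sample.Where(x => x < m - k * lowerMad || x > m + k * upperMad);
        Console.WriteLine(string.Join(", ", outliers)); // prints: 25
    }
}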


Read more


How ListSeparator Depends on Runtime and Operating System



This blog post was originally posted on the JetBrains .NET blog.

In the two previous blog posts from this series, we discussed how socket error codes and sorting order depend on the runtime and the operating system. For some, it may be obvious that some things are indeed specific to the operating system or the runtime, but often these issues come as a surprise and are only discovered when running our code on different systems. An interesting example that may bite us at runtime is the use of ListSeparator in our code. It should give us a common separator for list elements in a string. But is it really common? Let’s start our investigation by printing ListSeparator for the Russian language:

Console.WriteLine(new CultureInfo("ru-ru").TextInfo.ListSeparator);

On Windows, you will get the same result for .NET Framework, .NET Core, and Mono: the ListSeparator is ; (a semicolon). You will also get a semicolon on Mono+Unix. However, on .NET Core+Unix, you will get a non-breaking space.
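A quick way to check what you actually got (for example, to tell a regular space from a non-breaking space) is to print the character codes of the separator; this snippet is just a small illustration, not part of the original post:

using System;
using System.Globalization;
using System.Linq;

class ListSeparatorDemo
{
    static void Main()
    {
        string separator = new CultureInfo("ru-ru").TextInfo.ListSeparator;
        // A semicolon prints as 003B; a non-breaking space prints as 00A0.
        Console.WriteLine(string.Join(" ", separator.Select(c => ((int)c).ToString("X4"))));
    }
}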


Read more


How Sorting Order Depends on Runtime and Operating System



This blog post was originally posted on the JetBrains .NET blog.

In Rider, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files: jquery-1.4.1.js, jquery-1.4.1.min.js, jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:

jquery-1.4.1.js
jquery-1.4.1.min.js
jquery-1.4.1-vsdoc.js
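As a small experiment (not from the original post), you can compare culture-aware and ordinal sorting of the same file names yourself; the culture-aware order may differ between runtimes and operating systems, while the ordinal order is the same everywhere:

using System;
using System.Linq;

class SortingOrderDemo
{
    static void Main()
    {
        var files = new[] { "jquery-1.4.1.js", "jquery-1.4.1.min.js", "jquery-1.4.1-vsdoc.js" };

        // Culture-aware comparison: the result depends on the runtime and the OS.
        Console.WriteLine(string.Join(Environment.NewLine, files.OrderBy(f => f, StringComparer.CurrentCulture)));
        Console.WriteLine("---");
        // Ordinal comparison: a stable order based on numeric character values on every platform.
        Console.WriteLine(string.Join(Environment.NewLine, files.OrderBy(f => f, StringComparer.Ordinal)));
    }
}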

Read more


How Socket Error Codes Depend on Runtime and Operating System



This blog post was originally posted on the JetBrains .NET blog.

Rider consists of several processes that send messages to each other via sockets. To ensure the reliability of the whole application, it’s important to properly handle all the socket errors. In our codebase, we had the following code, which was adapted from Mono Debugger Libs and helps us communicate with debugger processes:

protected virtual bool ShouldRetryConnection (Exception ex, int attemptNumber)
{
    var sx = ex as SocketException;
    if (sx != null) {
        if (sx.ErrorCode == 10061) //connection refused
            return true;
    }
    return false;
}

In the case of a failed connection due to a “ConnectionRefused” error, we retry the connection attempt. This works fine with .NET Framework and Mono. However, once we migrated to .NET Core, this method no longer correctly detected the “connection refused” situation on Linux and macOS. If we open the SocketException documentation, we will learn that this class has three different properties with error codes:

  • SocketError SocketErrorCode: Gets the error code that is associated with this exception.
  • int ErrorCode: Gets the error code that is associated with this exception.
  • int NativeErrorCode: Gets the Win32 error code associated with this exception.

What's the difference between these properties? Should we expect different values on different runtimes or different operating systems? Which one should we use in production? Why do we have problems with ShouldRetryConnection on .NET Core? Let's figure it all out!
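One portable way to rewrite the check (a sketch based on the properties above; see the full post for the details) is to compare the SocketError enum value instead of the raw numeric code, so the logic does not depend on platform-specific error numbers:

protected virtual bool ShouldRetryConnection(Exception ex, int attemptNumber)
{
    // SocketErrorCode is a SocketError enum value, so no magic numbers are needed.
    if (ex is SocketException sx && sx.SocketErrorCode == SocketError.ConnectionRefused)
        return true;
    return false;
}
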
Read more


.NET Core performance revolution in Rider 2020.1



This blog post was originally posted on the JetBrains .NET blog.

Many Rider users may know that the IDE has two main processes: the frontend (a Java application based on the IntelliJ Platform) and the backend (a .NET application based on ReSharper). Since the first release of Rider, we’ve used Mono as the backend runtime on Linux and macOS. A few years ago, we decided to migrate to .NET Core. After resolving hundreds of technical challenges, we are finally ready to present the .NET Core edition of Rider!

In this blog post, we want to share the results of some benchmarks that compare the Mono-powered and the .NET Core-powered editions of Rider. You may find this interesting if you are also thinking about migrating to .NET Core, or if you just want a high-level overview of the performance and footprint improvements that the migration brought to Rider. (Spoiler: they’re huge!)


Read more