When we compare two distributions, it's not always enough to detect a statistically significant difference between them. In many cases, we also want to evaluate the magnitude of this difference. Let's look at the following image:
On the left side, we can see a timeline plot with 2000 points (at the middle of this plot, the distribution was significantly changed). On the right side, you can see density plots for the left and the right side of the timeline plot (before and after the change). It's a pretty simple case, the difference between distributions be expressed via the difference between mean values.
Now let's look at a more tricky case:
Here we have a bimodal distribution; after the change, the left mode "moved right." Now it's much harder to evaluate the difference between distributions because the mean and the median values almost not changed: the right mode has the biggest impact on these metrics than the left more.
And here is a much more tricky case:
Here we also have a bimodal distribution; after the change, both modes moved: the left mode "moved right" and the right mode "moved left." How should we describe the difference between these distributions now?Read more
In many statistical papers, you can find the following phrase: "assuming that we have a normal distribution." Probably, you saw plots of the normal distribution density function in some statistics textbooks, it looks like this:
The normal distribution is a pretty user-friendly mental model when we are trying to interpret the statistical metrics like mean and standard deviation. However, it may also be an insidious and misleading model when your distribution is not normal. There is a great sentence in the "Testing for normality" paper by R.C. Geary, 1947 (the quote was found here):
Normality is a myth; there never was, and never will be, a normal distribution.
I 100% agree with this statement. At least, if you are working with performance distributions (that are based on the multiple iterations of your benchmarks that measure the performance metrics of your applications), you should forget about normality. That's how a typical performance distribution looks like (I built the below picture based on a real benchmark that measures the load time of assemblies when we open the Orchard solution in Rider on Linux):Read more
Changepoint detection is an important task that has a lot of applications. For example, I use it to detect changes in the Rider performance test suite. It's very important to detect not only performance degradations, but any kinds of performance changes (e.g., the variance may increase, or a unimodal distribution may be split to several modes). You can see examples of such changes on the following picture (we change the color when a changepoint is detected):
Unfortunately, it's pretty hard to write a reliable and fast algorithm for changepoint detection.
Recently, I found a cool paper (Haynes, K., Fearnhead, P. & Eckley, I.A. "A computationally efficient nonparametric approach for changepoint detection," Stat Comput (2017) 27: 1293) that describes the ED-PELT algorithm.
O(N*log(N)) complexity and pretty good detection accuracy.
The reference implementation can be used via the changepoint.np R package.
However, I can't use R on our build server, so I decided to write my own C# implementation.