Posts on Andrey Akinshin

Lowland multimodality detection and robustness

Tue, 23 Apr 2024 00:00:00 +0000

We continue exploring various corner cases for the Lowland multimodality detection. In this post, we consider an example that illustrates the usefulness of THDQE.

Embracing model misspecification

Tue, 16 Apr 2024 00:00:00 +0000

When researchers focus on model design, they often worry whether the model is correct or not. I believe that we should accept the fact that all the models are wrong. The world is too complex to be captured by a single model: we are never able to acknowledge all the variables. Therefore, the answer to the question “Is the model correct?” is always “No”. It should not bother us: from the pragmatic perspective, it is irrelevant whether the model is correct or not. If we embrace the model misspecification, we can switch our attention to the question “What is the impact of deviations from the model on the decision-making?”

Recently, I was reading cerreia2020. I am still in the process of understanding the technical part, but I was charmed by the Introduction, so I want to share quotes I liked from this paper and referenced box1976 and chatfield1995.

Preprint announcement: 'Quantile-Respectful Density Estimation Based on the Harrell-Davis Quantile Estimator'

Tue, 09 Apr 2024 00:00:00 +0000

I have just published a preprint of a paper ‘Quantile-Respectful Density Estimation Based on the Harrell-Davis Quantile Estimator’. It is based on a series of my research notes.

The paper preprint is available on arXiv: arXiv:2404.03835 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-qrdehd. You can cite it as follows:

Andrey Akinshin (2024) “Quantile-Respectful Density Estimation Based on the Harrell-Davis Quantile Estimator” arXiv:2404.03835

Abstract:

Traditional density and quantile estimators are often inconsistent with each other. Their simultaneous usage may lead to inconsistent results. To address this issue, we propose a novel smooth density estimator that is naturally consistent with the Harrell-Davis quantile estimator. We also provide a jittering implementation to support discrete-continuous mixture distributions.

Lowland multimodality detection and jittering

Tue, 02 Apr 2024 00:00:00 +0000

In A better jittering approach for discretization acknowledgment in density estimation, I discussed the jittering approach that improves Quantile-Respectful Density Estimation for discrete distributions and continuous-discrete mixtures. In this post, I will show a brief example of how such an approach improves the accuracy of the Lowland multimodality detection.

Quantile-Respectful Density Estimation and Trimming

Tue, 26 Mar 2024 00:00:00 +0000

I continue the topic of Quantile-Respectful Density Estimation in the context of Multimodality Detection. In this post, we briefly discuss the handling of the QRDE boundary spikes in order to correctly detect the near-border modes.

A better jittering approach for discretization acknowledgment in density estimation

Tue, 19 Mar 2024 00:00:00 +0000

In How to build a smooth density estimation for a discrete sample using jittering, I proposed a jittering approach. It turned out that it does not always work well. It is not always capable of preserving the original distribution shape and avoiding gaps. In this post, I would like to propose a better strategy.

Effect Sizes and Asymmetry

Tue, 12 Mar 2024 00:00:00 +0000

Cohen’s d is one of the most popular measures of the effect size. Unfortunately, it was designed for the normal distribution, which may make it a misleading measure in the non-normal case. And the real distributions are never normal. When we discuss deviations from normality, we should treat the illusion of normality not as an atomic mental construction, but rather as a set of independent assumptions, each of which may be violated independently. In this post, I take a look at what kind of issues we may have when the symmetry assumption is heavily violated.

Pragmatic Statistics Manifesto

Tue, 05 Mar 2024 00:00:00 +0000

Statistics is one of the most confusing, controversial, and depressing disciplines I know. So many different approaches, so many different opinions, so many arguments, so many person-years of wasted time, and so many flawed peer-reviewed papers.

What we want from statistics is an easy-to-use tool that would nudge us toward asking the right questions and then straightforwardly guide us on how to design proper and relevant statistical procedures. What we have is a bunch of vaguely described sets of strange equations, a few arbitrarily chosen magical numbers as thresholds, and no clear understanding of what to do.

In the scientific community, there are a lot of adherents of Frequentist statistics (both Neyman-Pearson and Fisherian), Bayesian statistics, Likelihood statistics, Nonparametric statistics, Robust statistics, and many other statistics. And almost no one discusses Pragmatic statistics. I feel like we really need something which is called Pragmatic statistics. However, it should not be just a set of “blessed” approaches but rather a mindset.

Let me make an attempt to speculate on the principles that should form the foundation of the Pragmatic statistics approach. In future posts, I will show how to apply these principles to solve real-world problems.

The Effect Existence, Its Magnitude, and the Goals

Tue, 27 Feb 2024 00:00:00 +0000

If you are curious if something impacts something else, the answer is probably “yes.” Does that indicator depend on those factors? Yes, it does. If we change this thing, would it affect …? Yes, it would. If a person takes this pill, could it cause a non-exactly-zero change in the body? Yes, the presence of the pill is already a change that can always be detected with the right amount of effort.

One may argue that in some cases (assuming the list of specific cases is presented), zero effect does exist. For a moment, let us pretend that it is true. Now, let us imagine a parallel universe, which is the same as ours but with the presence of the effect. Unfortunately, the effect is so small that our tools are not sophisticated enough to detect it. Imagine being put into one of these worlds, but you don’t know which one. How do you determine the existence of the effect? Of course, you can improve the resolution of the measurement tools via new scientific discoveries, but with the current state of technology, the absence of the effect cannot be checked. Therefore, it is always safer to assume that the effect exists, keeping in mind that it can be negligible. Let us accept this assumption and continue as if it were the absolute truth.

Case Study: A City Social Survey

Tue, 20 Feb 2024 00:00:00 +0000

Imagine a city mayor considering a project offering to build parks in several neighborhoods. It can be a good budget investment since it can potentially increase the happiness level of the citizens. However, it is just a hypothesis: if parks do not impact happiness, it is worth considering other city renovation projects. It makes sense to perform a pilot experiment before spending the budget on all the parks. The mayor is thinking about the following plan: pick a random neighborhood, survey the citizens to measure their happiness, build a park, survey the citizens again, compare the survey results, make a decision about the further parks in other neighborhoods. Someone is needed to design the survey and draw the conclusion.

Let us explore possible approaches to perform such a study. These artificial examples are not guidelines but rather simplified illustrations of possible mindsets presented as lists of thoughts. In this demonstration, we mainly focus on the attitude to the research process rather than on the technical details. All the examples are based on real stories.

Simplifying adjustments of confidence levels and practical significance thresholds

Tue, 13 Feb 2024 00:00:00 +0000

Translation of the business goals to the actual parameters of the statistical procedure is a non-trivial task. The degree of non-triviality increases if we have to adjust several parameters at the same time. In this post, we consider the problem of simultaneously choosing the confidence level and the practical significance threshold. We discuss possible pitfalls and how to simplify the adjusting procedure to avoid them.

Degrees of practical significance

Tue, 06 Feb 2024 00:00:00 +0000

Let’s say we have two data samples, and we want to check if there is a difference between them. If we are talking about any kind of difference, the answer is most probably yes. It’s highly unlikely that two random samples are identical. Even if they are, there are still chances that we observe such a situation by accident, and there is a difference in the underlying distributions. Therefore, the discussion about the existence of any kind of difference is not meaningful.

To make more meaningful insights, researchers often talk about statistical significance. The approach can also be misleading. If the sample size is large enough, we are almost always able to detect even a negligible difference and obtain a statistically significant result for any pair of distributions. On the other hand, a huge difference can be declared insignificant if the sample size is small. While the concept is interesting and well-researched, it rarely matches the actual research goal. I strongly believe that we should not test for the nil hypothesis (checking if the true difference is exactly zero).

Here, we can switch from statistical significance to practical significance. We are supposed to define a threshold (e.g., in terms of minimum effect size) for the difference that is meaningful for the research. This approach has more chances to be aligned with the research goals. However, it is also not always satisfying enough. We should keep in mind that hypothesis testing often arises in the context of decision-making problems. In some cases, we can do exploration research in which we just want to have a better understanding of the world. However, in most cases, we do not perform calculations just because we are curious; we often want to make a decision based on the results. And this is the most crucial moment. It should always be the starting point in any research project. First of all, we should clearly describe the possible decisions and their preconditions. When we start doing that, we can discover that not all the practically significant outcomes are equally significant. If different practically significant results may lead to different decisions, we should define the proper classification in advance during the research design stage. The dichotomy of “practically significant” vs. “not practically significant” may conceal important problem aspects and lead to a wrong decision.

In this post, I would like to discuss the degrees of practical significance and show an example of how important it is for some problems.

Weighted Mann-Whitney U test, Part 3

Tue, 30 Jan 2024 00:00:00 +0000

I continue building a weighted version of the Mann–Whitney $U$ test. While the previously suggested approach feels promising, I don’t like the usage of Bootstrap to obtain the $p$-value. It is always better to have a deterministic and exact approach where it’s possible. I still don’t know how to solve it in the general case, but it seems that I’ve obtained a reasonable solution for some specific cases. The current version of the approach still has issues and requires additional correction factors in some cases and additional improvements. However, it passes my minimal requirements, so it is worth trying to continue developing this idea. In this post, I share the description of the weighted approach and provide numerical examples.

Andreas Löffler's implementation of the exact p-values calculations for the Mann-Whitney U test

Tue, 23 Jan 2024 00:00:00 +0000

Mann-Whitney is one of the most popular non-parametric statistical tests. Unfortunately, most test implementations in statistical packages are far from perfect. The exact p-value calculation is time-consuming and can be impractical for large samples. Therefore, most implementations automatically switch to the asymptotic approximation, which can be quite inaccurate. Indeed, the classic normal approximation could produce enormous errors. Thanks to the Edgeworth expansion, the accuracy can be improved, but it is still not always satisfactory enough. I prefer using the exact p-value calculation whenever possible.

The computational complexity of the exact p-value calculation using the classic recurrent equation suggested by Mann and Whitney is $\mathcal{O}(n^2 m^2)$ in terms of time and memory. It’s not a problem for small samples, but for medium-size samples, it is slow, and it has an extremely huge memory footprint. This gives us an unpleasant dilemma: either we use the exact p-value calculation (which is extremely time and memory-consuming), or we use the asymptotic approximation (which gives poor accuracy).

Last week, I got acquainted with a brilliant algorithm for the exact p-value calculation suggested by Andreas Löffler in 1982. It’s much faster than the classic approach, and it requires only $\mathcal{O}(n+m)$ memory.
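To make the cost of the classic approach more tangible, here is a minimal sketch of the classic recurrence (not Löffler's algorithm). It assumes samples without ties; the function names `count_u` and `exact_cdf` are hypothetical and chosen only for this illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_u(u: int, n: int, m: int) -> int:
    """Number of orderings of n x-values and m y-values (no ties) with U == u."""
    if u < 0:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    # The largest element is either an x (it exceeds all m y-values and adds m to U)
    # or a y (it adds nothing to U).
    return count_u(u - m, n - 1, m) + count_u(u, n, m - 1)


def exact_cdf(u: int, n: int, m: int) -> float:
    """P(U <= u) under the null hypothesis, assuming no ties."""
    total = comb(n + m, n)
    return sum(count_u(k, n, m) for k in range(u + 1)) / total


if __name__ == "__main__":
    print([count_u(u, 2, 2) for u in range(5)])
    print(exact_cdf(0, 2, 2))
```

The memoization table is what makes this approach so memory-hungry: it keeps a value for every reachable combination of `u`, `n`, and `m`.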

Eclectic statistics

Tue, 16 Jan 2024 00:00:00 +0000

In the world of mathematical statistics, there is a constant confrontation between adepts of different paradigms. This is a constant source of confusion for many researchers who struggle to pick out the proper approach to follow. For example, how to choose between the frequentist and Bayesian approaches? Since these paradigms may produce inconsistent results (e.g., see Lindley’s paradox), some choice has to be made. The easiest way to conduct research is to pick a single paradigm and stick to it. The right way to conduct research is to carefully think.

Change Point Detection and Recent Changes

Tue, 09 Jan 2024 00:00:00 +0000

Change point detection (CPD) in time series analysis is an essential tool for identifying significant shifts in data patterns. These shifts, or “change points,” can signal critical transitions in various contexts. While most CPD algorithms are adept at discovering historical change points, their sensitivity in detecting recent changes can be limited, often due to a key parameter: the minimum distance between sequential change points. In this post, I share some speculations on how we can improve CPD analysis by combining two change point detectors.

Merging extended P² quantile estimators, Part 1

Tue, 02 Jan 2024 00:00:00 +0000

P² quantile estimator is a streaming quantile estimator with $\mathcal{O}(1)$ memory footprint and an extremely fast update procedure. Several days ago, I learned that it was adopted for the new Paint.NET GPU-based Median Sketch effect (the description is here). While P² meets the basic problem requirement (streaming median approximation without storing all the values), the algorithm performance is still not acceptable without additional adjustments. A significant performance improvement can be obtained if we split the input stream, process each part separately with a separate P², and merge the results. Unfortunately, the merging procedure is a tricky thing to implement. I enjoy such challenges, so I decided to attempt to build such a merging approach. In this post, I describe my first attempt.

Hodges-Lehmann ratio estimator vs. Bhattacharyya's scale ratio estimator

Tue, 26 Dec 2023 00:00:00 +0000

Previously, I discussed an idea of a ratio estimator based on the Hodges-Lehmann estimator. This idea looks so simple and natural that I was sure that it must have already been proposed and studied. However, when I started to search for it, it turned out that it was not as easy as I expected. Moreover, some papers attribute this idea to Bhattacharyya, which is not accurate. In this post, we discuss the difference between these two approaches.

Finite-sample Gaussian efficiency: Shamos vs. Rousseeuw-Croux Qn scale estimators

Tue, 19 Dec 2023 00:00:00 +0000

Previously, we compared the finite-sample Gaussian efficiency of the Rousseeuw-Croux scale estimators and the QAD estimator. In this post, we compare the finite-sample Gaussian efficiency of the Shamos scale estimator and the Rousseeuw-Croux $Q_n$ scale estimator. This is a particularly interesting comparison. In the famous “Alternatives to the Median Absolute Deviation” (1993) paper by Peter J. Rousseeuw and Christophe Croux, the authors presented $Q_n$ as an improved version of the Shamos estimator. Both estimators are based on the set of pairwise absolute differences between the elements of the sample. The Shamos estimator takes the median of this set and, therefore, has the asymptotic breakdown point of $\approx 29\%$ and the asymptotic Gaussian efficiency of $\approx 86\%$. $Q_n$ takes the first quartile of this set and, therefore, has the asymptotic breakdown point of $\approx 50\%$ (like the median) and the asymptotic Gaussian efficiency of $\approx 82\%$. It sounds like a good deal: we trade $4\%$ of the asymptotic Gaussian efficiency for $21\%$ of the asymptotic breakdown point. What could possibly stop us from using $Q_n$ everywhere instead of the Shamos estimator?

Well, here is a trick. The breakdown point of $29\%$ is actually a practically reasonable value. If more than $29\%$ of the sample are outliers, we should probably consider them not as outliers but as a separate mode. Such a situation should be handled by a multimodality detector and lead us to a different approach. The usage of dispersion estimators in the case of multimodal distributions is potentially misleading. When such a multimodality diagnostic scheme is used, there is no practical need for a higher breakdown point.

Thus, the breakdown point of $50\%$ is not such an impressive property of $Q_n$. Meanwhile, the drop in Gaussian efficiency is not so enjoyable. $4\%$ may sound like a negligible difference, but it is only the asymptotic value. In real life, we typically tend to work with finite samples. Let us explore the actual finite-sample Gaussian efficiency values of these estimators.
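To make the difference between the two estimators concrete, here is a minimal sketch of both. It is a naive quadratic implementation, assuming the asymptotic consistency constants for the normal distribution and omitting finite-sample bias corrections; for $Q_n$, the order statistic $k = \binom{h}{2}$ with $h = \lfloor n/2 \rfloor + 1$ is used as the "first quartile". The function names are hypothetical.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import NormalDist, median


def pairwise_abs_diffs(x):
    """Sorted absolute differences |x_i - x_j| for all i < j."""
    return sorted(abs(a - b) for a, b in combinations(x, 2))


def shamos(x):
    """Median of pairwise absolute differences, scaled for normal consistency."""
    c = 1 / (sqrt(2) * NormalDist().inv_cdf(0.75))  # about 1.0484
    return c * median(pairwise_abs_diffs(x))


def qn(x):
    """k-th smallest pairwise absolute difference, scaled for normal consistency."""
    h = len(x) // 2 + 1
    k = comb(h, 2)
    c = 1 / (sqrt(2) * NormalDist().inv_cdf(0.625))  # about 2.2191
    return c * pairwise_abs_diffs(x)[k - 1]


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 5]
    print(shamos(sample), qn(sample))
```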

Two-pass change point detection for temporary interval condensation

Tue, 12 Dec 2023 00:00:00 +0000

When we choose a change point detection algorithm, the most important thing is to clearly understand why we want to detect the change points. The knowledge of the final business goals is essential. In this post, I show a simple example of how a business requirement can be translated into algorithm adjustments.

Inconsistent violin plots

Tue, 05 Dec 2023 00:00:00 +0000

The usefulness and meaningfulness of the violin plots are dubious (e.g., see this video and the corresponding discussion). While this type of plot inherits issues of density plots (e.g., the bandwidth selection problem) and box plots, it also introduces new problems. One such problem is data inconsistency: default density plots and box plots are often incompatible with each other. In this post, I show an example of this inconsistency.

Sporadic noise problem in change point detection

Tue, 28 Nov 2023 00:00:00 +0000

We consider a problem of change point detection at the end of a time series. Let us say that we systematically monitor readings of an indicator, and we want to react to noticeable changes in the measured values as fast as possible. When there are no changes in the underlying distribution, any alerts about detected change points should be considered false positives. Typically, in such problems, we consider the i.i.d. assumption that claims that in the absence of change points, all the measurements are independent and identically distributed. Such an assumption significantly simplifies the mathematical model, but unfortunately, it is rarely fully satisfied in real life. If we want to build a reliable change point detection system, it is important to be aware of possible real-life artifacts that introduce deviations from the declared model. In this post, I discuss the problem of sporadic noise.

Resistance to the low-density regions: the Hodges-Lehmann location estimator based on the Harrell-Davis quantile estimator

Tue, 21 Nov 2023 00:00:00 +0000

Previously, I have discussed the topic of the resistance to the low-density regions of various estimators including the Hodges-Lehmann location estimator ($\operatorname{HL}$). In general, $\operatorname{HL}$ is a great estimator with great statistical efficiency and a decent breakdown point. Unfortunately, it has low resistance to the low-density regions around the $29^\textrm{th}$ and $71^\textrm{st}$ percentiles, which may cause trouble in the case of multimodal distributions. I am trying to find a modification of $\operatorname{HL}$ that performs almost the same as the original $\operatorname{HL}$, but has increased resistance. One of the ideas I had was using the Harrell-Davis quantile estimator instead of the sample median to evaluate $\operatorname{HL}$. Regrettably, this idea did not turn out to be successful: such an estimator has a resistance function similar to the original $\operatorname{HL}$. I believe that it is important to share negative results, and therefore this post contains a bunch of plots, which illustrate results of relevant numerical simulations.

Median vs. Hodges-Lehmann: compare efficiency under heavy-tailedness

Tue, 14 Nov 2023 00:00:00 +0000

In the previous post, I shared some thoughts on how to evaluate the statistical efficiency of estimators under heavy-tailed distributions. In this post, I apply the described ideas to actually compare efficiency values of the Mean, the Sample Median, and the Hodges-Lehmann location estimator under various distributions.

Thoughts about robustness and efficiency

Tue, 07 Nov 2023 00:00:00 +0000

Statistical efficiency is an essential characteristic, which has to be taken into account when we choose between different estimators. When the underlying distribution is a normal one or at least light-tailed, evaluation of the statistical efficiency typically is not so hard. However, when the underlying distribution is a heavy-tailed one, problems appear. The statistical efficiency is usually expressed via the mean squared error or via variance, which are not robust. Therefore, heavy-tailedness may lead to distorted or even infinite efficiency, which is quite impractical. So, how do we compare the efficiency of estimators under a heavy-tailed distribution? Let’s say we want to compare the efficiency of the mean and the median. Under the normal distribution (so-called Gaussian efficiency), this task is trivial: we build the sampling mean distribution and the sampling median distribution, estimate the variance for each of them, and then get the ratio of these variances. However, if we are interested in the median, we are probably expecting some outliers. Most of the significant real-life outliers come from the heavy-tailed distributions. Therefore, Gaussian efficiency is not the most interesting metric. It makes sense to evaluate the efficiency of the considered estimators under various heavy-tailed distributions. Unfortunately, the variance is not a robust measure and is too sensitive to tails: if the sampling distribution is also not normal or even heavy-tailed, the meaningfulness of the true variance value decreases. It seems reasonable to consider alternative robust measures of dispersion. Which one should we choose? Maybe Median Absolute Deviation (MAD)? Well, the asymptotic Gaussian efficiency of MAD is only ~37%. And here we have the same problem: should we trust the Gaussian efficiency under heavy-tailedness? Therefore, we should first evaluate the efficiency of dispersion estimators.
But we can’t do it without a previously chosen dispersion estimator! And could we truly express the actual relative efficiency between two estimators under tricky asymmetric multimodal heavy-tailed distributions using a single number?

Finite-sample Gaussian efficiency: Quantile absolute deviation vs. Rousseeuw-Croux scale estimators

Tue, 31 Oct 2023 00:00:00 +0000

In this post, we discuss the finite-sample Gaussian efficiency of various robust dispersion estimators. The classic standard deviation has the highest possible Gaussian efficiency of $100\%$, but it is not robust: a single outlier can completely destroy the estimation. A typical robust alternative to the standard deviation is the Median Absolute Deviation ($\operatorname{MAD}$). While the $\operatorname{MAD}$ is highly robust (the breakdown point is $50\%$), it is not efficient: its asymptotic Gaussian efficiency is only $37\%$. Common alternatives to the $\operatorname{MAD}$ are the Rousseeuw-Croux $S_n$ and $Q_n$ scale estimators, which provide higher efficiency while keeping the breakdown point of $50\%$. In one of my recent preprints, I introduced the concept of the Quantile Absolute Deviation ($\operatorname{QAD}$) and its specific cases: the Standard Quantile Absolute Deviation ($\operatorname{SQAD}$) and the Optimal Quantile Absolute Deviation ($\operatorname{OQAD}$). Let us review the finite-sample and asymptotic values of the Gaussian efficiency for these estimators.
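As a quick illustration of the robustness claim, here is a minimal sketch that compares the standard deviation with the $\operatorname{MAD}$ on a sample with a single extreme value. It assumes the asymptotic consistency constant for the normal distribution ($1/\Phi^{-1}(0.75) \approx 1.4826$) without finite-sample corrections; the function name `mad` is hypothetical.

```python
from statistics import NormalDist, median, stdev


def mad(x):
    """Median absolute deviation around the median, scaled for normal consistency."""
    c = 1 / NormalDist().inv_cdf(0.75)  # about 1.4826
    center = median(x)
    return c * median([abs(v - center) for v in x])


if __name__ == "__main__":
    sample = [1, 2, 3, 4, 100]  # the last value is an obvious outlier
    print(f"stdev = {stdev(sample):.2f}, MAD = {mad(sample):.2f}")
```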

Mann-Whitney U test and heteroscedasticity

Tue, 24 Oct 2023 00:00:00 +0000

The Mann-Whitney U test is a good nonparametric test, which mostly targets changes in locations. However, it doesn’t properly support all types of differences between the two distributions. Specifically, it poorly handles changes in variance. In this post, I briefly discuss its behavior in reaction to scaling a distribution without introducing location changes.

Exploring the power curve of the Ansari-Bradley test

Tue, 17 Oct 2023 00:00:00 +0000

The Ansari-Bradley test is a popular rank-based nonparametric test for a difference in scale/dispersion parameters. In this post, we explore its power curve in a numerical simulation.

Exploring the power curve of the Lepage test

Tue, 10 Oct 2023 00:00:00 +0000

Previously, I already discussed the Cucconi test. In this post, I continue the topic of nonparametric tests and check out the Lepage test.

Weighted Hodges-Lehmann location estimator and mixture distributions

Tue, 03 Oct 2023 00:00:00 +0000

The classic non-weighted Hodges-Lehmann location estimator of a sample $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right), $$

where $\operatorname{median}$ is the sample median. Previously, we have defined a weighted version of the Hodges-Lehmann location estimator as follows:

$$ \operatorname{WHL}(\mathbf{x}, \mathbf{w}) = \underset{1 \leq i \leq j \leq n}{\operatorname{wmedian}} \left(\frac{x_i + x_j}{2},\; w_i \cdot w_j \right), $$

where $\mathbf{w} = (w_1, w_2, \ldots, w_n)$ is the vector of weights and $\operatorname{wmedian}$ is the weighted median. For simplicity, in the scope of the current post, the Hyndman-Fan Type 7 quantile estimator is used as the base for the weighted median.

In this post, we consider a numerical simulation in which we compare the sampling distributions of $\operatorname{HL}$ and $\operatorname{WHL}$ in the case of a mixture distribution.
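The two definitions above can be sketched in a few lines. Note one deliberate simplification: instead of a weighted median based on the Hyndman-Fan Type 7 quantile estimator, this sketch uses the simplest possible weighted median (the smallest value at which the cumulative weight reaches half of the total weight), so its results may differ slightly from the estimator discussed in the post. All function names are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import median


def hl(x):
    """Median of the Walsh averages (x_i + x_j) / 2 for i <= j."""
    return median([(a + b) / 2 for a, b in combinations_with_replacement(x, 2)])


def weighted_median(values, weights):
    """Simplified weighted median: first value where cumulative weight >= half."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    acc = 0.0
    for value, weight in pairs:
        acc += weight
        if acc >= half:
            return value
    return pairs[-1][0]


def whl(x, w):
    """Weighted median of the Walsh averages with pair weights w_i * w_j."""
    index_pairs = list(combinations_with_replacement(range(len(x)), 2))
    averages = [(x[i] + x[j]) / 2 for i, j in index_pairs]
    pair_weights = [w[i] * w[j] for i, j in index_pairs]
    return weighted_median(averages, pair_weights)


if __name__ == "__main__":
    print(hl([1, 2, 3]), whl([1, 2, 3], [1, 1, 1]), whl([1, 2, 3], [1, 1, 0]))
```

With a zero weight for the last element, the weighted estimate coincides with the non-weighted estimate for the first two elements, which is the behavior we would expect from a reasonable weighted estimator.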

Carling’s Modification of the Tukey's fences

Tue, 26 Sep 2023 00:00:00 +0000

Let us consider the classic problem of outlier detection in a one-dimensional sample. One of the most popular approaches is Tukey’s fences, which define the following range:

$$ [Q_1 - k(Q_3 - Q_1);\; Q_3 + k(Q_3 - Q_1)], $$

where $Q_1$ and $Q_3$ are the first and the third quartiles of the given sample.

All the values outside the given range are classified as outliers. The typical values of $k$ are $1.5$ for “usual outliers” and $3.0$ for “far out” outliers. In the classic Tukey’s fences approach, $k$ is often a predefined constant. However, there are alternative approaches that define $k$ dynamically based on the given sample. One of the possible variations of Tukey’s fences is Carling’s modification that defines $k$ as follows:

$$ k = \frac{17.63n - 23.64}{7.74n - 3.71}, $$

where $n$ is the sample size.

In this post, we compare the classic Tukey’s fences with $k=1.5$ and $k=3.0$ against Carling’s modification.
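The comparison can be sketched as follows. The code applies the range exactly as written above and plugs in either a constant $k$ or the sample-size-dependent $k$. The quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption of this sketch: other quartile definitions would shift the fences slightly. The function names are hypothetical.

```python
from statistics import quantiles


def carling_k(n):
    """Sample-size-dependent multiplier from the formula above."""
    return (17.63 * n - 23.64) / (7.74 * n - 3.71)


def fences(x, k):
    """Range [Q1 - k * IQR; Q3 + k * IQR] for the given multiplier k."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(x, k):
    """Values of x that fall outside the fences."""
    lower, upper = fences(x, k)
    return [v for v in x if v < lower or v > upper]


if __name__ == "__main__":
    sample = list(range(1, 10)) + [100]
    for k in (1.5, 3.0, carling_k(len(sample))):
        print(f"k = {k:.3f}: fences = {fences(sample, k)}, outliers = {outliers(sample, k)}")
```

For large samples, the dynamic multiplier approaches $17.63 / 7.74 \approx 2.28$, so it lands between the two classic constants.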

Central limit theorem and log-normal distribution

Tue, 19 Sep 2023 00:00:00 +0000

It is inconvenient to work with samples from a distribution of unknown form. Therefore, researchers often switch to considering the sample mean value and hope that thanks to the central limit theorem, the distribution of the sample means should be approximately normal. They say that if we consider samples of size $n \geq 30$, we can expect practically acceptable convergence to normality thanks to the Berry–Esseen theorem. Indeed, this statement is almost valid for many real data sets. However, we can actually expect the applicability of this approach only for light-tailed distributions. In the case of heavy-tailed distributions, convergence to normality is so slow that we cannot rely on the normality assumption for the distribution of the sample means. In this post, I provide an illustration of this effect using the log-normal distribution.
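A small simulation makes the effect easy to reproduce. The sketch below is a minimal illustration, not the experiment from the post: it draws samples of size $n = 30$ from the standard normal and from the log-normal distribution with $\mu = 0$, $\sigma = 1$ (parameters chosen here as an assumption), and reports the skewness of the resulting sample means. For a normal distribution of the means, the skewness should be close to zero.

```python
import random
from statistics import fmean


def skewness(values):
    """Moment-based sample skewness (third standardized moment)."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])
    m3 = fmean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


def skewness_of_sample_means(draw, n=30, replications=2000, seed=1729):
    """Skewness of the means of `replications` samples of size n."""
    rng = random.Random(seed)
    means = [fmean([draw(rng) for _ in range(n)]) for _ in range(replications)]
    return skewness(means)


normal_skew = skewness_of_sample_means(lambda rng: rng.gauss(0, 1))
lognormal_skew = skewness_of_sample_means(lambda rng: rng.lognormvariate(0, 1))

if __name__ == "__main__":
    print(f"normal: {normal_skew:.2f}, log-normal: {lognormal_skew:.2f}")
```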

Hodges-Lehmann Gaussian efficiency: location shift vs. shift of locations

Tue, 12 Sep 2023 00:00:00 +0000

Let us consider two samples $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_m)$. The one-sample Hodges-Lehmann location estimator is defined as the median of the Walsh (pairwise) averages:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right), \quad \operatorname{HL}(\mathbf{y}) = \underset{1 \leq i \leq j \leq m}{\operatorname{median}} \left(\frac{y_i + y_j}{2} \right). $$

For these two samples, we can also define the shift between these two estimations:

$$ \Delta_{\operatorname{HL}}(\mathbf{x}, \mathbf{y}) = \operatorname{HL}(\mathbf{x}) - \operatorname{HL}(\mathbf{y}). $$

The two-sample Hodges-Lehmann location shift estimator is defined as the median of pairwise differences:

$$ \operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right). $$

Previously, I already compared the location shift estimator with the difference of median estimators (1, 2). In this post, I compare the difference between two location estimations and the shift estimations in terms of Gaussian efficiency. Before I started this study, I expected that $\operatorname{HL}$ should be more efficient than $\Delta_{\operatorname{HL}}$. Let us find out if my intuition is correct or not!
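The three definitions above translate almost directly into code. Here is a minimal sketch with naive quadratic implementations; the function names `hl_location`, `delta_hl`, and `hl_shift` are hypothetical.

```python
from itertools import combinations_with_replacement, product
from statistics import median


def hl_location(x):
    """One-sample estimator: median of the Walsh averages (x_i + x_j) / 2, i <= j."""
    return median([(a + b) / 2 for a, b in combinations_with_replacement(x, 2)])


def delta_hl(x, y):
    """Difference between the two one-sample location estimations."""
    return hl_location(x) - hl_location(y)


def hl_shift(x, y):
    """Two-sample shift estimator: median of all pairwise differences x_i - y_j."""
    return median([a - b for a, b in product(x, y)])


if __name__ == "__main__":
    x, y = [1, 2, 3], [0, 1, 2, 10]
    print(delta_hl(x, y), hl_shift(x, y))
```

On a pure shift such as `x = [1, 2, 3]`, `y = [0, 1, 2]`, both functions return the same value; once one of the samples contains an extreme value, the two estimations start to diverge.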

Thoughts on automatic statistical methods and broken assumptions

Tue, 05 Sep 2023 00:00:00 +0000

In the early days of applied statistics, all statistical experiments used to be performed by hand. In manual investigations, an investigator is responsible not only for interpreting the research results but also for the applicability validation of the used statistical approaches. Nowadays, more and more data processing is performed automatically on enormously huge data sets. Due to the extraordinary number of data samples, it is often almost impossible to verify each output individually using human eyes. Unfortunately, since we typically have no full control over the input data, we cannot guarantee certain assumptions that are required by classic statistical methods. These assumptions can be violated not only due to real-life phenomena we were not aware of during the experiment design stage, but also due to data corruption. In such corner cases, we may get misleading results, wrong automatic decisions, unacceptably high Type I/II error rates, or even a program crash because of a division by zero or another invalid operation. If we want to make an automatic analysis system reliable and trustworthy, the underlying mathematical procedures should correctly process malformed data. The normality assumption is probably the most popular one. There are well-known methods of robust statistics that focus only on slight deviations from normality and the appearance of extreme outliers. However, it is only a violation of one specific consequence of the normality assumption: light-tailedness.
In practice, this sub-assumption is often interpreted as “the probability of observing extremely large outliers is negligible.” Meanwhile, there are other implicit derived sub-assumptions: continuity (we do not expect tied values in the input samples), symmetry (we do not expect highly-skewed distributions), unimodality (we do not expect multiple modes), nondegeneracy (we do not expect all sample values to be equal), sample size sufficiency (we do not expect extremely small samples like single-element samples), and others.

Ratio estimator based on the Hodges-Lehmann approach

Tue, 29 Aug 2023 00:00:00 +0000

For two samples $\mathbf{x} = ( x_1, x_2, \ldots, x_n )$ and $\mathbf{y} = ( y_1, y_2, \ldots, y_m )$, the Hodges-Lehmann location shift estimator is defined as follows:

$$ \operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right). $$

Now, let us consider the problem of estimating the ratio of the location measures instead of the shift between them. While there are multiple approaches to providing such an estimation, one of the options that can be considered is based on the Hodges-Lehmann ideas.

Weighted Mann-Whitney U test, Part 2

Tue, 22 Aug 2023 00:00:00 +0000

Previously, I suggested a weighted version of the Mann–Whitney $U$ test. The distribution of the weighted normalized $U_\circ^\star$ can be obtained via bootstrap. However, it is always nice if we can come up with an exact solution for the statistic distribution or at least provide reasonable approximations. In this post, we start exploring this distribution.

Exploring the power curve of the Cucconi test

Tue, 15 Aug 2023 00:00:00 +0000

The Cucconi test is a nonparametric two-sample test that compares both location and scale. It is a classic example of the family of tests that perform such a comparison simultaneously instead of combining the results of a location test and a scale test. Intuitively, such an approach should fit unimodal distributions well. Moreover, it has the potential to outperform more generic nonparametric tests that do not rely on the unimodality assumption.

In this post, we briefly show the equations behind the Cucconi test and present a power curve that compares it with the Student’s t-test and the Mann-Whitney U test under normality.

Parametric, Nonparametric, Robust, and Defensive statistics

Tue, 08 Aug 2023 00:00:00 +0000

Recently, I started writing about defensive statistics. The methodology allows having parametric assumptions, but it adjusts statistical methods so that they continue working even in the case of huge deviations from the declared assumptions. This idea sounds quite similar to nonparametric and robust statistics. In this post, I briefly explain the difference between different statistical methodologies.

Insidious implicit statistical assumptions

Tue, 01 Aug 2023 00:00:00 +0000

Recently, I was rereading hampel1986 and I found this quote about the difference between robust and nonparametric statistics (page 9):

Robust statistics considers the effects of only approximate fulfillment of assumptions, while nonparametric statistics makes rather weak but nevertheless strict assumptions (such as continuity of distribution or independence).

This statement may sound obvious. Unfortunately, facts that are presumably obvious in general are not always so obvious at the moment. When a researcher works with specific types of distributions for a long time, the properties of these distributions may be transformed into implicit assumptions. This implicitness can be pretty dangerous. If an assumption is explicitly declared, it can become a starting point for a discussion on how to handle violations of this assumption. The implicit assumptions are hidden and therefore conceal potential issues in cases when the collected data do not meet our expectations.

A switch from parametric to nonparametric methods is sometimes perceived as a rejection of all assumptions. Such a perception can be hazardous. While the original parametric assumption is actually neglected, many researchers continue to act like the implicit consequences of this assumption are still valid.

Since normality is the most popular parametric assumption, I would like to briefly discuss connected implicit assumptions that are often perceived not as non-validated hypotheses, but as essential properties of the collected data.

Four main books on robust statistics

Tue, 25 Jul 2023 00:00:00 +0000

Robust statistics is a practical and pragmatic branch of statistics. If you want to design reliable and trustworthy statistical procedures, the knowledge of robust statistics is essential. Unfortunately, it’s a challenging topic to learn.

In this post, I share my favorite books on robust statistics. I cannot pick my favorite one: each book is good in its own way, and all of them complement each other. I am returning to these books periodically to reinforce and expand my understanding of the topic.

Multimodal distributions and effect size

Tue, 18 Jul 2023 00:00:00 +0000

When we want to express the difference between two samples or distributions, a popular measure family is the effect sizes based on differences between means (difference family). When the normality assumption is satisfied, this approach works well thanks to classic measures of effect size like Cohen’s d, Glass’ Δ, or Hedges’ g. With slight deviations from normality, robust alternatives may be considered. To build such a measure, it’s enough to upgrade classic measures by replacing the sample mean with a robust measure of central tendency and replacing the standard deviation with a robust measure of dispersion. However, it might not be enough in the case of large deviations from normality. In this post, I briefly discuss the problem of effect size evaluation in the context of multimodal distributions.

Unobvious limitations of R *signrank Wilcoxon Signed Rank functions

Tue, 11 Jul 2023 00:00:00 +0000

In R, we have functions to calculate the density, distribution function, and quantile function of the Wilcoxon Signed Rank statistic distribution: dsignrank, psignrank, and qsignrank. All the functions use exact calculations of the target functions (the R 4.3.1 implementation can be found here). The exact approach works excellently for small sample sizes. Unfortunately, for large sample sizes, it fails to provide the expected function values. Out of the box, there are no alternative approximation solutions that could allow us to get reasonable results. In this post, we investigate the limitations of these functions and provide sample size thresholds after which we might get invalid results.

Weighted Mann-Whitney U test, Part 1

Tue, 04 Jul 2023 00:00:00 +0000

Previously, I have discussed how to build weighted versions of various statistical methods. I have already covered weighted versions of various quantile estimators and the Hodges-Lehmann location estimator. Such methods can be useful in various tasks like the support of weighted mixture distributions or exponential smoothing. In this post, I suggest a way to build a weighted version of the Mann-Whitney U test.

Joining modes of multimodal distributions

Tue, 27 Jun 2023 00:00:00 +0000

Multimodality of distributions is a severe issue in statistical analysis. Comparing two multimodal distributions is a tricky challenge. The degree of this challenge depends on the number of existing modes. Switching from unimodal models to multimodal ones can be a controversial decision, potentially causing more problems than solutions. Hence, if we dare to increase the complexity of the considered models, we should be sure that this is an essential necessity. Even when we confidently detect a truly multimodal distribution, a unimodal model could be an acceptable approximation if it is sufficiently close to the true distribution. The simplicity of a unimodal model may make it preferable, even if it is less accurate. Of course, the research goals should always be taken into account when the particular model choice is being made.

Understanding the pitfalls of preferring the median over the mean

Tue, 20 Jun 2023 00:00:00 +0000

A common task in mathematical statistics is to aggregate a set of numbers $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$ to a single “average” value. Such a value is usually called central tendency. There are multiple measures of central tendency. The most popular one is the arithmetic average or the mean:

$$ \overline{\mathbf{x}} = \left( x_1 + x_2 + \ldots + x_n \right) / n. $$

The mean is so popular not only thanks to its simplicity but also because it provides the best way to estimate the center of the perfect normal distribution. Unfortunately, the mean is not a robust measure. This means that a single extreme value $x_i$ may distort the mean estimation and lead to a non-reproducible value that has nothing in common with the “expected” central tendency. The actual real-life distributions are never normal. They can be pretty close to the normal distribution, but only to a certain extent. Even small deviations from normality may produce occasional extreme outliers, which makes the mean an unreliable measure in the general case.

When people discover the danger of the mean, they start looking for a more robust measure of the central tendency. And the first obvious alternative is the sample median $\tilde{\mathbf{x}}$. The classic sample median is easy to calculate. First, you have to sort the sample. If the sample size $n$ is odd, the median is the middle element in the sorted sample. If $n$ is even, the median is the arithmetic average of the two middle elements in the sorted sample. The median is extremely robust: it provides a reasonable estimate even if almost half of the sample elements are corrupted.
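The sorting-based procedure described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_median` and the two toy samples are assumptions made for this example, not something taken from the post.

```python
from statistics import mean


def sample_median(values):
    """Classic sample median: sort the sample, then take the middle element
    (odd n) or the average of the two middle elements (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical toy data: the second sample has one corrupted element.
clean = [1, 2, 3, 4, 5]
corrupted = [1, 2, 3, 4, 5000]

print(mean(clean), sample_median(clean))          # both are 3
print(mean(corrupted), sample_median(corrupted))  # the mean jumps, the median stays at 3
```

A single corrupted element moves the mean from 3 to 1002, while the median does not change.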

For symmetric distributions (including the normal one), the true values of the mean and the median are the same. Once we discover the high robustness of the median, it may be tempting to always use the median instead of the mean. The median is often perceived as “something like the mean but with high resistance to outliers.” Indeed, what is the point of using the unreliable mean, if the median always provides a safer choice? Should we make the median our default option for the central tendency?

The answer is no. You should beware of any default options in mathematical statistics. All the measures are just tools, and each tool has its limitations and areas of applicability. A mindless transition from the mean to the median, regardless of the underlying distribution, is not a smart move. When we are picking a measure of central tendency to use, the first step should be reviewing the research goals: why do we need a measure of central tendency, and what are we going to do with the result? It’s impossible to make a rational decision on the statistical methods used without a clear understanding of the goals. Next, we should match the goals to the properties of available measures.

There are multiple practical issues with the median, but the most noticeable problem in practice is about its statistical efficiency. Understanding this problem reveals the price of advanced robustness of the median. In this post, we discuss the concept of statistical efficiency, estimate the statistical efficiency of the mean and the median under different distributions, and consider the Hodges-Lehmann estimator as a measure of central tendency that provides a better trade-off between robustness and efficiency.

Introducing the defensive statistics

Tue, 13 Jun 2023 00:00:00 +0000

Normal or approximately normal subjects are less useful objects of research than their pathological counterparts.

— Sigmund Freud, “The Psychopathology of Everyday Life”

In the realm of software development, reliability is crucial. This is especially true when creating systems that automatically analyze performance measurements to maintain optimal application performance. To achieve the desired level of reliability, we need a set of statistical approaches that provide accurate and trustworthy results. These approaches must work even when faced with varying input data sets and multiple violated assumptions, including malformed and corrupted values. In this blog post, I introduce “Defensive Statistics” as an appropriate methodology for tackling this challenge.

Edgeworth expansion for the Mann-Whitney U test, Part 2: increased accuracy

Tue, 06 Jun 2023 00:00:00 +0000

In the previous post, we showed how the Edgeworth expansion can improve the accuracy of obtained p-values in the Mann-Whitney U test. However, we considered only the Edgeworth expansion to terms of order $1/m$. In this post, we explore how to improve the accuracy of this approach using the Edgeworth expansion to terms of order $1/m^2$.

Edgeworth expansion for the Mann-Whitney U test

Tue, 30 May 2023 00:00:00 +0000

In previous posts, I have shown a severe drawback of the classic Normal approximation for the Mann-Whitney U test: under certain conditions, it can lead to quite substantial p-value errors, distorting the significance level of the test.

In this post, we will explore the potential of the Edgeworth expansion as a more accurate alternative for approximating the distribution of the Mann-Whitney U statistic.

Confusing tie correction in the classic Mann-Whitney U test implementation

Tue, 23 May 2023 00:00:00 +0000

In this post, we discuss the classic implementation of the Mann-Whitney U test for cases in which the considered samples contain tied values. This approach is used the same way in all the popular statistical packages.

Unfortunately, in some situations, this approach produces confusing p-values, which may be surprising for researchers who do not have a deep understanding of tie correction. Moreover, some statistical textbooks argue against the validity of the default tie correction. The controversial and counterintuitive nature of this approach may become a severe issue that leads to incorrect experiment design and flawed result interpretation. In order to prevent such problems, it is essential to clearly understand the actual impact of tied observations on the true p-value and the impact of tie correction on the approximated p-value estimation. In this post, we discuss the tie correction for the Mann-Whitney U test and review examples that illustrate potential problems. We also provide examples of the Mann-Whitney U test implementations from popular statistical packages: wilcox.test from stats (R), mannwhitneyu from SciPy (Python), and MannWhitneyUTest from HypothesisTests (Julia). At the end of the post, we discuss how to avoid possible problems related to the tie correction.

Efficiency of the central tendency measures under the uniform distribution

Tue, 16 May 2023 00:00:00 +0000

Statistical efficiency is one of the primary ways to compare various estimators. Since the normality assumption is often used, Gaussian efficiency (efficiency under the normal distribution) is typically considered. For example, the asymptotic Gaussian efficiency values of the median and the Hodges-Lehmann location estimator (the pseudo-median) are $\approx 64\%$ and $\approx 96\%$ respectively (assuming the baseline is the mean).

But what if the underlying distribution is not normal, but uniform? What would happen to the relative statistical efficiency values in this case? Let’s find out! In this post, we calculate the relative efficiency of the median, the Hodges-Lehmann location estimator, and the midrange to the mean under the uniform distribution (or under uniformity).

Unobvious problems of using the R's implementation of the Hodges-Lehmann estimator

Tue, 09 May 2023 00:00:00 +0000

The Hodges-Lehmann location estimator (also known as pseudo-median) is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right). $$

Essentially, it’s the median of the Walsh (pairwise) averages.

For two samples $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$ and $\mathbf{y} = \{ y_1, y_2, \ldots, y_m \}$, we can also consider the Hodges-Lehmann location shift estimator:

$$ \operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right). $$

In R, both estimators are available via the wilcox.test function. Here is a usage example:

set.seed(1729)
x <- rnorm(2000, 5) # A sample of size 2000 from the normal distribution N(5, 1)
y <- rnorm(2000, 2) # A sample of size 2000 from the normal distribution N(2, 1)
wilcox.test(x, conf.int = TRUE)$estimate
# (pseudo)median
# 5.000984
wilcox.test(y, conf.int = TRUE)$estimate
# (pseudo)median
# 1.969096
wilcox.test(x, y, conf.int = TRUE)$estimate
# difference in location
# 3.031782

In most cases, this function works fine. However, there is an unobvious corner case, in which it returns wrong values. In this post, we discuss the underlying problem and provide a correct implementation for the Hodges-Lehmann estimators.
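For reference, both definitions above can be evaluated directly by brute force. The following Python sketch is a naive illustration of the formulas, not the implementation from the post: the function names `hl_location` and `hl_shift` are assumptions, and the quadratic memory usage makes it suitable only for small samples.

```python
from itertools import combinations_with_replacement, product
from statistics import median


def hl_location(x):
    """Median of the Walsh averages (x_i + x_j) / 2 over all pairs with i <= j."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(x, 2))


def hl_shift(x, y):
    """Median of all pairwise differences x_i - y_j."""
    return median(a - b for a, b in product(x, y))


print(hl_location([1, 2, 3]))          # median of 1, 1.5, 2, 2, 2.5, 3
print(hl_shift([5, 6, 7], [1, 2, 3]))  # median of the nine pairwise differences
```

A direct evaluation like this is handy as a cross-check for faster or approximate implementations on small inputs.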

When Python's Mann-Whitney U test returns extremely distorted p-values

Tue, 02 May 2023 00:00:00 +0000

In the previous post, I discussed a huge difference between the p-values produced by the exact and asymptotic implementations of the Mann-Whitney U test in R. This issue is not unique to R: it is relevant for statistical packages in other languages as well. In this post, we review this problem in the Python package SciPy.

When R's Mann-Whitney U test returns extremely distorted p-values

Tue, 25 Apr 2023 00:00:00 +0000

The Mann–Whitney U test (also known as the Wilcoxon rank-sum test) is one of the most popular nonparametric statistical tests. In R, it can be accessed using the wilcox.test function, which has been available since R 1.0.0 (February 2000). With its extensive adoption and long-standing presence in R, the wilcox.test function has become a trusted tool for many researchers. But is it truly reliable, and to what extent can we rely on its accuracy by default?

In my work, I often encounter the task of comparing a large sample (e.g., of size 50+) with a small sample (e.g., of size 5). In some cases, the ranges of these samples do not overlap with each other, which is the extreme case of the Mann–Whitney U test: it gives the minimum possible p-value. In one of the previous posts, I presented the exact equation for such a p-value. If we compare two samples of sizes $n$ and $m$, the minimum p-value we can observe with the one-tailed Mann–Whitney U test is $1/C_{n+m}^n$. For example, if $n=50$ and $m=5$, we get $1/C_{55}^5 \approx 0.0000002874587$. Let’s check these calculations using R:

> wilcox.test(101:105, 1:50, alternative = "greater")$p.value
[1] 0.0001337028

The obtained p-value is $\approx 0.0001337028$, which is $\approx 465$ times larger than we expected! Have we discovered a critical bug in wilcox.test? Can we now trust this function? Let’s find out!
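The expected minimum p-value and the reported discrepancy can be double-checked with a few lines of Python. This is a small sanity-check sketch: the function name `min_one_tailed_p_value` is an assumption, and the value `0.0001337028` is simply copied from the R output above.

```python
from math import comb


def min_one_tailed_p_value(n, m):
    """Minimum one-tailed Mann-Whitney p-value, 1 / C(n + m, n),
    reached when the two samples do not overlap (no ties)."""
    return 1 / comb(n + m, n)


p_min = min_one_tailed_p_value(50, 5)
r_p_value = 0.0001337028  # copied from the wilcox.test output above
print(p_min)              # roughly 2.874587e-07
print(r_p_value / p_min)  # roughly 465
```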

Preprint announcement: 'Weighted quantile estimators'

Tue, 18 Apr 2023 00:00:00 +0000

I have just published a preprint of a paper ‘Weighted quantile estimators’. It’s based on a series of my research notes that I have been writing since September 2020.

The paper preprint is available on arXiv: arXiv:2304.07265 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-wqe. You can cite it as follows:

Andrey Akinshin (2023) “Weighted quantile estimators” arXiv:2304.07265

Abstract:

In this paper, we consider a generic scheme that allows building weighted versions of various quantile estimators, such as traditional quantile estimators based on linear interpolation of two order statistics, the Harrell-Davis quantile estimator and its trimmed modification. The obtained weighted quantile estimators are especially useful in the problem of estimating a distribution at the tail of a time series using quantile exponential smoothing. The presented approach can also be applied to other problems, such as quantile estimation of weighted mixture distributions.

Rethinking Type I/II error rates with power curves

Tue, 11 Apr 2023 00:00:00 +0000

When it comes to the analysis of a statistical significance test design, many people tend to overfocus purely on the Type I error rate. Those who are aware of the importance of power analysis often stop at expressing the Type II error rate as a single number. It is better than nothing, but such an approach always confuses me.

Let us say that the declared Type II error rate is 20% (or the declared statistical power is 80%). What does it actually mean? If the sample size and the significance level (or any other significance criteria) are given, the Type II error rate is a function of the effect size. When we express the Type II error rate as a single number, we always (implicitly or explicitly) assume the target effect size. In most cases, it is an arbitrary number that is somehow chosen to reflect our expectations of the “reasonable” effect size. However, the actual Type II error rate and the corresponding statistical power depend on the actual effect size that we do not know. Some researchers estimate the Type II error rate / statistical power using the measured effect size, but it does not make a lot of sense since it does not provide new information in addition to the measured effect size or p-value. In reality, we have high statistical power (low Type II error rate) for large effect sizes and low statistical power (high Type II error rate) for small effect sizes. Without the knowledge of the actual effect size (which we do not have), the Type II error rate expressed as a single number mostly describes this arbitrarily chosen expected effect size, rather than the actual properties of our statistical test.
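To make the dependence on the effect size concrete, here is a minimal Python sketch of a power curve. It assumes a one-sided one-sample z-test with known unit variance, which is chosen only because its power has a simple closed form; the function name `z_test_power`, the sample size, and the effect size grid are all illustrative assumptions.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha=0.05):
    """Power of a one-sided one-sample z-test with known unit variance:
    1 - Phi(z_{1 - alpha} - effect_size * sqrt(n))."""
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(critical - effect_size * n ** 0.5)


# Hypothetical setup: n = 25 observations and a grid of effect sizes.
power_curve = {d: z_test_power(d, n=25) for d in (0.0, 0.1, 0.2, 0.5, 0.8)}
for d, power in power_curve.items():
    print(f"effect size {d:.1f}: power {power:.3f}, Type II error rate {1 - power:.3f}")
```

With the sample size and the significance level fixed, the same test yields a whole range of Type II error rates, one per effect size.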

Adaptation of continuous scale measures to discrete distributions

Tue, 04 Apr 2023 00:00:00 +0000

In statistics, it is often important to have a reliable measure of scale since it is required for estimating many types of the effect size and for statistical tests. If we work with continuous distributions, there are plenty of available scale measures with various levels of statistical efficiency and robustness. However, when the distribution becomes discrete (e.g., because of the limited resolution of the measurement tools), classic measures of scale can collapse to zero due to tied values in collected samples. This can be a severe problem in the analysis since the scale measures are often used as denominators in various equations. To make the calculations more reliable, it is important to handle such situations somehow and ensure that the target scale measure never becomes zero. In this post, I discuss a simple approach to work around this problem and adapt any given measure of scale to the discrete case.
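The collapse to zero is easy to reproduce. The Python sketch below illustrates only the problem, not the workaround proposed in the post: it uses the unscaled median absolute deviation as an example scale measure, and the function name `mad` and the toy sample are assumptions.

```python
from statistics import median


def mad(values):
    """Unscaled median absolute deviation around the sample median."""
    center = median(values)
    return median(abs(v - center) for v in values)


# Hypothetical measurements rounded to a coarse resolution, hence many ties.
rounded = [10, 10, 10, 10, 11, 12]
scale = mad(rounded)
print(scale)  # 0: more than half of the values coincide with the median

# Any statistic that divides by this scale (e.g., an effect size) is now undefined.
try:
    effect_size = (12 - 10) / scale
except ZeroDivisionError:
    print("the scale measure collapsed to zero")
```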

Weighted modification of the Hodges-Lehmann location estimator

Tue, 28 Mar 2023 00:00:00 +0000

The classic Hodges-Lehmann location estimator is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{1 \leq i < j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right). $$

This estimator works great for non-weighted samples (its asymptotic Gaussian efficiency is $\approx 96\%$, and its asymptotic breakdown point is $\approx 29\%$). However, in real-world applications, data points may have varying importance or relevance. For example, in finance, different stocks may have different market capitalizations, which can impact the overall performance of an index. In social science research, survey responses may be weighted based on demographic representation to ensure that the final results are more generalizable. In software performance measurements, the observations may be collected from different source code revisions, some of which may be obsolete. In these cases, the classic $\operatorname{HL}$-measure is not suitable, as it treats each data point equally.

We can overcome this problem using weighted samples to obtain more accurate and meaningful central tendency estimates. Unfortunately, there is no well-established definition of the weighted Hodges-Lehmann location estimator. In this blog post, we introduce such a definition so that we can apply this estimator to weighted samples keeping it compatible with the original version.

Performance stability of GitHub Actions

Tue, 21 Mar 2023 00:00:00 +0000

Nowadays, GitHub Actions is one of the most popular free CI systems. It’s quite convenient to use it to run unit and integration tests. However, some developers try to use it to run benchmarks and performance tests. Unfortunately, default GitHub Actions build agents do not provide a consistent execution environment from the performance point of view. Therefore, performance measurements from different builds cannot be compared. This makes it almost impossible to set up reliable performance tests based on the default GitHub Actions build agent pool.

So, it’s expected that the execution environments are not absolutely identical. But how bad is the situation? What’s the maximum difference between performance measurements from different builds? Is there a chance that we can play with thresholds and utilize GitHub Actions to detect at least major performance degradations? Let’s find out!

p-value distribution of the Brunner–Munzel test in the finite case

Tue, 14 Mar 2023 00:00:00 +0000

In one of the previous posts, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.

Comparing statistical power of the Mann-Whitney U test and the Brunner-Munzel test

Tue, 07 Mar 2023 00:00:00 +0000

In this post, we perform a short numerical simulation to compare the statistical power of the Mann-Whitney U test and the Brunner-Munzel test under normality for various sample sizes and significance levels.

p-value distribution of the Mann–Whitney U test in the finite case

Tue, 28 Feb 2023 00:00:00 +0000

When we work with null hypothesis significance testing and the null hypothesis is true, the distribution of observed p-values is asymptotically uniform. However, the distribution shape is not always uniform in the finite case. For example, when we work with rank-based tests like the Mann–Whitney U test, the distribution of the p-values is discrete with a limited set of possible values. This should be taken into account when we design a testing procedure for small samples and choose the significance level.

Previously, we already discussed the minimum reasonable significance level of the Mann-Whitney U test for small samples. In this post, we explore the full distribution of the p-values for this case.
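The discreteness is easy to see by brute force for tiny samples. The following Python sketch enumerates every assignment of ranks to the first sample under the null hypothesis, assuming no ties; the function names `u_null_distribution` and `one_tailed_p_values` are illustrative assumptions, and the approach is feasible only for small $n$ and $m$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def u_null_distribution(n, m):
    """Exact null distribution of the Mann-Whitney U statistic (no ties),
    obtained by enumerating all C(n + m, n) rank assignments for the first sample."""
    counts = Counter()
    for x_ranks in combinations(range(1, n + m + 1), n):
        u = sum(x_ranks) - n * (n + 1) // 2
        counts[u] += 1
    total = sum(counts.values())
    return {u: Fraction(c, total) for u, c in sorted(counts.items())}


def one_tailed_p_values(n, m):
    """All attainable one-tailed p-values P(U >= u) under the null hypothesis."""
    dist = u_null_distribution(n, m)
    p_values = []
    tail = Fraction(0)
    for u in sorted(dist, reverse=True):
        tail += dist[u]
        p_values.append(tail)
    return sorted(p_values)


print([str(p) for p in one_tailed_p_values(3, 3)])
```

For $n = m = 3$, there are only ten attainable one-tailed p-values, and the smallest one is $1/20$, so a significance level below $0.05$ can never be reached.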

Corner case of the Brunner–Munzel test

Tue, 21 Feb 2023 00:00:00 +0000

The Brunner–Munzel test is a nonparametric significance test, which can be considered an alternative to the Mann–Whitney U test. However, the Brunner–Munzel test has a corner case that can cause some practical issues with applying this test to real data. In this post, I briefly discuss the test itself and the corresponding corner case.

Examples of the Mann–Whitney U test misuse cases

Tue, 14 Feb 2023 00:00:00 +0000

The Mann–Whitney U test is one of the most popular nonparametric statistical tests. Its alternative hypothesis claims that one distribution is stochastically greater than the other. However, people often misuse this test and try to apply it to check if two nonparametric distributions are not identical or that there is a difference in distribution medians (while there are no additional assumptions on the shapes of the distributions). In this post, I show several cases in which the Mann–Whitney U test is not applicable for comparing two distributions.

Types of finite-sample consistency with the standard deviation

Tue, 07 Feb 2023 00:00:00 +0000

Let us say we have a robust dispersion estimator $\operatorname{T}(X)$. If it is asymptotically consistent with the standard deviation, we can use such an estimator as a robust replacement for the standard deviation under normality. Thanks to asymptotic consistency, we can use the estimator “as is” for large samples. However, if the number of sample elements is small, we typically need finite-sample bias-correction factors to make the estimator unbiased. Here we should clearly understand what kind of consistency we need.

There are various ways to estimate the standard deviation. Let us consider a sample of random variables $X = \{ X_1, X_2, \ldots, X_n \}$. The most popular equation of the standard deviation is given by

$$ s(X) = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X})^2}. $$

Using this definition, we can get an unbiased estimator for the population variance: under the standard normal distribution (where the true variance is $1$), $\mathbb{E}[s^2(X)] = 1$. However, it is a biased estimator for the population standard deviation: $\mathbb{E}[s(X)] \neq 1$. To obtain the corresponding unbiased estimator, we should use $s(X) / c_4(n)$, where $c_4(n)$ is a correction factor defined as follows:

$$ c_4(n) = \sqrt{\frac{2}{n-1}} \cdot \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}. $$

When we define finite-sample bias-correction factors for a robust standard deviation replacement, we should choose which kind of consistency we need. In this post, I briefly explore available options.
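The correction factor $c_4(n)$ from the equation above is straightforward to evaluate numerically. Here is a minimal Python sketch; the function name `c4` is an assumption, and log-gamma is used only to avoid overflow of the gamma function for large $n$.

```python
from math import exp, lgamma, sqrt


def c4(n):
    """Correction factor c_4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2)."""
    if n < 2:
        raise ValueError("c4 is defined for n >= 2")
    return sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))


for n in (2, 3, 5, 10, 100, 1000):
    print(n, round(c4(n), 4))
```

The factor is noticeably below $1$ for tiny samples (about $0.798$ for $n = 2$) and approaches $1$ as $n$ grows, which is why the correction matters mostly for small samples.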

Debunking the myth about ozone holes, NASA, and outlier removal

Tue, 31 Jan 2023 00:00:00 +0000

Imagine you work with some data and assume that the underlying distribution is approximately normal. In such cases, the data analysis typically involves non-robust statistics like the mean and the standard deviation. While these metrics are highly efficient under normality, they make the analysis procedure fragile: a single extreme value can corrupt all the results. You may not expect any significant outliers, but you can never be 100% sure. To avoid surprises and ensure the reliability of the results, it may be tempting to automatically exclude all outliers from the collected samples. While this approach is widely adopted, it conceals an essential part of the obtained data and can lead to fallacious conclusions.

Let me recite a classic story about ozone holes from kandel1990, which is typically used to illustrate the danger of blind outlier removal:

Nonparametric effect size: Cohen's d vs. Glass's delta

Tue, 24 Jan 2023 00:00:00 +0000

In the previous posts, I discussed the idea of nonparametric effect size measures consistent with Cohen’s d under normality. However, Cohen’s d is not always the best effect size measure, even in the normal case.

In this post, we briefly discuss a case study in which a nonparametric version of Glass’s delta is preferable to the previously suggested Cohen’s d-consistent measure.

Trinal statistical thresholds

Tue, 17 Jan 2023 00:00:00 +0000

When we design a test for practical significance, which compares two samples, we should somehow express the threshold. The most popular options are the shift, the ratio, and the effect size. Unfortunately, if we have little information about the underlying distributions, it’s hard to get a reliable test based only on a single threshold. And it’s almost impossible to define a generic threshold that fits all situations. After struggling with a lot of different thresholding approaches, I came up with the idea of setting a trinal threshold that includes three individual thresholds for the shift, the ratio, and the effect size.

In this post, I show some examples in which a single threshold is not enough.

Trimmed Hodges-Lehmann location estimator, Part 2: Gaussian efficiency

Tue, 10 Jan 2023 00:00:00 +0000

In the previous post, we introduced the trimmed Hodges-Lehmann location estimator. For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, it is defined as follows:

$$ \operatorname{THL}(\mathbf{x}, k) = \underset{k < i < j \leq n - k}{\operatorname{median}}\biggl(\frac{x_{(i)} + x_{(j)}}{2}\biggr). $$

We also derived the exact expression for its asymptotic and finite-sample breakdown point values. In this post, we explore its Gaussian efficiency.
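
As a rough illustration of the definition above, here is a minimal Python sketch (not the implementation from the post). The function name `trimmed_hodges_lehmann` is hypothetical. It sorts the sample, drops the $k$ smallest and $k$ largest values, and takes the median of the pairwise averages of the remaining order statistics with $i < j$.

```python
from itertools import combinations
from statistics import median


def trimmed_hodges_lehmann(x, k):
    """Median of (x_(i) + x_(j)) / 2 over k < i < j <= n - k (1-based indices)."""
    xs = sorted(x)
    n = len(xs)
    # 0-based slice [k, n - k) corresponds to order statistics x_(k+1), ..., x_(n-k)
    core = xs[k:n - k]
    if len(core) < 2:
        raise ValueError("need at least two elements after trimming")
    return median((a + b) / 2 for a, b in combinations(core, 2))


sample = [1, 2, 3, 4, 1000]
print(trimmed_hodges_lehmann(sample, 0))  # no trimming
print(trimmed_hodges_lehmann(sample, 1))  # drops 1 and 1000 before pairing
```

The sketch enumerates all pairs, so it needs quadratic time and memory in the trimmed sample size; it is meant only to make the index range in the formula concrete.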

Trimmed Hodges-Lehmann location estimator, Part 1: breakdown point

Tue, 03 Jan 2023 00:00:00 +0000

For a sample $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$, the Hodges-Lehmann location estimator is defined as follows:

$$ \operatorname{HL}(\mathbf{x}) = \underset{i < j}{\operatorname{median}}\biggl(\frac{x_i + x_j}{2}\biggr). $$

Its asymptotic Gaussian efficiency is $\approx 96\%$, while its asymptotic breakdown point is $\approx 29\%$. This makes the Hodges-Lehmann location estimator a decent robust alternative to the mean.

While the Gaussian efficiency is quite impressive (almost as efficient as the mean), the breakdown point is not as great as in the case of the median (which has a breakdown point of $50\%$). Could we change this trade-off a little bit and make this estimator more robust, sacrificing a small portion of efficiency? Yes, we can!

In this post, I want to present the idea of the trimmed Hodges-Lehmann location estimator and provide the exact equation for its breakdown point.

Median of the shifts vs. shift of the medians, Part 2: Gaussian efficiency

Tue, 27 Dec 2022 00:00:00 +0000

In the previous post, we discussed the difference between shifts of the medians and the Hodges-Lehmann location shift estimator. In this post, we conduct a simple numerical simulation to evaluate the Gaussian efficiency of these two estimators.

Median of the shifts vs. shift of the medians, Part 1

Tue, 20 Dec 2022 00:00:00 +0000

Let us say that we have two samples $x = \{ x_1, x_2, \ldots, x_n \}$, $y = \{ y_1, y_2, \ldots, y_m \}$, and we want to estimate the shift of locations between them. In the case of the normal distribution, this task is quite simple and has a lot of straightforward solutions. However, in the nonparametric case, the location shift is an ambiguous metric which heavily depends on the chosen estimator. In the context of this post, we consider two approaches that may look similar. The first one is the shift of the medians:

$$ \newcommand{\DSM}{\Delta_{\operatorname{SM}}} \DSM = \operatorname{median}(y) - \operatorname{median}(x). $$

The second one is the median of all pairwise shifts, also known as the Hodges-Lehmann location shift estimator:

$$ \newcommand{\DHL}{\Delta_{\operatorname{HL}}} \DHL = \operatorname{median}(y_j - x_i). $$

In the case of the normal distributions, these estimators are consistent. However, this post will show an example of multimodal distributions that lead to opposite signs of $\DSM$ and $\DHL$.
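
To make the two definitions concrete, here is a small Python sketch. The function names and the toy samples are hypothetical and are not taken from the post; the samples only show that the two estimators can disagree noticeably, not the sign flip discussed in the post.

```python
from statistics import median


def shift_of_medians(x, y):
    """Difference between the sample medians: median(y) - median(x)."""
    return median(y) - median(x)


def hodges_lehmann_shift(x, y):
    """Median of all pairwise differences y_j - x_i."""
    return median(yj - xi for xi in x for yj in y)


x = [0, 0, 10]
y = [1, 9, 9]
print(shift_of_medians(x, y))      # 9
print(hodges_lehmann_shift(x, y))  # 1
```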

Resistance to the low-density regions: the Hodges-Lehmann location estimator

Tue, 13 Dec 2022 00:00:00 +0000

In the previous posts, I discussed the concept of a resistance function that shows the sensitivity of the given estimator to the low-density regions. I already showed how this function behaves for the mean, the sample median, and the Harrell-Davis median. In this post, I explore this function for the Hodges-Lehmann location estimator.

Kernel density estimation boundary correction: reflection (ggplot2 v3.4.0)

Tue, 06 Dec 2022 00:00:00 +0000

Kernel density estimation (KDE) is a popular way to approximate a distribution based on the given data. However, it has several flaws. One of the most significant flaws is that it extends the support of the distribution. It is pretty unfortunate: even if we know the actual range of supported values, KDE provides non-zero density values for the regions where no values exist. It is obviously an inaccurate estimation. The procedure of adjusting the KDE values according to the given boundaries is known as boundary correction. As usual, there are plenty of available boundary correction strategies.

One such strategy was implemented in the v3.4.0 update of ggplot2 (a popular R package for plotting) thanks to pull request #4013. At the present moment, it supports a single boundary correction strategy called reflection. In this post, we discuss this approach and see how it works in practice.
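
The general idea of reflection can be sketched in a few lines of Python. This is an illustration of the textbook approach under stated assumptions (a Gaussian kernel, a fixed bandwidth `h`, a single reflection at each boundary), not a port of the ggplot2 code; all function names are hypothetical.

```python
import math


def gaussian_kde(points, h):
    """Plain Gaussian KDE with bandwidth h; returns a density function."""
    norm = len(points) * h * math.sqrt(2 * math.pi)

    def density(t):
        return sum(math.exp(-0.5 * ((t - p) / h) ** 2) for p in points) / norm

    return density


def reflected_kde(points, h, lower, upper):
    """KDE on [lower, upper]: the mass that leaks outside is mirrored back."""
    base = gaussian_kde(points, h)

    def density(t):
        if t < lower or t > upper:
            return 0.0  # no density outside the known support
        # add the mirror images of the estimate around both boundaries
        return base(t) + base(2 * lower - t) + base(2 * upper - t)

    return density


kde = reflected_kde([0.05, 0.1, 0.5, 0.9], h=0.1, lower=0.0, upper=1.0)
print(kde(-0.01), kde(0.0), kde(0.5))
```

With a single reflection, the corrected estimate integrates to approximately one as long as the bandwidth is small relative to the width of the support.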

Sheather & Jones vs. unbiased cross-validation

Tue, 29 Nov 2022 00:00:00 +0000

In the post about the importance of kernel density estimation bandwidth, I reviewed several bandwidth selectors and showed their impact on the KDE. The classic selectors like Scott’s rule of thumb or Silverman’s rule of thumb are designed for the normal distribution and perform poorly in non-parametric cases. One of the most significant caveats is that they can mask multimodality. The same problem is also relevant to the biased cross-validation method. Among all the bandwidth selectors available in R, only Sheather & Jones and unbiased cross-validation provide reliable results in the multimodal case. However, I always advocate using the Sheather & Jones method rather than the unbiased cross-validation approach.

In this post, I will show the drawbacks of the unbiased cross-validation method and what kind of problems we can get if we use it as a KDE bandwidth selector.

Resistance to the low-density regions: the Harrell-Davis median

Tue, 22 Nov 2022 00:00:00 +0000

In the previous post, we defined the resistance function that shows the sensitivity of the given estimator to the low-density regions. We also showed the resistance function plots for the mean and the sample median. In this post, we explore the corresponding plots for the Harrell-Davis median.

Resistance to the low-density regions: the mean and the median

Tue, 15 Nov 2022 00:00:00 +0000

When we discuss resistant statistics, we typically assume resistance to extreme values. However, extreme values are not the only problem source that can violate usual assumptions about expected metric distribution. The low-density regions which often arise in multimodal distributions can also corrupt the results of the statistical analysis. In this post, I discuss this problem and introduce a measure of resistance to low-density regions.

Finite-sample Gaussian efficiency of the trimmed Harrell-Davis median estimator

Tue, 08 Nov 2022 00:00:00 +0000

In the previous post, we obtained the finite-sample Gaussian efficiency values of the sample median and the Harrell-Davis median. In this post, we extend these results and get the finite-sample Gaussian efficiency values of the trimmed Harrell-Davis median estimator based on the highest density interval of the width $1/\sqrt{n}$.

Finite-sample Gaussian efficiency of the Harrell-Davis median estimator

Tue, 01 Nov 2022 00:00:00 +0000

In this post, we explore finite-sample and asymptotic Gaussian efficiency values of the sample median and the Harrell-Davis median.

Weighted quantile estimation for a weighted mixture distribution

Tue, 25 Oct 2022 00:00:00 +0000

Let $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$ be a sample of size $n$. We assign non-negative weight coefficients $w_i$ with a positive sum for all sample elements:

$$ \mathbf{w} = \{ w_1, w_2, \ldots, w_n \}, \quad w_i \geq 0, \quad \sum_{i=1}^{n} w_i > 0. $$

For simplification, we also consider normalized (standardized) weights $\overline{\mathbf{w}}$:

$$ \overline{\mathbf{w}} = \{ \overline{w}_1, \overline{w}_2, \ldots, \overline{w}_n \}, \quad \overline{w}_i = \frac{w_i}{\sum_{i=1}^{n} w_i}. $$

In the non-weighted case, we can consider a quantile estimator $\operatorname{Q}(\mathbf{x}, p)$ that estimates the $p^\textrm{th}$ quantile of the underlying distribution. We want to build a weighted quantile estimator $\operatorname{Q}(\mathbf{x}, \mathbf{w}, p)$ so that we can estimate the quantiles of a weighted sample.

In this post, we consider a specific problem of estimating quantiles of a weighted mixture distribution.

Preprint announcement: 'Finite-sample Rousseeuw-Croux scale estimators'

Tue, 18 Oct 2022 00:00:00 +0000

Recently, I published a preprint of a paper ‘Finite-sample Rousseeuw-Croux scale estimators’. It’s based on a series of my research notes.

The paper preprint is available on arXiv: arXiv:2209.12268 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-frc. You can cite it as follows:

Andrey Akinshin (2022) “Finite-sample Rousseeuw-Croux scale estimators” arXiv:2209.12268

Abstract:

The Rousseeuw-Croux $S_n$, $Q_n$ scale estimators and the median absolute deviation $\operatorname{MAD}_n$ can be used as consistent estimators for the standard deviation under normality. All of them are highly robust: the breakdown point of all three estimators is $50\%$. However, $S_n$ and $Q_n$ are much more efficient than $\operatorname{MAD}_n$: their asymptotic Gaussian efficiency values are $58\%$ and $82\%$ respectively compared to $37\%$ for $\operatorname{MAD}_n$. Although these values look impressive, they are only asymptotic values. The actual Gaussian efficiency of $S_n$ and $Q_n$ for small sample sizes is noticeably lower than in the asymptotic case.

The original work by Rousseeuw and Croux (1993) provides only rough approximations of the finite-sample bias-correction factors for $S_n,\, Q_n$ and brief notes on their finite-sample efficiency values. In this paper, we perform extensive Monte-Carlo simulations in order to obtain refined values of the finite-sample properties of the Rousseeuw-Croux scale estimators. We present accurate values of the bias-correction factors and Gaussian efficiency for small samples ($n \leq 100$) and prediction equations for samples of larger sizes.

Sensitivity curve of the Harrell-Davis quantile estimator, Part 3

Tue, 11 Oct 2022 00:00:00 +0000

In the previous posts (1, 2), I have explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal distribution, the exponential distribution, and the Cauchy distribution. In this post, I build these sensitivity curves for some additional distributions.

Sensitivity curve of the Harrell-Davis quantile estimator, Part 2

Tue, 04 Oct 2022 00:00:00 +0000

In the previous post, I have explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal distribution. In this post, I continue the same investigation on the exponential and Cauchy distributions.

Sensitivity curve of the Harrell-Davis quantile estimator, Part 1

Tue, 27 Sep 2022 00:00:00 +0000

The Harrell-Davis quantile estimator is an efficient replacement for the traditional quantile estimator, especially in the case of light-tailed distributions. Unfortunately, it is not robust: its breakdown point is zero. However, the breakdown point is not the only descriptor of robustness. While the breakdown point describes the portion of the distribution that should be replaced by arbitrarily large values to corrupt the estimation, it does not describe the actual impact of finite outliers. The arithmetic mean also has a breakdown point of zero, but the practical robustness of the mean and that of the Harrell-Davis quantile estimator are not the same. The Harrell-Davis quantile estimator is an L-estimator that assigns extremely low weights to sample elements near the tails (especially for reasonably large sample sizes). Therefore, the actual impact of potential outliers is not so noticeable. In this post, we use the standardized sensitivity curve to evaluate this impact.

Weighted quantile estimators for exponential smoothing and mixture distributions

Tue, 20 Sep 2022 00:00:00 +0000

There are various ways to estimate quantiles of weighted samples. The choice of the most appropriate weighted quantile estimator depends not only on the estimator’s own properties but also on the goal.

Let us consider two problems:

Estimating quantiles of a weighted mixture distribution.
In this problem, we have a weighted mixture distribution given by $F = \sum_{i=1}^m w_i F_i$. We collect samples $\mathbf{x_1}, \mathbf{x_2}, \ldots, \mathbf{x_m}$ from $F_1, F_2, \ldots, F_m$, and want to estimate the quantile function $F^{-1}$ of the mixture distribution based on the given samples.
Quantile exponential smoothing.
In this problem, we have a time series $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$. We want to describe the distribution “at the end” of this time series. The latest series element $x_n$ is the most “actual” one, but we cannot build a distribution based on a single element. Therefore, we have to consider more elements at the end of $\mathbf{x}$. However, if we take too many elements, we may corrupt the estimations due to obsolete measurements. To resolve this problem, we can assign weights to all elements according to the exponential law and estimate weighted quantiles.

In both problems, the usage of weighted quantile estimators looks like a reasonable solution. However, in each problem, we have different expectations of the estimator behavior. In this post, we provide an example that illustrates the difference in these expectations.
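
For the exponential smoothing problem, the weights could be generated as in the following sketch. The half-life parameterization and the function name `exponential_weights` are assumptions made for illustration; the text above only says that the weights follow the exponential law.

```python
def exponential_weights(n, half_life):
    """Weights for x_1, ..., x_n: the last element gets weight 1,
    and the weight halves every `half_life` steps back in time."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return [2.0 ** (-(n - 1 - i) / half_life) for i in range(n)]


print(exponential_weights(3, half_life=1.0))  # [0.25, 0.5, 1.0]
```

Such weights can then be passed to whichever weighted quantile estimator is chosen.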

The Huggins-Roy family of effective sample sizes

Tue, 13 Sep 2022 00:00:00 +0000

When we work with weighted samples, it’s essential to introduce adjustments for the sample size. Indeed, let’s consider the following two weighted samples:

$$ \mathbf{x}_1 = \{ x_1, x_2, \ldots, x_n \}, \quad \mathbf{w}_1 = \{ w_1, w_2, \ldots, w_n \}, $$ $$ \mathbf{x}_2 = \{ x_1, x_2, \ldots, x_n, x_{n+1} \}, \quad \mathbf{w}_2 = \{ w_1, w_2, \ldots, w_n, 0 \}. $$

Since the weight of $x_{n+1}$ in the second sample is zero, it’s natural to expect that both samples have the same set of properties. However, there is a major difference between $\mathbf{x}_1$ and $\mathbf{x}_2$: their sample sizes, which are $n$ and $n+1$. In order to eliminate this difference, we typically introduce the effective sample size (ESS), which is estimated based on the list of weights.

There are various ways to estimate the ESS. In this post, we briefly discuss the Huggins-Roy family of ESS.
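
The zero-weight example can be checked with one widely used ESS definition, Kish’s formula $(\sum w_i)^2 / \sum w_i^2$. Using Kish’s formula here is an assumption for the sake of illustration; the function name `kish_ess` is hypothetical, and the sketch does not cover the other members of the family.

```python
def kish_ess(weights):
    """Kish's effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return total ** 2 / sum(w * w for w in weights)


print(kish_ess([1, 1, 1]))     # 3.0
print(kish_ess([1, 1, 1, 0]))  # 3.0: the zero-weight element does not count
```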

Finite-sample bias correction factors for Rousseeuw-Croux scale estimators

Tue, 06 Sep 2022 00:00:00 +0000

The Rousseeuw-Croux scale estimators $S_n$ and $Q_n$ are efficient alternatives to the median absolute deviation ($\operatorname{MAD}_n$). While all three estimators have the same breakdown point of $50\%$, $S_n$ and $Q_n$ have higher statistical efficiency than $\operatorname{MAD}_n$. The asymptotic Gaussian efficiency values of $\operatorname{MAD}_n$, $S_n$, and $Q_n$ are $37\%$, $58\%$, and $82\%$ respectively.

Using scale constants, we can make $S_n$ and $Q_n$ consistent estimators for the standard deviation under normality. The asymptotic values of these constants are well-known. However, for finite samples, only approximate scale constants are known. In this post, we provide refined values of these constants with higher accuracy.

Preprint announcement: 'Quantile absolute deviation'

Thu, 01 Sep 2022 00:00:00 +0000

I have just published a preprint of a paper ‘Quantile absolute deviation’. It’s based on a series of my research notes that I have been writing since December 2020.

The paper preprint is available on arXiv: arXiv:2208.13459 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-qad. You can cite it as follows:

Andrey Akinshin (2022) “Quantile absolute deviation” arXiv:2208.13459

Abstract:

The median absolute deviation (MAD) is a popular robust measure of statistical dispersion. However, when it is applied to non-parametric distributions (especially multimodal, discrete, or heavy-tailed), lots of statistical inference issues arise. Even when it is applied to distributions with slight deviations from normality and these issues are not actual, the Gaussian efficiency of the MAD is only 37% which is not always enough.

In this paper, we introduce the quantile absolute deviation (QAD) as a generalization of the MAD. This measure of dispersion provides a flexible approach to analyzing properties of non-parametric distributions. It also allows controlling the trade-off between robustness and statistical efficiency. We use the trimmed Harrell-Davis median estimator based on the highest density interval of the given width as a complimentary median estimator that gives increased finite-sample Gaussian efficiency compared to the sample median and a breakdown point matched to the QAD.

As a rule of thumb, we suggest using two new measures of dispersion called the standard QAD and the optimal QAD. They give 54% and 65% of Gaussian efficiency having breakdown points of 32% and 14% respectively.

Standard trimmed Harrell-Davis median estimator

Wed, 31 Aug 2022 00:00:00 +0000

In one of the previous posts, I suggested a new measure of dispersion called the standard quantile absolute deviation around the median ($\operatorname{SQAD}$) which can be used as an alternative to the median absolute deviation ($\operatorname{MAD}$) as a consistent estimator for the standard deviation under normality. The Gaussian efficiency of $\operatorname{SQAD}$ is $54\%$ (compared to $37\%$ for MAD), and its breakdown point is $32\%$ (compared to $50\%$ for MAD). $\operatorname{SQAD}$ is a symmetric dispersion measure around the median: the interval $[\operatorname{Median} - \operatorname{SQAD}; \operatorname{Median} + \operatorname{SQAD}]$ covers $68\%$ of the distribution. In the case of the normal distribution, this corresponds to the interval $[\mu - \sigma; \mu + \sigma]$.

If we use $\operatorname{SQAD}$, we accept the breakdown point of $32\%$. This makes the sample median a non-optimal choice for the median estimator. Indeed, the sample median has high robustness (the breakdown point is $50\%$), but relatively poor Gaussian efficiency. If we use $\operatorname{SQAD}$, it doesn’t make sense to require a breakdown point of more than $32\%$. Therefore, we could trade the median robustness for efficiency and come up with a complementary measure of the median for $\operatorname{SQAD}$.

In this post, we introduce the standard trimmed Harrell-Davis median estimator which shares the breakdown point with $\operatorname{SQAD}$ and provides better finite-sample efficiency compared to the sample median.

Optimal quantile absolute deviation

Tue, 30 Aug 2022 00:00:00 +0000

We consider the quantile absolute deviation around the median defined as follows:

$$ \newcommand{\E}{\mathbb{E}} \newcommand{\PR}{\mathbb{P}} \newcommand{\Q}{\operatorname{Q}} \newcommand{\OQAD}{\operatorname{OQAD}} \newcommand{\QAD}{\operatorname{QAD}} \newcommand{\median}{\operatorname{median}} \newcommand{\Exp}{\operatorname{Exp}} \newcommand{\SD}{\operatorname{SD}} \newcommand{\V}{\mathbb{V}} \QAD(X, p) = K_p \Q(|X - \median(X)|, p), $$

where $\Q$ is a quantile estimator, and $K_p$ is a scale constant which we use to make $\QAD(X, p)$ an asymptotically consistent estimator for the standard deviation under the normal distribution.

In this post, we get the exact values of $K_p$, derive the corresponding equation for the asymptotic Gaussian efficiency of $\QAD(X, p)$, and find the point at which $\QAD(X, p)$ achieves the highest Gaussian efficiency.
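
A minimal Python sketch of this definition is given below. It rests on two assumptions that are not stated in the text above. First, $\Q$ is taken to be a sample quantile with linear interpolation between order statistics. Second, the scale constant is computed as $K_p = 1 / \Phi^{-1}((p + 1) / 2)$, which follows from the fact that for a normal distribution $\PR(|X - \mu| \leq t) = 2 \Phi(t / \sigma) - 1$. All function names are hypothetical.

```python
import math
from statistics import NormalDist, median


def quantile(values, p):
    """Sample quantile with linear interpolation (an assumed choice of Q)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def scale_constant(p):
    """K_p = 1 / Phi^{-1}((p + 1) / 2), assuming consistency under normality."""
    return 1.0 / NormalDist().inv_cdf((p + 1) / 2)


def qad(x, p):
    """Quantile absolute deviation around the median, scaled by K_p."""
    m = median(x)
    return scale_constant(p) * quantile([abs(v - m) for v in x], p)


print(scale_constant(0.5))        # the familiar MAD constant, about 1.4826
print(qad([1, 2, 3, 4, 5], 0.5))
```

For $p = 0.5$, the sketch reduces to the usual scaled median absolute deviation.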

Quantile absolute deviation of the Pareto distribution

Mon, 29 Aug 2022 00:00:00 +0000

In this post, we derive the exact equation for the quantile absolute deviation around the median of the Pareto(1,1) distribution.

Quantile absolute deviation of the Exponential distribution

Fri, 26 Aug 2022 00:00:00 +0000

In this post, we derive the exact equation for the quantile absolute deviation around the median of the Exponential distribution.

Quantile absolute deviation of the Uniform distribution

Thu, 25 Aug 2022 00:00:00 +0000

In this post, we derive the exact equation for the quantile absolute deviation around the median of the Uniform distribution.

Quantile absolute deviation of the Normal distribution

Wed, 24 Aug 2022 00:00:00 +0000

In this post, we derive the exact equation for the quantile absolute deviation around the median of the Normal distribution.

Standard quantile absolute deviation

Tue, 23 Aug 2022 00:00:00 +0000

The median absolute deviation (MAD) is a popular robust replacement of the standard deviation (StdDev). It’s truly robust: its breakdown point is $50\%$. However, it’s not so efficient when we use it as a consistent estimator for the standard deviation under normality: the asymptotic relative efficiency against StdDev (we call it the Gaussian efficiency) is only $\approx 37\%$.

In practice, such robustness is not always essential, while we typically want to have the highest possible efficiency. I already described the concept of the quantile absolute deviation which aims to provide a customizable trade-off between robustness and efficiency. In this post, I would like to suggest a new default option for this measure of dispersion called the standard quantile absolute deviation. Its Gaussian efficiency is $\approx 54\%$, while the breakdown point is $\approx 32\%$.

Asymptotic Gaussian efficiency of the quantile absolute deviation

Tue, 16 Aug 2022 00:00:00 +0000

I have already discussed the concept of the quantile absolute deviation in several previous posts. In this post, we derive the equation for the relative statistical efficiency of the quantile absolute deviation against the standard deviation under the normal distribution (the so-called Gaussian efficiency).

Finite-sample efficiency of the Rousseeuw-Croux estimators

Tue, 09 Aug 2022 00:00:00 +0000

The Rousseeuw-Croux $S_n$ and $Q_n$ estimators are robust and efficient measures of scale. Their breakdown points are equal to $0.5$ which is also the breakdown point of the median absolute deviation (MAD). However, their statistical efficiency values are much better than the efficiency of MAD. To be specific, the MAD asymptotic relative Gaussian efficiency against the standard deviation is about $37\%$, whereas the corresponding values for $S_n$ and $Q_n$ are $58\%$ and $82\%$ respectively. Although these numbers are quite impressive, they are only asymptotic values. In practice, we work with finite samples. And the finite-sample efficiency could be much lower than the asymptotic one. In this post, we perform a simulation study in order to obtain the actual finite-sample efficiency values for these two estimators.

Caveats of using the median absolute deviation

Tue, 02 Aug 2022 00:00:00 +0000

The median absolute deviation is a measure of dispersion which can be used as a robust alternative to the standard deviation. It works great for slight deviations from normality (e.g., for contaminated normal distributions or slightly skewed unimodal distributions). Unfortunately, if we apply it to distributions with huge deviations from normality, we may experience a lot of trouble. In this post, I discuss some of the most important caveats which we should keep in mind if we use the median absolute deviation.

Preprint announcement: 'Finite-sample bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification'

Tue, 26 Jul 2022 00:00:00 +0000

I have just published a preprint of a paper ‘Finite-sample bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification’. It’s based on a series of my research notes that I have been writing since February 2021.

The paper preprint is available on arXiv: arXiv:2207.12005 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-mad-factors. You can cite it as follows:

Andrey Akinshin (2022) “Finite-sample bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification,” arXiv:2207.12005

Abstract:

The median absolute deviation is a widely used robust measure of statistical dispersion. Using a scale constant, we can use it as an asymptotically consistent estimator for the standard deviation under normality. For finite samples, the scale constant should be corrected in order to obtain an unbiased estimator. The bias-correction factor depends on the sample size and the median estimator. When we use the traditional sample median, the factor values are well known, but this approach does not provide optimal statistical efficiency. In this paper, we present the bias-correction factors for the median absolute deviation based on the Harrell-Davis quantile estimator and its trimmed modification which allow us to achieve better statistical efficiency of the standard deviation estimations. The obtained estimators are especially useful for samples with a small number of elements.

Challenges of change point detection in CI performance data

Tue, 19 Jul 2022 00:00:00 +0000

Change point detection is a popular task in various disciplines. There are many algorithms that solve this problem. For example, in truong2020, the authors presented a classification of different approaches and discussed 35 algorithms. However, not all the algorithms fit all the situations.

In this post, we consider the problem of change point detection in time series based on software performance measurements obtained from a continuous integration (CI) server. Examples of data sources are CI builds, unit tests, benchmarks, performance tests, and so on. We would like to automatically find performance degradations in such time series. Unfortunately, most of the available algorithms do not provide decent solutions for this problem. In this post, I discuss some challenges that arise when we are looking for change points in CI performance data.

Dynamical System Case Study 2 (Piecewise linear LLL-system)

Sun, 17 Jul 2022 00:00:00 +0000

We consider the following dynamical system:

$$ \begin{cases} \dot{x}_1 = L(a_1, k_1, x_3) - k_1 x_1,\\ \dot{x}_2 = L(a_2, k_2, x_1) - k_2 x_2,\\ \dot{x}_3 = L(a_3, k_3, x_2) - k_3 x_3, \end{cases} $$

where $L$ is a piecewise linear function:

$$ L(a, k, x) = \begin{cases} ak & \quad \textrm{for}\; 0 \leq x \leq 1,\\ 0 & \quad \textrm{for}\; 1 < x. \end{cases} $$

In this case study, we build a Shiny application that draws 3D phase portraits of this system for various sets of input parameters.

Degenerate point of dispersion estimators

Tue, 12 Jul 2022 00:00:00 +0000

Recently, I have been searching for a robust statistical dispersion estimator that doesn’t become zero on samples with a huge number of tied values. I have already created a few such estimators, like the middle non-zero quantile absolute deviation (part 1, part 2) and the untied quantile absolute deviation. Having several options to compare, we need a proper metric that allows us to perform such a comparison. Similar to the breakdown point (that is used to describe estimator robustness), we could introduce the degenerate point that describes the resistance of a dispersion estimator to the tied values. In this post, I will briefly describe this concept.

Untied quantile absolute deviation

Tue, 05 Jul 2022 00:00:00 +0000

In the previous posts, I tried to adapt the concept of the quantile absolute deviation to samples with tied values so that this measure of dispersion never becomes zero for nondegenerate ranges. My previous attempt was the middle non-zero quantile absolute deviation (modification 1, modification 2). However, I’m not completely satisfied with the behavior of this metric. In this post, I want to consider another way to work around the problem with tied values.

Middle non-zero quantile absolute deviation, Part 2

Tue, 28 Jun 2022 00:00:00 +0000

In one of the previous posts, I described the idea of the middle non-zero quantile absolute deviation. It’s defined as follows:

$$ \operatorname{MNZQAD}(x, p) = \operatorname{QAD}(x, p, q_m), $$ $$ q_m = \frac{q_0 + 1}{2}, \quad q_0 = \frac{\max(k - 1, 0)}{n - 1}, \quad k = \sum_{i=1}^n \mathbf{1}_{Q(x, p)}(x_i), $$

where $\mathbf{1}$ is the indicator function

$$ \mathbf{1}_U(u) = \begin{cases} 1 & \textrm{if}\quad u = U,\\ 0 & \textrm{if}\quad u \neq U, \end{cases} $$

and $\operatorname{QAD}$ is the quantile absolute deviation

$$ \operatorname{QAD}(x, p, q) = Q(|x - Q(x, p)|, q). $$

The $\operatorname{MNZQAD}$ approach tries to work around a problem with tied values. While it works well in the generic case, there are some corner cases where the suggested metric behaves poorly. In this post, we discuss this problem and how to solve it.
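
The definition above can be transcribed into Python almost literally. In this sketch, $Q$ is assumed to be a sample quantile with linear interpolation, and ties are counted with exact floating-point equality; both choices and all function names are hypothetical and may differ from the original implementation.

```python
import math


def quantile(values, p):
    """Sample quantile with linear interpolation (an assumed choice of Q)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def qad(x, p, q):
    """QAD(x, p, q) = Q(|x - Q(x, p)|, q)."""
    center = quantile(x, p)
    return quantile([abs(v - center) for v in x], q)


def mnzqad(x, p):
    """Middle non-zero QAD: q_m is halfway between q_0 and 1."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two elements")
    center = quantile(x, p)
    k = sum(1 for v in x if v == center)  # number of elements tied with Q(x, p)
    q0 = max(k - 1, 0) / (n - 1)
    qm = (q0 + 1) / 2
    return qad(x, p, qm)


print(mnzqad([1, 2, 3, 4, 5], 0.5))  # no ties: q_m = 0.5
print(mnzqad([0, 0, 0, 0, 1], 0.5))  # heavy ties, yet the result is non-zero
```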

The expected number of takes from a discrete distribution before observing the given element

Tue, 21 Jun 2022 00:00:00 +0000

Let’s consider a discrete distribution $X$ defined by its probability mass function $p_X(x)$. We randomly take elements from $X$ until we observe the given element $x_0$. What’s the expected number of takes in this process?

This classic statistical problem could be solved in various ways. I would like to share one of my favorite approaches that involves the derivative of the series $\sum_{n=0}^\infty x^n$.
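
One way the derivative of the geometric series can enter is sketched below; the notation $p = p_X(x_0)$, $q = 1 - p$, and $N$ for the number of takes is introduced here for illustration, and the draws are assumed to be independent with $p > 0$. The number of takes $N$ equals $n$ when the first $n - 1$ takes miss $x_0$ and the $n^\textrm{th}$ one hits it, so

$$ \mathbb{E}[N] = \sum_{n=1}^\infty n \, p \, q^{n-1} = p \cdot \frac{d}{dq} \sum_{n=0}^\infty q^n = p \cdot \frac{d}{dq} \frac{1}{1 - q} = \frac{p}{(1 - q)^2} = \frac{1}{p}. $$

For example, if $p_X(x_0) = 0.25$, we need four takes on average.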

Folded medians

Tue, 14 Jun 2022 00:00:00 +0000

In the previous post, we discussed Gastwirth’s location estimator. In this post, we continue playing with different location estimators. To be more specific, we consider an approach called folded medians. Let $x = \{ x_1, x_2, \ldots, x_n \}$ be a random sample with order statistics $\{ x_{(1)}, x_{(2)}, \ldots, x_{(n)} \}$. We build a folded sample using the following form:

$$ \Bigg\{ \frac{x_{(1)}+x_{(n)}}{2}, \frac{x_{(2)}+x_{(n-1)}}{2}, \ldots, \Bigg\}. $$

If $n$ is odd, the middle sample element is folded with itself. The folding operation could be applied several times. Once folding is conducted, the median of the final folded sample is the folded median. A single folding operation gives us the Bickel-Hodges estimator.

In this post, we briefly check how this metric behaves in the case of the Normal and Cauchy distributions.
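
The folding procedure described above can be sketched in Python as follows. The function names and the `folds` parameter are hypothetical; the sketch pairs the smallest remaining element with the largest one and folds the middle element with itself when the sample size is odd.

```python
from statistics import median


def fold(x):
    """One folding step: average x_(i) with x_(n + 1 - i)."""
    xs = sorted(x)
    n = len(xs)
    # for odd n, the middle element is paired with itself
    return [(xs[i] + xs[n - 1 - i]) / 2 for i in range((n + 1) // 2)]


def folded_median(x, folds=1):
    """Median of the sample after `folds` folding steps."""
    current = list(x)
    for _ in range(folds):
        current = fold(current)
    return median(current)


print(folded_median([1, 2, 3, 10], folds=1))  # median of [5.5, 2.5]
print(folded_median([1, 2, 6], folds=1))      # median of [3.5, 2.0]
```

With `folds=0`, the function returns the plain sample median.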

Gastwirth's location estimator

Tue, 07 Jun 2022 00:00:00 +0000

Let $x = \{ x_1, x_2, \ldots, x_n \}$ be a random sample. Gastwirth’s location estimator is defined as follows:

$$ 0.3 \cdot Q_{1/3}(x) + 0.4 \cdot Q_{1/2}(x) + 0.3 \cdot Q_{2/3}(x), $$

where $Q_p$ is an estimation of the $p^{\textrm{th}}$ quantile (using classic sample quantiles).

This estimator could be quite interesting from a practical point of view. On the one hand, it’s robust (the breakdown point is ⅓) and it has better statistical efficiency than the classic sample median. On the other hand, it has better computational efficiency than other robust and statistically efficient measures of location like the Harrell-Davis median estimator or the Hodges-Lehmann median estimator.

In this post, we conduct a short simulation study that shows its behavior for the standard Normal distribution and the Cauchy distribution.
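
The estimator itself takes only a few lines of Python. Since the text does not say which variant of the classic sample quantile is meant, the sketch below assumes linear interpolation between order statistics; the function names are hypothetical.

```python
import math


def quantile(values, p):
    """Sample quantile with linear interpolation (an assumed choice of Q_p)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def gastwirth(x):
    """Weighted sum of the tertiles and the median with weights 0.3, 0.4, 0.3."""
    return (0.3 * quantile(x, 1 / 3)
            + 0.4 * quantile(x, 1 / 2)
            + 0.3 * quantile(x, 2 / 3))


print(gastwirth([1, 2, 3, 4, 5, 6, 7]))
print(gastwirth([1, 2, 3, 4, 5, 6, 1000]))  # a single extreme value has no effect here
```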

Dynamical System Case Study 1 (symmetric 3d system)

Sun, 05 Jun 2022 00:00:00 +0000

Let’s consider the following dynamical system:

$$ \begin{cases} \dot{x}_1 = f(x_3) - x_1,\\ \dot{x}_2 = f(x_1) - x_2,\\ \dot{x}_3 = f(x_2) - x_3, \end{cases} $$

where $f(x) = \alpha / (1+x^m)$ is a Hill function. In this case study, we explore the phase portrait of this system for $\alpha = 18,\; m = 3$.
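
A phase-portrait trajectory for this system could be traced numerically as in the sketch below. It uses the explicit Euler method with a hypothetical step size and starting point; these are illustrative choices, not the setup of the case study.

```python
ALPHA = 18.0
M = 3


def hill(x):
    """Hill function f(x) = alpha / (1 + x^m)."""
    return ALPHA / (1.0 + x ** M)


def rhs(state):
    """Right-hand side of the system for state = (x1, x2, x3)."""
    x1, x2, x3 = state
    return (hill(x3) - x1, hill(x1) - x2, hill(x2) - x3)


def euler(state, dt, steps):
    """Explicit Euler integration; returns the list of visited states."""
    trajectory = [tuple(state)]
    for _ in range(steps):
        derivative = rhs(state)
        state = tuple(s + dt * d for s, d in zip(state, derivative))
        trajectory.append(state)
    return trajectory


# (2, 2, 2) is a stationary point because f(2) = 18 / (1 + 8) = 2
print(rhs((2.0, 2.0, 2.0)))
print(euler((1.0, 2.0, 3.0), dt=0.01, steps=5)[-1])
```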

Beeping Busy Beavers and twin prime conjecture

Wed, 01 Jun 2022 00:00:00 +0000

In this post, I use Beeping Busy Beavers to show that the twin prime conjecture could be proven or disproven.

Hodges-Lehmann-Sen shift and shift confidence interval estimators

Tue, 31 May 2022 00:00:00 +0000

In the previous two posts (1, 2), I discussed the Hodges-Lehmann median estimator. The suggested idea of getting median estimations based on a Cartesian product could be adapted to estimate the shift between two samples. In this post, we discuss how to build the Hodges-Lehmann-Sen shift estimator and how to get confidence intervals for the obtained estimations. Also, we perform a simulation study that checks the actual coverage percentage of these intervals.

Statistical efficiency of the Hodges-Lehmann median estimator, Part 2

Tue, 24 May 2022 00:00:00 +0000

In the previous post, we evaluated the relative statistical efficiency of the Hodges-Lehmann median estimator against the sample median under the normal distribution. In this post, we extend this experiment to a set of various light-tailed and heavy-tailed distributions.

Statistical efficiency of the Hodges-Lehmann median estimator, Part 1

Tue, 17 May 2022 00:00:00 +0000

In this post, we evaluate the relative statistical efficiency of the Hodges-Lehmann median estimator against the sample median under the normal distribution. We also compare it with the efficiency of the Harrell-Davis quantile estimator.

Expected value of the maximum of two standard half-normal distributions

Tue, 10 May 2022 00:00:00 +0000

Let $X_1, X_2$ be i.i.d. random variables that follow the standard normal distribution $\mathcal{N}(0,1^2)$. In the previous post, I have found the expected value of $\min(|X_1|, |X_2|)$. Now it’s time to find the expected value of $Z = \max(|X_1|, |X_2|)$.
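Before deriving anything analytically, both expectations can be approximated with a Monte-Carlo simulation. The sketch below is only a sanity check (the function name, sample count, and seed are arbitrary assumptions). It relies on the identity $\min(a, b) + \max(a, b) = a + b$, so the two estimates should add up to approximately $2 \cdot \mathbb{E}|X_1| = 2\sqrt{2/\pi}$.

```python
import math
import random


def simulate(pairs=100_000, seed=1):
    """Monte-Carlo estimates of E[min(|X1|, |X2|)] and E[max(|X1|, |X2|)]."""
    rng = random.Random(seed)
    sum_min = sum_max = 0.0
    for _ in range(pairs):
        a, b = abs(rng.gauss(0, 1)), abs(rng.gauss(0, 1))
        sum_min += min(a, b)
        sum_max += max(a, b)
    return sum_min / pairs, sum_max / pairs


e_min, e_max = simulate()
# The last two printed numbers should be close to each other.
print(e_min, e_max, e_min + e_max, 2 * math.sqrt(2 / math.pi))
```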

Expected value of the minimum of two standard half-normal distributions

Tue, 03 May 2022 00:00:00 +0000

Let $X_1, X_2$ be i.i.d. random variables that follow the standard normal distribution $\mathcal{N}(0,1^2)$. One day I wondered, what is the expected value of $Z = \min(|X_1|, |X_2|)$? It turned out to be a fun exercise. Let’s solve it together!

Unbiased median absolute deviation for n=2

Tue, 26 Apr 2022 00:00:00 +0000

I already covered the topic of the unbiased median absolute deviation based on the traditional sample median, the Harrell-Davis quantile estimator, and the trimmed Harrell-Davis quantile estimator. In all the posts, the values of bias-correction factors were evaluated using the Monte-Carlo simulation. In this post, we calculate the exact value of the bias-correction factor for two-element samples.

Weighted trimmed Harrell-Davis quantile estimator

Tue, 19 Apr 2022 00:00:00 +0000

In this post, I combine ideas from two of my previous posts:

Trimmed Harrell-Davis quantile estimator: quantile estimator that provides an optimal trade-off between statistical efficiency and robustness
Weighted quantile estimators: a general scheme that allows building weighted quantile estimators. Could be used for quantile exponential smoothing and dispersion exponential smoothing.

Thus, we are going to build a weighted version of the trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width.

Minimum meaningful statistical level for the Mann–Whitney U test

Tue, 12 Apr 2022 00:00:00 +0000

The Mann–Whitney U test is one of the most popular nonparametric null hypothesis significance tests. However, like any statistical test, it has limitations. We should always carefully match them with our business requirements. In this post, we discuss how to properly choose the statistical level for the Mann–Whitney U test on small samples.

Let’s say we want to compare two samples $x = \{ x_1, x_2, \ldots, x_n \}$ and $y = \{ y_1, y_2, \ldots, y_m \}$ using the one-sided Mann–Whitney U test. Sometimes, we don’t have an opportunity to gather enough data and we have to work with small samples. Imagine that the size of both samples is six: $n=m=6$. We want to set the statistical level $\alpha$ to $0.001$ (because we really don’t want to get false-positive results). Is it a valid requirement? In fact, the minimum p-value we can observe with $n=m=6$ is $\approx 0.001082$. Thus, with $\alpha = 0.001$, it’s impossible to get a positive result. Meanwhile, everything is correct from the technical point of view: since we can’t get any positive results, the false positive rate is exactly zero which is less than $0.001$. However, it’s definitely not something that we want: with this setup the test becomes useless because it always provides negative results regardless of the input data.

This brings an important question: what is the minimum meaningful statistical level that we can require for the one-sided Mann–Whitney U test knowing the sample sizes?
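The number from the example can be reproduced with a short sketch. It assumes the exact one-sided test without ties: under the null hypothesis, all $C_{n+m}^n$ ways of assigning ranks to the first sample are equally likely, and only one of them is the most extreme, so the smallest attainable p-value is $1 / C_{n+m}^n$. The function name is hypothetical.

```python
import math


def min_one_sided_p_value(n, m):
    """Smallest attainable p-value of the exact one-sided test (no ties)."""
    return 1 / math.comb(n + m, n)


print(min_one_sided_p_value(6, 6))  # 1/924, about 0.001082
```

Any requested level $\alpha$ below this value can never be reached for the given sample sizes.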

Fence-based outlier detectors, Part 2

Tue, 05 Apr 2022 00:00:00 +0000

In the previous post, I discussed different fence-based outlier detectors. In this post, I show some examples of these detectors with different parameters.

Fence-based outlier detectors, Part 1

Tue, 29 Mar 2022 00:00:00 +0000

In previous posts, I discussed properties of Tukey’s fences and asymmetric decile-based outlier detector (Part 1, Part 2). In this post, I discuss the generalization of fence-based outlier detectors.

Publication announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'

Tue, 22 Mar 2022 00:00:00 +0000

Since the beginning of the previous year, I have been working on building a quantile estimator that provides an optimal trade-off between statistical efficiency and robustness. At the end of the year, I published the corresponding preprint where I presented a description of such an estimator: arXiv:2111.11776 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-thdqe.

Finally, the paper was published in Communications in Statistics - Simulation and Computation. You can cite it as follows:

Andrey Akinshin (2022) Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width, Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2022.2050396

Asymmetric decile-based outlier detector, Part 2

Tue, 15 Mar 2022 00:00:00 +0000

In the previous post, I suggested an asymmetric decile-based outlier detector as an alternative to Tukey’s fences. In this post, we run some numerical simulations to check out the suggested outlier detector in action.

Asymmetric decile-based outlier detector, Part 1

Tue, 08 Mar 2022 00:00:00 +0000

In the previous post, I covered some problems with the outlier detector based on Tukey fences. Mainly, I discussed the probability of observing outliers using Tukey’s fences with different factors under different distributions. However, it’s not the only problem with this approach.

Since Tukey’s fences are based on quartiles, under multimodal distributions, we could get a situation when 50% of all sample elements are marked as outliers. Also, Tukey’s fences are designed for symmetric distributions, so we could get strange results with asymmetric distributions.

In this post, I want to suggest an asymmetric outlier detector based on deciles which mitigates this problem.

Probability of observing outliers using Tukey's fences

Tue, 01 Mar 2022 00:00:00 +0000

Tukey’s fences is one of the most popular simple outlier detectors for one-dimensional number arrays. This approach assumes that for a given sample, we calculate the first and third quartiles ($Q_1$ and $Q_3$), and mark all the sample elements outside the interval

$$ [Q_1 - k (Q_3 - Q_1),\, Q_3 + k (Q_3 - Q_1)] $$

as outliers. A typical recommendation for $k$ is $1.5$ for “regular” outliers and $3.0$ for “far outliers”. Here is a box plot example for a sample taken from the standard normal distribution (sample size is $1000$):

As we can see, 11 elements were marked as outliers (shown as dots). Is it an expected result or not? The answer depends on your goals. There is no single definition of an outlier. In fact, the chosen outlier detector provides a unique outlier definition.

In my applications, I typically consider outliers as rare events that should be investigated. When I detect too many outliers, all such reports become useless noise. For example, on the above image, I wouldn’t treat any of the sample elements as outliers. However, if we add $10.0$ to this sample, this element is an obvious outlier (which will be the only one):

Thus, an important property of an outlier detector is the “false positive rate”: the percentage of samples with detected outliers which I wouldn’t treat as outliers. In this post, I perform numerical simulations that show the probability of observing outliers using Tukey’s fences with different $k$ values.
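The fence rule itself fits in a few lines. The sketch below is an illustration under assumptions: the function name is hypothetical, and the quartiles are computed with the "inclusive" method from the Python standard library, which is only one of several common quartile definitions.

```python
import statistics


def tukey_outliers(x, k=1.5):
    """Elements outside [Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1)]."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in x if v < lower or v > upper]


print(tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9]))        # no outliers
print(tukey_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))   # only 100 is reported
```

Running such a function on many simulated samples and counting how often the result is non-empty is one way to estimate the probability discussed above.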

Gamma effect size powered by the middle non-zero quantile absolute deviation

Tue, 22 Feb 2022 00:00:00 +0000

In previous posts, I covered the concept of the gamma effect size. It’s a nonparametric effect size which is consistent with Cohen’s d under the normal distribution. However, the original definition has drawbacks: this statistic becomes zero if half of the sample elements are equal to each other. Last time, I suggested a workaround for this problem: we can replace the median absolute deviation by the quantile absolute deviation. Unfortunately, this trick requires parameter tuning: we should choose a proper quantile position to make this approach work. Today I want to suggest a strategy that provides a way to make a generic choice: we can use the middle non-zero quantile absolute deviation.

Middle non-zero quantile absolute deviation

Tue, 15 Feb 2022 00:00:00 +0000

Median absolute deviation ($\operatorname{MAD}$) around the median is a popular robust measure of statistical dispersion. Unfortunately, if we work with discrete distributions, we could get zero $\operatorname{MAD}$ values. It could bring some problems if we use $\operatorname{MAD}$ as a denominator. Such a problem is also relevant to some other quantile-based measures of dispersion like interquartile range ($\operatorname{IQR}$).

This problem could be solved using the quantile absolute deviation around the median. However, it’s not always clear how to choose the right quantile to estimate. In this post, I’m going to suggest a way of choosing it that is consistent with the classic $\operatorname{MAD}$ under continuous distributions (and samples without tied values).

Unbiased median absolute deviation based on the trimmed Harrell-Davis quantile estimator

Tue, 08 Feb 2022 00:00:00 +0000

The median absolute deviation ($\operatorname{MAD}$) is a robust measure of scale. For a sample $x = \{ x_1, x_2, \ldots, x_n \}$, it’s defined as follows:

$$ \operatorname{MAD}_n = C_n \cdot \operatorname{median}(|x - \operatorname{median}(x)|) $$

where $\operatorname{median}$ is a median estimator, $C_n$ is a scale factor. Using the right scale factor, we can use $\operatorname{MAD}$ as a consistent estimator for the estimation of the standard deviation under the normal distribution. For huge samples, we can use the asymptotic value of $C_n$ which is

$$ C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. $$

For small samples, we should use adjusted values $C_n$ which depend on the sample size. However, $C_n$ depends not only on the sample size but also on the median estimator. I have already covered how to obtain these values for the traditional median estimator and the Harrell-Davis median estimator. It’s time to get the $C_n$ values for the trimmed Harrell-Davis median estimator.
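For reference, here is a minimal sketch of the definition with the traditional sample median and the asymptotic scale factor $C_\infty$. It does not include the small-sample factors $C_n$ or the trimmed Harrell-Davis median; the names `C_INF` and `mad` are hypothetical.

```python
import statistics

# Asymptotic scale factor: 1 / Phi^{-1}(3/4), about 1.4826
C_INF = 1 / statistics.NormalDist().inv_cdf(0.75)


def mad(x, scale=C_INF):
    """Scaled median absolute deviation around the sample median."""
    center = statistics.median(x)
    return scale * statistics.median(abs(v - center) for v in x)


print(C_INF)
print(mad([1, 2, 3, 4, 100]))  # the extreme value 100 barely matters
```

Passing a different `scale` is where a sample-size-specific factor $C_n$ would be plugged in.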

Median absolute deviation vs. Shamos estimator

Tue, 01 Feb 2022 00:00:00 +0000

There are multiple ways to estimate statistical dispersion. The standard deviation is the most popular one, but it’s not robust: a single outlier could heavily corrupt the results. Fortunately, we have robust measures of dispersion like the median absolute deviation and the Shamos estimator. In this post, we perform numerical simulations and compare these two estimators on different distributions and sample sizes.

Moving extended P² quantile estimator

Tue, 25 Jan 2022 00:00:00 +0000

In the previous posts, I discussed the P² quantile estimator (a sequential estimator which takes $O(1)$ memory and estimates a single predefined quantile), the moving P² quantile estimator (a moving modification of P² which estimates quantiles within the moving window), and the extended P² quantile estimator (a sequential estimator which takes $O(m)$ memory and estimates $m$ predefined quantiles).

Now it’s time to build the moving modification of the extended P² quantile estimator which estimates $m$ predefined quantiles using $O(m)$ memory within the moving window.

Extended P² quantile estimator

Tue, 18 Jan 2022 00:00:00 +0000

I already covered the P² quantile estimator and its possible implementation improvements in several blog posts. This sequential estimator uses $O(1)$ memory and allows estimating a single predefined quantile. Now it’s time to discuss the extended P² quantile estimator that allows estimating multiple predefined quantiles. This extended version was suggested in the paper “Simultaneous estimation of several percentiles”. In this post, we briefly discuss the approach from this paper and how we can improve its implementation.

P² quantile estimator marker adjusting order

Tue, 11 Jan 2022 00:00:00 +0000

I have already written a few blog posts about the P² quantile estimator (which is a sequential estimator that uses $O(1)$ memory).

In this post, we continue improving the P² implementation so that it gives better estimations for streams with a small number of elements.

P² quantile estimator initialization strategy

Tue, 04 Jan 2022 00:00:00 +0000

Update: the estimator accuracy could be improved using a bunch of patches.

The P² quantile estimator is a sequential estimator that uses $O(1)$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I have already written a few blog posts about it.

I tried this estimator in various contexts, and it shows pretty decent results. However, recently I stumbled on a corner case: if we want to estimate an extreme quantile ($p < 0.1$ or $p > 0.9$), this estimator provides inaccurate results on short number streams ($n < 10$). While it looks like a minor issue, it would be nice to fix it. In this post, we briefly discuss choosing a better initialization strategy to work around this problem.

Misleading geometric mean

Tue, 28 Dec 2021 00:00:00 +0000

There are multiple ways to compute the “average” value of an array of numbers. One such way is the geometric mean. For a sample $x = \{ x_1, x_2, \ldots, x_n \}$, the geometric mean is defined as follows:

$$ \operatorname{GM}(x) = \sqrt[n]{x_1 x_2 \ldots x_n} $$

This approach is widely recommended for some specific tasks. Let’s say we want to compare the performance of two machines $M_x$ and $M_y$. In order to do this, we design a set of benchmarks $b = \{b_1, b_2, \ldots, b_n \}$ and obtain two sets of measurements $x = \{ x_1, x_2, \ldots, x_n \}$ and $y = \{ y_1, y_2, \ldots, y_n \}$. Once we have these two samples, we may have a desire to express the difference between two machines as a single number and get a conclusion like “Machine $M_y$ works two times faster than $M_x$.” I think that this approach is flawed because such a difference couldn’t be expressed as a single number: the result heavily depends on the workloads that we analyze. For example, imagine that $M_x$ is a machine with HDD and fast CPU, $M_y$ is a machine with SSD and slow CPU. In this case, $M_x$ could be faster on CPU-bound workloads while $M_y$ could be faster on disk-bound workloads. I really like this summary from “Notes on Calculating Computer Performance” by Bruce Jacob and Trevor Mudge (in the same paper, the authors criticize the approach with the geometric mean):

Performance is therefore not a single number, but really a collection of implications. It is nothing more or less than the measure of how much time we save running our tests on the machines in question. If someone else has similar needs to ours, our performance numbers will be useful to them. However, two people with different sets of criteria will likely walk away with two completely different performance numbers for the same machine.

However, some other authors (e.g., “How not to lie with statistics: the correct way to summarize benchmark results”) actually recommend using the geometric mean to get such a number that describes the performance ratio of $M_x$ and $M_y$. I have to admit that the geometric mean could provide a reasonable result in some simple cases. Indeed, on normalized numbers, it works much better than the arithmetic mean (that provides a meaningless result) because of its nice property: $\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)$. However, it doesn’t work properly in the general case. Firstly, the desire to express the difference between two machines as a single number is flawed: the result heavily depends on the chosen workloads. Secondly, the performance of a single benchmark $b_i$ couldn’t be described as a single number $x_i$: we should consider the whole performance distributions. In order to describe the difference between two distributions, we could consider the shift and ratio functions (that work much better than the shift and ratio distributions).

Even if you consider a pretty homogeneous set of benchmarks and all the distributions are pretty narrow, the geometric mean has severe drawbacks that you should keep in mind. In this post, I briefly cover some of these drawbacks and highlight problems that you may have if you use this metric.
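The property $\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)$ and the problem with the arithmetic mean of normalized numbers can be checked on a toy example. The measurements below are made up purely for illustration.

```python
import statistics

x = [1.0, 4.0]  # hypothetical timings of two benchmarks on machine M_x
y = [2.0, 2.0]  # hypothetical timings of the same benchmarks on machine M_y

x_over_y = [a / b for a, b in zip(x, y)]
y_over_x = [b / a for a, b in zip(x, y)]

# Arithmetic mean of normalized numbers: both values exceed 1,
# so each machine looks slower than the other one.
print(statistics.mean(x_over_y), statistics.mean(y_over_x))

# Geometric mean: GM(x_i / y_i) matches GM(x) / GM(y),
# and the two normalizations agree with each other.
gm = statistics.geometric_mean
print(gm(x_over_y), gm(x) / gm(y), gm(y_over_x))
```

This consistency is the reason the geometric mean is often recommended for normalized results; it does not address the drawbacks discussed in the post.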

Matching quantile sets using likelihood based on the binomial coefficients

Tue, 21 Dec 2021 00:00:00 +0000

Let’s say we have a distribution $X$ that is given by its $s$-quantile values:

$$ q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1}) $$

where $Q_X$ is the quantile function of $X$, $p_j = j / s$.

We also have a sample $y = \{y_1, y_2, \ldots, y_n \}$ that is given by its $s$-quantile estimations:

$$ q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}), $$

where $Q_y$ is the quantile estimation function for sample $y$. We also assume that $q_{y_0} = \min(y)$, $q_{y_s} = \max(y)$.

We want to know the likelihood of “$y$ is drawn from $X$”. In this post, I want to suggest a nice way to do this using the binomial coefficients.

Ratio function vs. ratio distribution

Tue, 14 Dec 2021 00:00:00 +0000

Let’s say we have two distributions $X$ and $Y$. In the previous post, we discussed how to express the “absolute difference” between them using the shift function and the shift distribution. Now let’s discuss how to express the “relative difference” between them. This abstract term also could be expressed in various ways. My favorite approach is to build the ratio function. In order to do this, for each quantile $p$, we should calculate $Q_Y(p)/Q_X(p)$ where $Q$ is the quantile function. However, some people prefer using the ratio distribution $Y/X$. While both approaches may provide similar results for narrow positive non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.

Shift function vs. shift distribution

Tue, 07 Dec 2021 00:00:00 +0000

Let’s say we have two distributions $X$ and $Y$, and we want to express the “absolute difference” between them. This abstract term could be expressed in various ways. My favorite approach is to build the Doksum’s shift function. In order to do this, for each quantile $p$, we should calculate $Q_Y(p)-Q_X(p)$ where $Q$ is the quantile function. However, some people prefer using the shift distribution $Y-X$. While both approaches may provide similar results for narrow non-overlapping distributions, they are not equivalent in the general case. In this post, we briefly consider examples of both approaches.
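The difference between the two approaches can be seen on a tiny empirical sketch. It works with samples instead of true distributions, uses linear interpolation between order statistics as the quantile estimator, and approximates the shift distribution by all pairwise differences (which assumes independent $X$ and $Y$). All names are hypothetical.

```python
import math


def quantile(sorted_x, p):
    """Linear interpolation between neighboring order statistics."""
    h = (len(sorted_x) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (h - lo) * (sorted_x[hi] - sorted_x[lo])


def shift_function(x, y, probs):
    """Q_Y(p) - Q_X(p) for each p in probs."""
    xs, ys = sorted(x), sorted(y)
    return [quantile(ys, p) - quantile(xs, p) for p in probs]


def shift_distribution_quantiles(x, y, probs):
    """Quantiles of the pairwise differences y_j - x_i."""
    diffs = sorted(b - a for a in x for b in y)
    return [quantile(diffs, p) for p in probs]


probs = [0.1, 0.5, 0.9]
x = y = [1.0, 2.0, 3.0, 4.0, 5.0]
print(shift_function(x, y, probs))                # zero at every quantile
print(shift_distribution_quantiles(x, y, probs))  # spread around zero
```

For two identical samples, the shift function reports no difference at any quantile, while the shift distribution has a non-trivial spread around zero.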

Preprint announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'

Tue, 30 Nov 2021 00:00:00 +0000

Update: the final paper was published in Communications in Statistics - Simulation and Computation (DOI: 10.1080/03610918.2022.2050396).

Since the beginning of this year, I have been working on building a quantile estimator that provides an optimal trade-off between statistical efficiency and robustness. Finally, I have built such an estimator. A paper preprint is available on arXiv: arXiv:2111.11776 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-thdqe. You can cite it as follows:

Andrey Akinshin (2021) Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width, arXiv:2111.11776

Non-normal median sampling distribution

Tue, 23 Nov 2021 00:00:00 +0000

Let’s consider the classic sample median. If a sample is sorted and the number of sample elements is odd, the median is the middle element. In the case of an even number of sample elements, the median is an arithmetic average of the two middle elements.

Now let’s say we randomly take many samples from the same distribution and calculate the median for each of them. Next, we build a sampling distribution based on these median values. There is a well-known fact that this distribution is asymptotically normal with mean $M$ and variance $1/(4nf^2(M))$, where $n$ is the number of elements in samples, $f$ is the probability density function of the original distribution, and $M$ is the true median of the original distribution.

Unfortunately, if we try to build such sampling distributions in practice, we may see that they are not always normal. There are some corner cases that prevent us from using the normal model in general. If you implement general routines that analyze the median behavior, you should keep such cases in mind. In this post, we briefly talk about some of these cases.
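For a well-behaved case, the asymptotic variance can be compared with a simulation. The sketch below uses the standard normal distribution, where $M = 0$ and $f(M) = 1/\sqrt{2\pi}$, so $1/(4nf^2(M)) = \pi/(2n)$. The function name, sample size, number of replications, and seed are arbitrary assumptions.

```python
import math
import random
import statistics


def median_variance(n, replications=2000, seed=1):
    """Empirical variance of the sample median over simulated normal samples."""
    rng = random.Random(seed)
    medians = [
        statistics.median(rng.gauss(0, 1) for _ in range(n))
        for _ in range(replications)
    ]
    return statistics.variance(medians)


n = 101
asymptotic = math.pi / (2 * n)  # 1 / (4 n f(M)^2) with f(0) = 1 / sqrt(2 pi)
print(median_variance(n), asymptotic)  # the two values should be close
```

The same experiment on distributions with zero or discontinuous density at the median is a simple way to observe the corner cases mentioned above.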

Misleading kurtosis

Tue, 16 Nov 2021 00:00:00 +0000

I already discussed the misleadingness of such metrics as the standard deviation and skewness. It’s time to discuss the misleadingness of the measure of tailedness: kurtosis (which, sometimes, could be incorrectly interpreted as a measure of peakedness). Typically, the concept of kurtosis is explained with the help of images like this:

Unfortunately, the raw kurtosis value may provide wrong insights about distribution properties. In this post, we briefly discuss the sources of its misleadingness:

There are multiple definitions of kurtosis. The most significant confusion arises between “kurtosis” and “excess kurtosis,” but there are other definitions of this measure.
Kurtosis may work fine for unimodal distributions, but its interpretation is less clear for multimodal distributions.
The classic definition of kurtosis is not robust: it could be easily spoiled by extreme outliers.
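Two of these points are easy to illustrate with the moment-based definition $m_4 / m_2^2$. This is a sketch of one particular definition (function names are hypothetical); statistical packages may use other, bias-corrected formulas.

```python
def kurtosis(x):
    """Moment-based kurtosis m4 / m2^2 (one of several definitions)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2


def excess_kurtosis(x):
    """Kurtosis shifted so that the normal distribution corresponds to zero."""
    return kurtosis(x) - 3.0


sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(kurtosis(sample), excess_kurtosis(sample))  # the two values differ by 3
print(kurtosis(sample + [100.0]))                 # one extreme value dominates
```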

Misleading skewness

Tue, 09 Nov 2021 00:00:00 +0000

Skewness is a commonly used measure of the asymmetry of the probability distributions. A typical skewness interpretation comes down to an image like this:

It looks extremely simple: using the skewness sign, we get an idea of the distribution form and the arrangement of the mean and the median. Unfortunately, it doesn’t always work as expected. Skewness estimation could be a highly misleading metric (even more misleading than the standard deviation). In this post, I discuss four sources of its misleadingness:

“Skewness” is a generic term; it has multiple definitions. When a skewness value is presented, you can’t always guess the underlying equation without additional details.
Skewness is “designed” for unimodal distributions; it’s meaningless in the case of multimodality.
Most default skewness definitions are not robust: a single outlier could completely distort the skewness value.
We can’t make conclusions about the locations of the mean and the median based on the skewness sign.

Greenwald-Khanna quantile estimator

Tue, 02 Nov 2021 00:00:00 +0000

The Greenwald-Khanna quantile estimator is a classic sequential quantile estimator which has the following features:

It allows estimating quantiles with respect to the given precision $\epsilon$.
It requires $O(\frac{1}{\epsilon} \log(\epsilon N))$ memory in the worst case.
It doesn’t require knowledge of the total number of elements in the sequence and the positions of the requested quantiles.

In this post, I briefly explain the basic idea of the underlying data structure, and share a copy-pastable C# implementation. At the end of the post, I discuss some important implementation decisions that are unclear from the original paper, but heavily affect the estimator accuracy.

P² quantile estimator rounding issue

Tue, 26 Oct 2021 00:00:00 +0000

Update: the estimator accuracy could be improved using a bunch of patches.

The P² quantile estimator is a sequential estimator that uses $O(1)$ memory. Thus, for the given sequence of numbers, it allows estimating quantiles without storing values. I already wrote a blog post about this approach and added its implementation in perfolizer. Recently, I got a bug report that revealed a flaw of the original paper. In this post, I’m going to briefly discuss this issue and the corresponding fix.

Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width

Tue, 19 Oct 2021 00:00:00 +0000

Traditional quantile estimators that are based on one or two order statistics are a common way to estimate distribution quantiles based on the given samples. These estimators are robust, but their statistical efficiency is not always good enough. A more efficient alternative is the Harrell-Davis quantile estimator which uses a weighted sum of all order statistics. Whereas this approach provides more accurate estimations for the light-tailed distributions, it’s not robust. To be able to customize the trade-off between statistical efficiency and robustness, we could consider a trimmed modification of the Harrell-Davis quantile estimator. In this approach, we discard order statistics with low weights according to the highest density interval of the beta distribution.

Optimal window of the trimmed Harrell-Davis quantile estimator, Part 2: Trying Planck-taper window

Tue, 12 Oct 2021 00:00:00 +0000

In the previous post, I discussed the problem of non-smooth quantile-respectful density estimation (QRDE) which is generated by the trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width. I assumed that non-smoothness was caused by a non-smooth rectangular window which was used to build the truncated beta distribution. In this post, we are going to try another option: the Planck-taper window.

Optimal window of the trimmed Harrell-Davis quantile estimator, Part 1: Problems with the rectangular window

Tue, 05 Oct 2021 00:00:00 +0000

In the previous post, we have obtained a nice version of the trimmed Harrell-Davis quantile estimator which provides an opportunity to get a nice trade-off between robustness and statistical efficiency of quantile estimations. Unfortunately, it has a severe drawback. If we build a quantile-respectful density estimation based on the suggested estimator, we won’t get a smooth density function as in the case of the classic Harrell-Davis quantile estimator:

In this blog post series, we are going to find a way to improve the trimmed Harrell-Davis quantile estimator so that it gives a smooth density function and keeps its advantages in terms of robustness and statistical efficiency.

Beta distribution highest density interval of the given width

Tue, 28 Sep 2021 00:00:00 +0000

In one of the previous posts, I discussed the idea of the trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width. Since the Harrell-Davis quantile estimator uses the Beta distribution, we should be able to find the beta distribution highest density interval of the given width. In this post, I will show how to do this.

Quantile estimators based on k order statistics, Part 8: Winsorized Harrell-Davis quantile estimator

Tue, 21 Sep 2021 00:00:00 +0000

In the previous post, we have discussed the trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of size $\sqrt{n}/n$. This quantile estimator showed a decent level of statistical efficiency. However, the research wouldn’t be complete without comparison with the winsorized modification. Let’s fix it!

Quantile estimators based on k order statistics, Part 7: Optimal threshold for the trimmed Harrell-Davis quantile estimator

Tue, 14 Sep 2021 00:00:00 +0000

In the previous post, we have obtained a nice quantile estimator. To be specific, we considered a trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of the given size. The interval size is a parameter that controls the trade-off between statistical efficiency and robustness. While it’s nice to have the ability to control this trade-off, there is also a need for the default value, which could be used as a starting point when we have neither estimator breakdown point requirements nor prior knowledge about distribution properties.

After a series of unsuccessful attempts, it seems that I have found an acceptable solution. We should build the new estimator based on $\sqrt{n}/n$ order statistics. In this post, I’m going to briefly explain the idea behind the suggested estimator and share some numerical simulations that compare the proposed estimator and the classic Harrell-Davis quantile estimator.

Quantile estimators based on k order statistics, Part 6: Continuous trimmed Harrell-Davis quantile estimator

Tue, 07 Sep 2021 00:00:00 +0000

In my previous post, I tried the idea of using the trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of the given width. The width was defined so that it covers exactly k order statistics (the width equals $(k-1)/n$). I was pretty satisfied with the result and decided to continue evolving this approach. While “k order statistics” is a good mental model that describes the trimmed interval, it doesn’t actually require an integer k. In fact, we can use any real number as the trimming percentage.

In this post, we are going to perform numerical simulations that check the statistical efficiency of the trimmed Harrell-Davis quantile estimator with different trimming percentages.

Quantile estimators based on k order statistics, Part 5: Improving trimmed Harrell-Davis quantile estimator

Tue, 31 Aug 2021 00:00:00 +0000

During the last several months, I have been experimenting with different variations of the trimmed Harrell-Davis quantile estimator. My original idea of using the highest density interval based on the fixed area percentage (e.g., HDI 95% or HDI 99%) led to a set of problems with overtrimming. I tried to solve them with a manually customized trimming strategy, but this approach turned out to be too inconvenient; it was too hard to come up with optimal thresholds. One of the main problems was about the suboptimal number of elements that we actually aggregate to obtain the quantile estimation. So, I decided to try an approach that involves exactly k order statistics. The idea looked promising, but numerical simulations didn’t show the appropriate efficiency level.

This bothered me the whole week. It sounded so reasonable to trim the Harrell-Davis quantile estimator using exactly k order statistics. Why didn’t this work as expected? Finally, I have found a fatal flaw in my previous approach: while it was a good idea to fix the size of the trimming window, I mistakenly chose its location following the equation from the Hyndman-Fan Type 7 quantile estimator!

In this post, we fix this problem and try another modification of the trimmed Harrell-Davis quantile estimator based on k order statistics and highest density intervals at the same time.

Quantile estimators based on k order statistics, Part 4: Adopting trimmed Harrell-Davis quantile estimator

Tue, 24 Aug 2021 00:00:00 +0000

In the previous posts, I discussed various aspects of quantile estimators based on k order statistics. I already tried a few weight functions that aggregate the sample values to the quantile estimators (see posts about an extension of the Hyndman-Fan Type 7 equation and about adjusted regularized incomplete beta function). In this post, I continue my experiments and try to adopt the trimmed modifications of the Harrell-Davis quantile estimator to this approach.

Quantile estimators based on k order statistics, Part 3: Playing with the Beta function

Tue, 17 Aug 2021 00:00:00 +0000

In the previous two posts, I discussed the idea of quantile estimators based on k order statistics. I already covered the motivation behind this idea and the statistical efficiency of such estimators using the extended Hyndman-Fan equations as a weight function. Now it’s time to experiment with the Beta function as a primary way to aggregate k order statistics into a single quantile estimation!

Quantile estimators based on k order statistics, Part 2: Extending Hyndman-Fan equations

Tue, 10 Aug 2021 00:00:00 +0000

In the previous post, I described the idea of using quantile estimators based on k order statistics. Potentially, such estimators could be more robust than estimators based on all sample elements (like Harrell-Davis, Sfakianakis-Verginis, or Navruz-Özdemir) and more statistically efficient than traditional quantile estimators (based on 1 or 2 order statistics). Moreover, we should be able to control this trade-off based on the business requirements (e.g., setting the desired breakdown point).

The only challenging thing here is choosing the weight function that aggregates k order statistics to a single quantile estimation. We are going to try several options, perform Monte-Carlo simulations for each of them, and compare the results. A reasonable starting point is an extension of the traditional quantile estimators. In this post, we are going to extend the Hyndman-Fan Type 7 quantile estimator (nowadays, it’s one of the most popular estimators). It estimates quantiles as a linear interpolation of two subsequent order statistics. We are going to make some modifications, so a new version is going to be based on k order statistics.
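To make the starting point concrete, here is a minimal Python sketch of the classic Hyndman-Fan Type 7 estimator mentioned above (the two-order-statistics version, not the k-order-statistics extension discussed in the post). The function name `hf7_quantile` is an illustrative choice.

```python
import math


def hf7_quantile(sample, p):
    """Hyndman-Fan Type 7: linear interpolation of two adjacent order statistics."""
    if not sample:
        raise ValueError("sample must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(sample)
    n = len(xs)
    h = (n - 1) * p            # fractional zero-based position in the sorted sample
    lo = math.floor(h)         # index of the lower order statistic
    hi = min(lo + 1, n - 1)    # index of the upper order statistic
    frac = h - lo              # interpolation weight of the upper order statistic
    return xs[lo] + frac * (xs[hi] - xs[lo])


print(hf7_quantile([4, 5, 6, 7], 0.5))  # 5.5
```

Only two order statistics get non-zero weights here, which is the property an extension to k order statistics would relax.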

Spoiler: this approach doesn’t seem like an optimal one. I’m pretty disappointed with its statistical efficiency on samples from light-tailed distributions. So, what’s the point of writing a blog post about an inefficient approach? Because of the following reasons:

I believe it’s crucial to share negative results. Sometimes, knowledge about approaches that don’t work could be more important than knowledge about more effective techniques. Negative results give you a broader view of the problem and protect you from wasting your time on potentially promising (but not so useful) ideas.
Negative results improve research completeness. When we present an approach, it’s essential to not only show why it solves problems well, but also why it solves problems better than other similar approaches.
While I wouldn’t recommend my extension of the Hyndman-Fan Type 7 quantile estimator to the k order statistics case as the default quantile estimator, there are some specific cases where it could be useful. For example, if we estimate the median based on small samples from a symmetric light-tailed distribution, it could outperform not only the original version but also the Harrell-Davis quantile estimator. The “negativity” of the negative results always exists in a specific context. So, there may be cases when negative results for the general case transform to positive results for a particular niche problem.
Finally, it’s my personal blog, so I have the freedom to write on any topic I like. My blog posts are not publications to scientific journals (which typically don’t welcome negative results), but rather research notes about conducted experiments. It’s important for me to keep records of all the experiments I perform regardless of the usefulness of the results.

So, let’s briefly look at the results of this not-so-useful approach.

Quantile estimators based on k order statistics, Part 1: Motivation

Tue, 03 Aug 2021 00:00:00 +0000

It’s not easy to choose a good quantile estimator. In my previous posts, I considered several groups of quantile estimators:

Quantile estimators based on 1 or 2 order statistics (Hyndman-Fan Types 1-9)
Quantile estimators based on all order statistics (the Harrell-Davis quantile estimator, the Sfakianakis-Verginis quantile estimator, and the Navruz-Özdemir quantile estimator)
Quantile estimators based on a variable number of order statistics (the trimmed and winsorized modifications of the Harrell-Davis quantile estimator)

Unfortunately, all of these estimators have significant drawbacks (e.g., poor statistical efficiency or poor robustness). In this post, I want to discuss all of the advantages and disadvantages of each approach and suggest another family of quantile estimators that are based on k order statistics.

Avoiding over-trimming with the trimmed Harrell-Davis quantile estimator

Tue, 27 Jul 2021 00:00:00 +0000

Previously, I already discussed the trimmed modification of the Harrell-Davis quantile estimator several times. I performed several numerical simulations that compare the statistical efficiency of this estimator with the efficiency of the classic Harrell-Davis quantile estimator (HDQE) and its winsorized modification; I showed how we can improve the efficiency using custom trimming strategies and how to choose a good trimming threshold value.

In the heavy-tailed cases, the trimmed HDQE provides better estimations than the classic HDQE because of its higher breakdown point. However, in the light-tailed cases, we could get efficiency that is worse than the baseline Hyndman-Fan Type 7 (HF7) quantile estimator. In many cases, this happens because of over-trimming. If the trimming percentage is too high or if the evaluated quantile is too far from the median, the trimming strategy based on the highest-density interval may lead to an estimation that is based on a single order statistic. In this case, we get an efficiency level similar to the Hyndman-Fan Type 1-3 quantile estimators (which are also based on a single order statistic). In the light-tailed case, such a result is less preferable than Hyndman-Fan Type 4-9 quantile estimators (which are based on two subsequent order statistics).

In order to improve the situation, we could introduce the lower bound for the number of order statistics that contribute to the final quantile estimations. In this post, I look at some numerical simulations that compare trimmed HDQEs with different lower bounds.

Optimal threshold of the trimmed Harrell-Davis quantile estimator

Tue, 20 Jul 2021 00:00:00 +0000

The traditional quantile estimators (which are based on 1 or 2 order statistics) have great robustness. However, the statistical efficiency of these estimators is not so great. The Harrell-Davis quantile estimator has much better efficiency (at least in the light-tailed case), but it’s not robust (because it calculates a weighted sum of all sample values). I already wrote a post about the trimmed Harrell-Davis quantile estimator: this approach suggests dropping some of the low-weight sample values to improve robustness (keeping good statistical efficiency). I also performed numerical simulations that compare the efficiency of the original Harrell-Davis quantile estimator against its trimmed and winsorized modifications. It’s time to discuss how to choose the optimal trimming threshold and how it affects the estimator efficiency.

Estimating quantile confidence intervals: Maritz-Jarrett vs. jackknife

Tue, 13 Jul 2021 00:00:00 +0000

When it comes to estimating quantiles of the given sample, my estimator of choice is the Harrell-Davis quantile estimator (to be more specific, its trimmed version). If I need to get a confidence interval for the obtained quantiles, I use the Maritz-Jarrett method because it provides a decent coverage percentage. Both approaches work pretty nicely together.

However, in the original paper by Harrell and Davis (1982), the authors suggest using the jackknife variance estimator in order to get the confidence intervals. The obvious question here is which approach is better: the Maritz-Jarrett method or the jackknife estimator? In this post, I perform a numerical simulation that compares both techniques using different distributions.

Using Kish's effective sample size with weighted quantiles

Tue, 06 Jul 2021 00:00:00 +0000

In my previous posts, I described how to calculate weighted quantiles and their confidence intervals using the Harrell-Davis quantile estimator. This powerful technique allows applying quantile exponential smoothing and dispersion exponential smoothing for time series in order to get its moving properties.

When we work with weighted samples, we need a way to calculate the effective sample size. Previously, I used the sum of all weights normalized by the maximum weight. In most cases, it worked OK.

Recently, Ben Jann pointed out that it would be better to use Kish’s formula to calculate the effective sample size. In this post, you will find the formula and a few numerical simulations that illustrate the actual impact of the underlying sample size formula.
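As a quick illustration, the two effective sample size definitions can be sketched in Python as follows. Kish’s formula is taken in its usual form, $n_\textrm{eff} = (\sum w_i)^2 / \sum w_i^2$; the function names are illustrative assumptions.

```python
def ess_max_normalized(weights):
    """Effective sample size as the sum of weights normalized by the maximum weight."""
    if not weights or max(weights) <= 0:
        raise ValueError("weights must contain at least one positive value")
    return sum(weights) / max(weights)


def ess_kish(weights):
    """Kish's effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    if not weights or max(weights) <= 0:
        raise ValueError("weights must contain at least one positive value")
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


print(ess_kish([1, 1, 1, 1]))  # 4.0
```

For equal weights, both definitions return the plain sample size; they diverge once the weights become unequal (e.g., exponentially decreasing weights).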

Partial binning compression of performance series

Tue, 29 Jun 2021 00:00:00 +0000

Let’s start with a problem from real life. Imagine we have thousands of application components that should be initialized. We care about the total initialization time of the whole application, so we want to automatically track the slowest components using a continuous integration (CI) system. The easiest way to do it is to measure the initialization time of each component in each CI build and save all the measurements to a database. Unfortunately, if the total number of components is huge, the overall artifact size may be quite extensive. Thus, this approach may introduce an unwanted negative impact on the database size and data processing time.

However, we don’t actually need all the measurements. We want to track only the slowest components. Typically, it’s possible to introduce a reasonable threshold that defines such components. For example, we can say that all components that are initialized in less than 1ms are “fast enough,” so there is no need to know the exact initialization time for them. Since these time values are insignificant, we can just omit all the measurements below the given threshold. This allows us to significantly reduce the data traffic without losing any important information.

The suggested trick can be named partial binning compression. Indeed, we introduce a single bin (perform binning) and omit all the values inside this bin (perform compression). On the other hand, we don’t build an honest histogram since we keep all the raw values outside the given bin (the binning is partial).
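The idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the dictionary layout, and the decision to keep the number of omitted values are hypothetical and not taken from the post.

```python
def partial_binning_compress(measurements, threshold):
    """Keep raw values at or above the threshold; collapse the rest into a single bin."""
    kept = [x for x in measurements if x >= threshold]
    return {
        "threshold": threshold,
        # Assumption: we store how many values fell into the bin, not the values themselves.
        "below_count": len(measurements) - len(kept),
        "values": kept,
    }


# Initialization times in milliseconds; everything below 1 ms is "fast enough".
compressed = partial_binning_compress([0.2, 0.5, 3.0, 12.5, 0.9], threshold=1.0)
print(compressed["below_count"], compressed["values"])
```

Keeping the count of omitted values is optional, but it preserves the total number of components, which may be handy for sanity checks.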

Let’s discuss a few aspects of using partial binning compression.

Calculating gamma effect size for samples with zero median absolute deviation

Tue, 22 Jun 2021 00:00:00 +0000

In previous posts, I discussed the gamma effect size which is a Cohen’s d-consistent nonparametric and robust measure of the effect size. Also, I discussed various ways to customize this metric and adjust it to different kinds of business requirements. In this post, I want to briefly cover one more corner case that requires special adjustments. We are going to discuss the situation when the median absolute deviation is zero.

Discrete performance distributions

Tue, 15 Jun 2021 00:00:00 +0000

When we collect software performance measurements, we get a bunch of time intervals. Typically, we tend to interpret time values as continuous values. However, the obtained values are actually discrete due to the limited resolution of our measurement tool. In simple cases, we can treat these discrete values as continuous and get meaningful results. Unfortunately, discretization may produce strange phenomena like pseudo-multimodality or zero dispersion. If we want to set up a reliable system that automatically analyzes such distributions, we should be aware of such problems so we could correctly handle them.

In this post, I want to share a few discretization problems in real-life performance data sets (based on the Rider performance tests).

Customization of the nonparametric Cohen's d-consistent effect size

Tue, 08 Jun 2021 00:00:00 +0000

One year ago, I published a post called Nonparametric Cohen's d-consistent effect size. During this year, I got a lot of internal and external feedback from my own statistical experiments and people who tried to use the suggested approach. It seems that the nonparametric version of Cohen’s d works much better with real-life not-so-normal data. While the classic Cohen’s d based on the non-robust arithmetic mean and the non-robust standard deviation can be easily corrupted by a single outlier, my approach is much more resistant to unexpected extreme values. Also, it allows exploring the difference between specific quantiles of considered samples, which can be useful in the non-parametric case.

However, I wasn’t satisfied with the results of all of my experiments. While I still like the basic idea (replace the mean with the median; replace the standard deviation with the median absolute deviation), it turned out that the final results heavily depend on the used quantile estimator. To be more specific, the original Harrell-Davis quantile estimator is not always optimal; in most cases, it’s better to replace it with its trimmed modification. However, the particular choice of the quantile estimators depends on the situation. Also, the consistency constant for the median absolute deviation should be adjusted according to the current sample size and the used quantile estimator. Of course, it also can be replaced by other dispersion estimators that can be used as consistent estimators of the standard deviation.

In this post, I want to give a brief overview of possible customizations of the suggested metrics.

Robust alternative to statistical efficiency

Tue, 01 Jun 2021 00:00:00 +0000

Statistical efficiency is a common measure of the quality of an estimator. Typically, it’s expressed via the mean square error ($\operatorname{MSE}$). For the given estimator $T$ and the true parameter value $\theta$, the $\operatorname{MSE}$ can be expressed as follows:

$$ \operatorname{MSE}(T) = \operatorname{E}[(T-\theta)^2] $$

In numerical simulations, the $\operatorname{MSE}$ can’t be used as a robust metric because its breakdown point is zero (a corruption of a single measurement leads to a corrupted result). Typically, it’s not a problem for light-tailed distributions. Unfortunately, in the heavy-tailed case, the $\operatorname{MSE}$ becomes an unreliable and unreproducible metric because it can be easily spoiled by a single outlier.

I suggest an alternative way to compare statistical estimators. Instead of using non-robust $\operatorname{MSE}$, we can use robust quantile estimations of the absolute error distribution. In this post, I want to share numerical simulations that show a problem of irreproducible $\operatorname{MSE}$ values and how they can be replaced by reproducible quantile values.
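The contrast can be illustrated with a tiny Python sketch. It uses a hand-made list of estimations instead of a real simulation, and it takes the plain sample median of the absolute errors as one possible quantile; both choices are assumptions made for brevity.

```python
import statistics


def mse(estimations, true_value):
    """Mean square error of a set of estimations (non-robust)."""
    return statistics.fmean([(t - true_value) ** 2 for t in estimations])


def median_abs_error(estimations, true_value):
    """Median of the absolute error distribution (robust)."""
    return statistics.median([abs(t - true_value) for t in estimations])


clean = [9.8, 10.1, 10.0, 9.9, 10.2]
corrupted = [9.8, 10.1, 10.0, 9.9, 1000.0]  # a single extreme estimation

for name, data in (("clean", clean), ("corrupted", corrupted)):
    print(name, mse(data, 10.0), median_abs_error(data, 10.0))
```

A single extreme estimation inflates the $\operatorname{MSE}$ by several orders of magnitude, while the median of the absolute errors stays the same.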

Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategies

Tue, 25 May 2021 00:00:00 +0000

Let’s say we want to estimate the median based on a small sample ($3 \leq n \leq 7$) from a right-skewed heavy-tailed distribution with high statistical efficiency.

The traditional median estimator is the most robust estimator, but it’s not the most efficient one. Typically, the Harrell-Davis quantile estimator provides better efficiency, but it’s not robust (its breakdown point is zero), so it may have worse efficiency in the given case. The winsorized and trimmed modifications of the Harrell-Davis quantile estimator provide a good trade-off between efficiency and robustness, but they require a proper winsorizing/trimming rule. A reasonable choice of such a rule for medium-size samples is based on the highest density interval of the Beta function (as described here). Unfortunately, this approach may be suboptimal for small samples. E.g., if we use the 99% highest density interval to estimate the median, it starts to trim sample values only for $n \geq 8$.

In this post, we are going to discuss custom winsorizing/trimming strategies for special cases of the quantile estimation problem.

Comparing the efficiency of the Harrell-Davis, Sfakianakis-Verginis, and Navruz-Özdemir quantile estimators

Tue, 18 May 2021 00:00:00 +0000

In the previous posts, I discussed the statistical efficiency of different quantile estimators (Efficiency of the Harrell-Davis quantile estimator and Efficiency of the winsorized and trimmed Harrell-Davis quantile estimators).

In this post, I continue this research and compare the efficiency of the Harrell-Davis quantile estimator, the Sfakianakis-Verginis quantile estimators, and the Navruz-Özdemir quantile estimator.

Dispersion exponential smoothing

Tue, 11 May 2021 00:00:00 +0000

In the previous post, I showed how to apply exponential smoothing to quantiles using the weighted Harrell-Davis quantile estimator. This technique allows getting smooth and stable moving median estimations. In this post, I’m going to discuss how to use the same approach to estimate moving dispersion.

Quantile exponential smoothing

Tue, 04 May 2021 00:00:00 +0000

One of the popular problems in time series analysis is estimating the moving “average” value. Let’s define the “average” as a central tendency metric like the mean or the median. When we talk about the moving value, we assume that we are interested in the average value “at the end” of the time series instead of the average of all available observations.

One of the most straightforward approaches to estimate the moving average is the simple moving mean. Unfortunately, this approach is not robust: outliers can instantly spoil the evaluated mean value. As an alternative, we can consider the simple moving median. I already discussed a few such methods: the MP² quantile estimator and a moving quantile estimator based on partitioning heaps (a modification of the Hardle-Steiger method). When we talk about simple moving averages, we typically assume that we estimate the average value over the last $k$ observations ($k$ is the window size). This approach is also known as unweighted moving averages because all target observations have the same weight.

As an alternative to the simple moving average, we can also consider the weighted moving average. In this case, we assign a weight for each observation and aggregate the whole time series according to these weights. A famous example of such a weight function is exponential smoothing. And the simplest form of exponential smoothing is the exponentially weighted moving mean. This approach estimates the weighted moving mean using exponentially decreasing weights. Switching from the simple moving mean to the exponentially weighted moving mean provides some benefits in terms of smoothness and estimation efficiency.
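For reference, the exponentially weighted moving mean can be sketched in Python using the common recursive form $s_t = \alpha x_t + (1 - \alpha) s_{t-1}$. The smoothing factor name `alpha` and the initialization with the first observation are assumptions of this sketch.

```python
def ewm_mean(series, alpha):
    """Exponentially weighted moving mean "at the end" of the series."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    s = series[0]  # assumption: start from the first observation
    for x in series[1:]:
        # The newest observation gets weight alpha; older ones decay by (1 - alpha).
        s = alpha * x + (1 - alpha) * s
    return s


print(ewm_mean([1, 2, 3], 0.5))  # 2.25
```

Since the result is still a weighted mean, a single extreme observation near the end of the series shifts it noticeably.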

Although exponential smoothing has advantages over the simple moving mean, it still estimates the mean value which is not robust. We can improve the robustness of this approach if we reuse the same idea for weighted moving quantiles. It’s possible because the quantiles also can be estimated for weighted samples. In one of my previous posts, I showed how to adapt the Hyndman-Fan Type 7 and Harrell-Davis quantile estimators to the weighted samples. In this post, I’m going to show how we can use this technique to estimate the weighted moving quantiles using exponentially decreasing weights.

Improving quantile-respectful density estimation for discrete distributions using jittering

Tue, 27 Apr 2021 00:00:00 +0000

In my previous posts, I already discussed the problem that arises when we try to build kernel density estimation (KDE) for samples with ties. We may get such samples in real life from discrete or mixed discrete/continuous distributions. Even if the original distribution is continuous, we may observe artificial sample discretization due to the limited resolution of the measuring tool. Such discretization may lead to inaccurate density plots due to undersmoothing. The problem can be resolved using a nice technique called jittering. I also discussed how to apply jittering to get a smoother version of KDE.

However, I’m not a huge fan of KDE because of two reasons. The first one is the problem of choosing a proper bandwidth value. With poorly chosen bandwidth, we can easily get oversmoothing or undersmoothing even without the discretization problem. The second one is an inconsistency between the KDE-based probability density function and evaluated sample quantiles. It could lead to inconsistent visualizations (e.g., KDE-based violin plots with non-KDE-based quantile values) or it could introduce problems for algorithms that require density function and quantile values at the same time. The inconsistency could be resolved using quantile-respectful density estimation (QRDE). This kind of estimation builds the density function which matches the evaluated sample quantiles. To get a smooth QRDE, we also need a smooth quantile estimator like the Harrell-Davis quantile estimator. The robustness and computational efficiency of this approach can be improved using the winsorized and trimmed modifications of the Harrell-Davis quantile estimator (which also have a decent statistical efficiency level).

Unfortunately, the straightforward QRDE calculation is not always applicable for samples with ties because it’s impossible to build an “honest” density function for discrete distributions without using the Dirac delta function. This is a severe problem for QRDE-based algorithms like the lowland multimodality detection algorithm. In this post, I will show how jittering could help to solve this problem and get a smooth QRDE on samples with ties.

How to build a smooth density estimation for a discrete sample using jittering

Tue, 20 Apr 2021 00:00:00 +0000

Update (2024-03-19): A better approach is presented in A better jittering approach for discretization acknowledgment in density estimation

Let’s say you have a sample with tied values. If you draw a kernel density estimation (KDE) for such a sample, you may get a serrated pattern like this:

KDE requires samples from continuous distributions while tied values arise in discrete or mixture distributions. Even if the original distribution is continuous, you may observe artificial sample discretization due to the limited resolution of the measuring tool. This effect may lead to distorted density plots like in the above picture.

The problem could be solved using a nice technique called jittering. In the simplest case, jittering just adds random noise to each measurement. Such a trick removes all ties from the sample and allows building a smooth density estimation.
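The simplest case can be sketched in Python as follows. Note that this is the naive randomized variant, not the non-randomized approach the post goes on to describe; the uniform noise within half of the measurement resolution and the parameter names are assumptions.

```python
import random


def jitter(sample, resolution, seed=None):
    """Add uniform noise from [-resolution/2, resolution/2] to each measurement."""
    rng = random.Random(seed)  # a fixed seed makes the noise pattern reproducible
    half = resolution / 2
    return [x + rng.uniform(-half, half) for x in sample]


x = [3.82, 4.61, 4.89, 4.91, 5.31, 5.6, 5.66, 7.00, 7.00, 7.00]
y = jitter(x, resolution=0.01, seed=42)
print(len(set(x)), len(set(y)))  # number of distinct values before and after jittering
```

Bounding the noise by half of the resolution keeps each jittered value within the interval that would be rounded to the original measurement.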

However, there are many different ways to apply jittering. The trickiest question here is how to choose proper noise values. In this post, I want to share one of my favorite jittering approaches. It generates a non-randomized noise pattern with a low risk of noticeable sample corruption.

Kernel density estimation and discrete values

Tue, 13 Apr 2021 00:00:00 +0000

Kernel density estimation (KDE) is a popular technique of data visualization. Based on the given sample, it allows estimating the probability density function (PDF) of the underlying distribution. Here is an example of KDE for x = {3.82, 4.61, 4.89, 4.91, 5.31, 5.6, 5.66, 7.00, 7.00, 7.00} (normal kernel, Sheather & Jones bandwidth selector):

KDE is a simple and straightforward way to build a PDF, but it’s not always the best one. In addition to my concerns about bandwidth selection, continuous use of KDE creates an illusion that all distributions are smooth and continuous. In practice, it’s not always true.

In the above picture, the distribution looks pretty continuous. However, the picture hides the fact that we have three 7.00 elements in the original sample. With continuous distributions, the probability of getting tied observations (that have the same value) is almost zero. If a sample contains ties, we are most likely working with either a discrete distribution or a mixture of discrete and continuous distributions. A KDE for such a sample may significantly differ from the actual PDF. Thus, this technique may mislead us instead of providing insights about the true underlying distribution.

In this post, we discuss the usage of PDF and PMF with continuous and discrete distributions. Also, we look at examples of corrupted density estimation plots for distributions with discrete features.

Efficiency of the winsorized and trimmed Harrell-Davis quantile estimators

Tue, 06 Apr 2021 00:00:00 +0000

In previous posts, I suggested two modifications of the Harrell-Davis quantile estimator: winsorized and trimmed. Both modifications have a higher level of robustness in comparison to the original estimator. Also, I discussed the efficiency of the Harrell-Davis quantile estimator. In this post, I’m going to continue numerical simulation and estimate the efficiency of the winsorized and trimmed modifications.

Trimmed modification of the Harrell-Davis quantile estimator

Tue, 30 Mar 2021 00:00:00 +0000

In one of the previous posts, I discussed the winsorized Harrell-Davis quantile estimator. This estimator is more robust than the classic Harrell-Davis quantile estimator. In this post, I want to suggest another modification that may be better for some corner cases: the trimmed Harrell-Davis quantile estimator.

Efficiency of the Harrell-Davis quantile estimator

Tue, 23 Mar 2021 00:00:00 +0000

One of the most essential properties of a quantile estimator is its efficiency. In simple words, the efficiency describes the estimator accuracy. The Harrell-Davis quantile estimator is a good option to achieve higher efficiency. However, this estimator may provide lower efficiency in some special cases. In this post, we will conduct a set of simulations that show the actual efficiency numbers. We compare different distributions (symmetric and right-skewed, heavy-tailed and light-tailed), quantiles, and sample sizes.

Navruz-Özdemir quantile estimator

Tue, 16 Mar 2021 00:00:00 +0000

The Navruz-Özdemir quantile estimator suggests the following equation to estimate the $p^\textrm{th}$ quantile of sample $X$:

$$ \begin{split} \operatorname{NO}_p = & \Big( (3p-1)X_{(1)} + (2-3p)X_{(2)} - (1-p)X_{(3)} \Big) B_0 +\\ & +\sum_{i=1}^n \Big((1-p)B_{i-1}+pB_i\Big)X_{(i)} +\\ & +\Big( -pX_{(n-2)} + (3p-1)X_{(n-1)} + (2-3p)X_{(n)} \Big) B_n \end{split} $$

where $B_i = B(i; n, p)$ is the probability mass function of the binomial distribution $B(n, p)$, and $X_{(i)}$ are the order statistics of sample $X$.

In this post, I derive these equations following the paper “A new quantile estimator with weights based on a subsampling approach” (2020) by Gözde Navruz and A. Fırat Özdemir. Also, I add some additional explanations, simplify the final equation, and provide reference implementations in C# and R.
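For a quick experiment, the $\operatorname{NO}_p$ equation above can be transcribed into Python directly. This is a straightforward sketch of the formula, not the reference implementation from the post; the function names are illustrative, and the sample is assumed to contain at least three elements.

```python
from math import comb


def binomial_pmf(i, n, p):
    """B(i; n, p): probability mass function of the binomial distribution."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def navruz_ozdemir_quantile(sample, p):
    """Direct transcription of the NO_p equation (one-based order statistics)."""
    n = len(sample)
    if n < 3:
        raise ValueError("the equation uses X_(3) and X_(n-2), so n >= 3 is required")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(sample)
    b = [binomial_pmf(i, n, p) for i in range(n + 1)]

    def x(i):
        return xs[i - 1]  # X_(i) with one-based indexing

    left = ((3 * p - 1) * x(1) + (2 - 3 * p) * x(2) - (1 - p) * x(3)) * b[0]
    middle = sum(((1 - p) * b[i - 1] + p * b[i]) * x(i) for i in range(1, n + 1))
    right = (-p * x(n - 2) + (3 * p - 1) * x(n - 1) + (2 - 3 * p) * x(n)) * b[n]
    return left + middle + right


print(navruz_ozdemir_quantile([1, 2, 3, 4, 5], 0.5))
```

A useful sanity check is that the weights of all order statistics sum to one, so a constant sample is mapped to the same constant.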

Sfakianakis-Verginis quantile estimator

Tue, 09 Mar 2021 00:00:00 +0000

There are dozens of different ways to estimate quantiles. One of these ways is to use the Sfakianakis-Verginis quantile estimator. To be more specific, it’s a family of three estimators. If we want to estimate the $p^\textrm{th}$ quantile of sample $X$, we can use one of the following equations:

$$ \begin{split} \operatorname{SV1}_p =& \frac{B_0}{2} \big( X_{(1)}+X_{(2)}-X_{(3)} \big) + \sum_{i=1}^{n} \frac{B_i+B_{i-1}}{2} X_{(i)} + \frac{B_n}{2} \big(- X_{(n-2)}+X_{(n-1)}+X_{(n)} \big),\\ \operatorname{SV2}_p =& \sum_{i=1}^{n} B_{i-1} X_{(i)} + B_n \cdot \big(2X_{(n)} - X_{(n-1)}\big),\\ \operatorname{SV3}_p =& \sum_{i=1}^n B_i X_{(i)} + B_0 \cdot \big(2X_{(1)}-X_{(2)}\big). \end{split} $$

where $B_i = B(i; n, p)$ is the probability mass function of the binomial distribution $B(n, p)$, and $X_{(i)}$ are the order statistics of sample $X$.

In this post, I derive these equations following the paper “A new family of nonparametric quantile estimators” (2008) by Michael E. Sfakianakis and Dimitris G. Verginis. Also, I add some additional explanations, reconstruct missing steps, simplify the final equations, and provide reference implementations in C# and R.
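The three equations can be transcribed into Python as a small sketch (not the reference implementation from the post). The function name is illustrative and the sample is assumed to contain at least three elements. In $\operatorname{SV1}_p$, the last bracket is taken as $-X_{(n-2)}+X_{(n-1)}+X_{(n)}$, which mirrors the first bracket and makes the weights sum to one.

```python
from math import comb


def sfakianakis_verginis(sample, p):
    """Return (SV1, SV2, SV3) estimations of the p-th quantile."""
    n = len(sample)
    if n < 3:
        raise ValueError("SV1 uses X_(3) and X_(n-2), so n >= 3 is required")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(sample)
    # b[i] = B(i; n, p), the binomial probability mass function
    b = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]

    def x(i):
        return xs[i - 1]  # X_(i) with one-based indexing

    sv1 = (
        b[0] / 2 * (x(1) + x(2) - x(3))
        + sum((b[i] + b[i - 1]) / 2 * x(i) for i in range(1, n + 1))
        + b[n] / 2 * (-x(n - 2) + x(n - 1) + x(n))
    )
    sv2 = sum(b[i - 1] * x(i) for i in range(1, n + 1)) + b[n] * (2 * x(n) - x(n - 1))
    sv3 = sum(b[i] * x(i) for i in range(1, n + 1)) + b[0] * (2 * x(1) - x(2))
    return sv1, sv2, sv3


print(sfakianakis_verginis([1, 2, 3, 4, 5], 0.5))
```

For a symmetric sample and $p = 0.5$, $\operatorname{SV1}$ returns the center of symmetry, while $\operatorname{SV2}$ and $\operatorname{SV3}$ deviate from it by the same amount in opposite directions.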

Winsorized modification of the Harrell-Davis quantile estimator

Tue, 02 Mar 2021 00:00:00 +0000

The Harrell-Davis quantile estimator is one of my favorite quantile estimators because of its efficiency. It has a small mean square error which allows getting accurate estimations. However, it has a severe drawback: it’s not robust. Indeed, since the estimator includes all sample elements with positive weights, its breakdown point is zero.

In this post, I want to suggest a modification of the Harrell-Davis quantile estimator which increases its robustness while keeping almost the same level of efficiency.

Misleading standard deviation

Tue, 23 Feb 2021 00:00:00 +0000

The standard deviation may be an extremely misleading metric. Even minor deviations from normality could make it completely unreliable and deceiving. Let me demonstrate this problem using an example.

Below you can see three density plots of some distributions. Could you guess their standard deviations?

The correct answers are $1.0, 3.0, 11.0$. And here is a more challenging problem: could you match these values with the corresponding distributions?

Unbiased median absolute deviation based on the Harrell-Davis quantile estimator

Tue, 16 Feb 2021 00:00:00 +0000

The median absolute deviation ($\textrm{MAD}$) is a robust measure of scale. In the previous post, I showed how to use the unbiased version of the $\textrm{MAD}$ estimator as a robust alternative to the standard deviation. “Unbiasedness” means that such estimator’s expected value equals the true value of the standard deviation. Unfortunately, there is such a thing as the bias–variance tradeoff: when we remove the bias of the $\textrm{MAD}$ estimator, we increase its variance and mean squared error ($\textrm{MSE}$).

In this post, I want to suggest a more efficient unbiased $\textrm{MAD}$ estimator. It’s also a consistent estimator for the standard deviation, but it has smaller $\textrm{MSE}$. To build this estimator, we should replace the classic “straightforward” median estimator with the Harrell-Davis quantile estimator and adjust bias-correction factors. Let’s discuss this approach in detail.

Unbiased median absolute deviation

Tue, 09 Feb 2021 00:00:00 +0000

The median absolute deviation ($\textrm{MAD}$) is a robust measure of scale. For distribution $X$, it can be calculated as follows:

$$ \textrm{MAD} = C \cdot \textrm{median}(|X - \textrm{median}(X)|) $$

where $C$ is a constant scale factor. This metric can be used as a robust alternative to the standard deviation. If we want to use the $\textrm{MAD}$ as a consistent estimator for the standard deviation under the normal distribution, we should set

$$ C = C_{\infty} = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. $$

where $\Phi^{-1}$ is the quantile function of the standard normal distribution (or the inverse of the cumulative distribution function). If $X$ is the normal distribution, we get $\textrm{MAD} = \sigma$ where $\sigma$ is the standard deviation.

Now let’s consider a sample $x = \{ x_1, x_2, \ldots x_n \}$. Let’s denote the median absolute deviation for a sample of size $n$ as $\textrm{MAD}_n$. The corresponding equation looks similar to the definition of $\textrm{MAD}$ for a distribution:

$$ \textrm{MAD}_n = C_n \cdot \textrm{median}(|x - \textrm{median}(x)|). $$

Let’s assume that $\textrm{median}$ is the straightforward definition of the median (if $n$ is odd, the median is the middle element of the sorted sample; if $n$ is even, the median is the arithmetic average of the two middle elements of the sorted sample). We still can use $C_n = C_{\infty}$ for extremely large sample sizes. However, for small $n$, $\textrm{MAD}_n$ becomes a biased estimator. If we want to get an unbiased version, we should adjust the value of $C_n$.

In this post, we look at the possible approaches and learn the way to get the exact value of $C_n$ that makes $\textrm{MAD}_n$ an unbiased estimator of the median absolute deviation for any $n$.
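The definitions above can be checked with a short Python sketch based on the standard library. It uses the asymptotic constant $C_{\infty}$ as the default scale factor and does not include the small-sample corrections $C_n$; the function name `mad` is an illustrative choice.

```python
from statistics import NormalDist, median

# C_inf = 1 / Phi^{-1}(3/4), the asymptotic consistency constant for the normal distribution
C_INF = 1 / NormalDist().inv_cdf(0.75)


def mad(sample, c=C_INF):
    """MAD_n = c * median(|x - median(x)|) with the straightforward sample median."""
    m = median(sample)
    return c * median([abs(v - m) for v in sample])


print(C_INF)                     # approximately 1.4826
print(mad([1, 2, 3, 4, 100]))    # the extreme value 100 barely affects the result
```

Passing `c=1` returns the raw median of absolute deviations, which is convenient when a custom $C_n$ should be applied afterwards.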

Comparing distribution quantiles using gamma effect size

Tue, 02 Feb 2021 00:00:00 +0000

There are several ways to describe the difference between two distributions. Here are a few examples:

Effect sizes based on differences between means (e.g., Cohen’s d, Glass’ Δ, Hedges’ g)
The shift and ratio functions that estimate differences between matched quantiles.

In one of the previous posts, I described the gamma effect size, which is defined not for the mean but for quantiles. In this post, I want to share a few case studies that demonstrate how the suggested metric combines the advantages of the above approaches.
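To make the idea of matched quantiles more concrete, here is a small Python sketch of the shift and ratio functions evaluated at deciles. It assumes that the shift is the difference and the ratio is the quotient of matched quantiles, and it uses `statistics.quantiles` as a stand-in quantile estimator. The function name `shift_and_ratio` and the data are hypothetical, and this is not the gamma effect size itself.

```python
from statistics import quantiles

def shift_and_ratio(x, y, n=10):
    """Differences and quotients of matched quantiles (deciles by default)."""
    qx = quantiles(x, n=n)
    qy = quantiles(y, n=n)
    shift = [b - a for a, b in zip(qx, qy)]
    ratio = [b / a for a, b in zip(qx, qy)]
    return shift, ratio

# Hypothetical data: y is x scaled by a factor of two.
x = [float(v) for v in range(1, 101)]
y = [2 * v for v in x]
shift, ratio = shift_and_ratio(x, y)
print(shift)  # grows with the quantile level
print(ratio)  # stays at 2 for every decile
```

For a pure scale change like this one, the ratio function is flat while the shift function is not, which is why the two are often inspected together.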

A single outlier could completely distort your Cohen's d value

Tue, 26 Jan 2021 00:00:00 +0000

Cohen’s d is a popular way to estimate the effect size between two samples. It works well for perfectly normal distributions. Usually, people think that slight deviations from normality shouldn’t produce a noticeable impact on the result. Unfortunately, that’s not always true. In fact, a single outlier value can completely distort the result even in large samples.

In this post, I will present some illustrations for this problem and will show how to fix it.
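A quick way to see the effect is the following Python sketch. It uses the usual pooled-standard-deviation form of Cohen’s d; the samples are synthetic and chosen only for illustration, so the exact numbers carry no meaning beyond this example.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Cohen's d based on the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var)

# Two synthetic samples of 21 values centered at 10 and 11.
x = [10 + 0.1 * i for i in range(-10, 11)]
y = [11 + 0.1 * i for i in range(-10, 11)]
# The same second sample with its last value replaced by one extreme outlier.
y_outlier = y[:-1] + [1000.0]

d_clean = cohens_d(x, y)            # about 1.6
d_outlier = cohens_d(x, y_outlier)  # about 0.3
print(d_clean, d_outlier)
```

The outlier moves the mean of the second sample far away, but it inflates the pooled standard deviation even more, so the reported effect size shrinks instead of growing.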

Better moving quantile estimations using the partitioning heaps

Tue, 19 Jan 2021 00:00:00 +0000

In one of the previous posts, I discussed the Hardle-Steiger method. This algorithm allows estimating the moving median using $O(L)$ memory and $O(\log(L))$ element processing complexity (where $L$ is the window size). Also, I showed how to adapt this approach to estimate any moving quantile.

In this post, I’m going to present further improvements. The Hardle-Steiger method always returns an order statistic, which is the $k\textrm{th}$ smallest element from the sample. It means that the estimated quantile value always equals one of the last $L$ observed numbers. However, many of the classic quantile estimators use two elements. For example, if we want to estimate the median for $x = \{4, 5, 6, 7\}$, some estimators return $5.5$ (which is the arithmetic mean of $5$ and $6$) instead of $5$ or $6$ (which are order statistics).
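The difference can be reproduced with Python’s standard library, which is used here only as a convenient illustration and has nothing to do with the partitioning heaps themselves:

```python
from statistics import median, median_high, median_low

x = [4, 5, 6, 7]
lower = median_low(x)    # 5, an order statistic
upper = median_high(x)   # 6, an order statistic
averaged = median(x)     # 5.5, the mean of the two middle elements
print(lower, upper, averaged)
```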

Let’s learn how to implement a moving version of such estimators using the partitioning heaps from the Hardle-Steiger method.

MP² quantile estimator: estimating the moving median without storing values

Tue, 12 Jan 2021 00:00:00 +0000

In one of the previous posts, I described the P² quantile estimator. It allows estimating quantiles on a stream of numbers without storing them. Such sequential (streaming/online) quantile estimators are useful in software telemetry because they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.

After the publication, I got a lot of questions about moving sequential quantile estimators. Such estimators return quantile values not for the whole stream of numbers, but only for the recent values. So, I wrote another post about a quantile estimator based on partitioning heaps (inspired by the Hardle-Steiger method). This algorithm gives you the exact value of any order statistic for the last $L$ numbers ($L$ is known as the window size). However, it requires $O(L)$ memory, and it takes $O(\log(L))$ time to process each element. This may be acceptable in some cases. Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large $L$.

In this post, I’m going to present a moving modification of the P² quantile estimator. Let’s call it MP² (moving P²). It requires $O(1)$ memory, it takes $O(1)$ to process each element, and it supports windows of any size. Of course, we have a trade-off with the estimation accuracy: it returns a quantile approximation instead of the exact order statistic. However, in most cases, the MP² estimations are pretty accurate from the practical point of view.

Let’s discuss MP² in detail!

Case study: Accuracy of the MAD estimation using the Harrell-Davis quantile estimator (Gumbel distribution)

Tue, 05 Jan 2021 00:00:00 +0000

In some of my previous posts, I used the median absolute deviation (MAD) to describe the distribution dispersion:

The MAD estimation depends on the chosen median estimator: we may get different MAD values with different median estimators. To get better accuracy, I always encourage readers to use the Harrell-Davis quantile estimator instead of the classic Type 7 quantile estimator.

In this case study, I decided to compare these two quantile estimators using the Gumbel distribution (it’s a good model for slightly right-skewed distributions). According to the performed Monte Carlo simulation, the Harrell-Davis quantile estimator always has better accuracy:

Fast implementation of the moving quantile based on the partitioning heaps

Tue, 29 Dec 2020 00:00:00 +0000

Imagine you have a time series. Let’s say, after each new observation, you want to know an “average” value across the last $L$ observations. Such a metric is known as a moving average (or rolling/running average).

The most popular moving average example is the moving mean. It’s easy to efficiently implement this metric. However, it has a major drawback: it’s not robust. Outliers can easily spoil the moving mean and transform it into a meaningless and untrustable metric.

Fortunately, we have a good alternative: the moving median. Typically, it generates a stable and smooth series of values. In the below figure, you can see the difference between the moving mean and the moving median on noisy data.

The moving median also has a drawback: it’s not easy to implement it efficiently. Today we are going to discuss the Hardle-Steiger method to estimate the median (memory: $O(L)$, element processing complexity: $O(\log(L))$, median estimating complexity: $O(1)$). Also, we will learn how to calculate the moving quantiles based on this method.

In this post, you will find the following:

An overview of the Hardle-Steiger method
A simple way to implement the Hardle-Steiger method
Moving quantiles inspired by the Hardle-Steiger method
How to process initial elements
Reference C# implementation
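Before diving into the heaps, it may help to see the robustness argument on a toy example. The Python sketch below is deliberately naive: it keeps the window in a sorted list, so each update costs $O(L)$, and it is not the Hardle-Steiger method. The function names and the data are made up for illustration.

```python
from bisect import bisect_left, insort
from collections import deque

def moving_mean(values, window_size):
    """Mean of each full window of the last window_size values."""
    return [sum(values[i - window_size + 1:i + 1]) / window_size
            for i in range(window_size - 1, len(values))]

def moving_median(values, window_size):
    """Median of each full window, maintained via a sorted copy of the window."""
    window = deque()
    sorted_window = []
    result = []
    for v in values:
        window.append(v)
        insort(sorted_window, v)
        if len(window) > window_size:
            oldest = window.popleft()
            del sorted_window[bisect_left(sorted_window, oldest)]
        if len(window) == window_size:
            mid = window_size // 2
            if window_size % 2 == 1:
                result.append(sorted_window[mid])
            else:
                result.append((sorted_window[mid - 1] + sorted_window[mid]) / 2)
    return result

data = [1, 2, 3, 100, 5, 6, 7]   # a single outlier in the middle
means = moving_mean(data, 3)     # [2.0, 35.0, 36.0, 37.0, 6.0]
medians = moving_median(data, 3) # [2, 3, 5, 6, 6]
print(means)
print(medians)
```

The single outlier dominates three consecutive values of the moving mean, while the moving median never reports it.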

Coverage of quantile confidence intervals

Tue, 22 Dec 2020 00:00:00 +0000

There is a common misunderstanding that a 95% confidence interval is an interval that covers the true parameter value with 95% probability. Meanwhile, the correct definition assumes that the true parameter value will be covered by 95% of 95% confidence intervals in the long run. These two statements sound similar, but there is a huge difference between them. 95% in this context is not a property of a single confidence interval. Once you get a calculated interval, it may cover the true value (100% probability) or it may not cover it (0% probability). In fact, 95% is a prediction about the percentage of future confidence intervals that cover the true value in the long run.

However, even if you know the correct definition, you may still run into trouble. The first thing people usually forget is the “long run” part. For example, if we collected 100 samples and calculated a 95% confidence interval of a parameter for each of them, we shouldn’t expect that exactly 95 of these intervals cover the true parameter value. In fact, we can observe a situation when none of these intervals covers the true value. Of course, this is an unlikely event, but if you automatically perform thousands of different experiments, you will definitely get some extreme situations.
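The “long run” part is easy to observe in a simulation. The following Python sketch is a simplified setup that I assume for illustration only: a normal distribution with a known standard deviation and a z-interval around the mean, not the quantile intervals discussed later. It counts how many intervals out of each batch of 100 cover the true mean.

```python
import random
from statistics import NormalDist, mean

def ci_covers(rng, n=30, mu=0.0, sigma=1.0, level=0.95):
    """Draw one sample and check whether its z-interval for the mean covers mu."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / n ** 0.5
    return abs(mean(sample) - mu) <= half_width

rng = random.Random(42)  # fixed seed to make the run reproducible
# 20 batches of 100 confidence intervals each.
batches = [sum(ci_covers(rng) for _ in range(100)) for _ in range(20)]
overall = sum(batches) / (100 * 20)
print(batches)  # per-batch counts fluctuate around 95
print(overall)  # the long-run share is close to 0.95
```

Individual batches typically deviate from 95, and only the aggregated share settles near the nominal level.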

The second thing that may create trouble is the “prediction” part. If weather forecasters predicted that it will rain tomorrow, this does not mean that it will rain tomorrow. The same works for statistical predictions. The actual prediction reliability may depend on many factors. If you estimate confidence intervals around the mean for the normal distribution, you are most likely safe. However, if you estimate confidence intervals around quantiles for non-parametric distributions, you should care about the following things:

The used approach to estimate confidence intervals
The underlying distribution
The sample size
The position of the target quantile

I have already shown how to estimate the confidence interval around the given quantile using the Maritz-Jarrett method. It’s time to verify the reliability of this approach. In this post, I’m going to show some Monte-Carlo simulations that evaluate the coverage percentage in different situations.

Statistical approaches for performance analysis

Tue, 15 Dec 2020 00:00:00 +0000

Software performance is a complex discipline that requires knowledge in different areas from benchmarking to the internals of modern runtimes, operating systems, and hardware. Surprisingly, the most difficult challenges in performance analysis are not about programming, they are about mathematical statistics!

Many software developers can drill into performance problems and implement excellent optimizations, but they do not always know how to correctly verify these optimizations. This may not look like a problem in the case of a single performance investigation. However, the situation becomes worse when developers try to set up an infrastructure that should automatically find performance problems or prevent degradations from merging. In order to make such an infrastructure reliable and useful, it’s crucial to achieve an extremely low false-positive rate (otherwise, it’s not trustable) and be able to detect most of the degradations (otherwise, it’s not so useful). It’s not easy if you don’t know which statistical approaches should be used. If you try to google it, you may find thousands of papers about statistics, but only a small portion of them really works in practice.

In this post, I want to share some approaches that I use for performance analysis in everyday life. I have been analyzing performance distributions for the last seven years, and I have found a lot of approaches, metrics, and tricks that are nice to have in your statistical toolbox. I would not say that all of them are must-knows, but they can definitely help you to improve the reliability of your statistical checks in different problems of performance analysis. Consider the below list as a letter to a younger version of myself with a brief list of topics that are good to learn.

Quantile confidence intervals for weighted samples

Tue, 08 Dec 2020 00:00:00 +0000

Update 2021-07-06: the approach was updated using the Kish’s effective sample size.

When you work with non-parametric distributions, quantile estimations are essential to get the main distribution properties. Once you get the estimation values, you may be interested in measuring the accuracy of these estimations. Without it, it’s hard to understand how trustable the obtained values are. One of the most popular ways to evaluate accuracy is confidence interval estimation.

Now imagine that you collect some measurements every day. Each day you get a small sample of values that is not enough to get accurate daily quantile estimations. However, the full time series over the last several weeks has a decent size. You suspect that past measurements should be similar to today’s measurements, but you are not 100% sure about it. You feel a temptation to extend the up-to-date sample by the previously collected values, but it may spoil the estimation (e.g., in the case of recent change points or positive/negative trends).

One of the possible approaches in this situation is to use weighted samples. This assumes that we add past measurements to the “today sample,” but these values should have smaller weights. The older a measurement is, the smaller the weight it gets. If you have consistent values across the last several days, this approach works like a charm. If you have any recent changes, you can detect such situations by huge confidence intervals due to the sample inconsistency.

So, how do we estimate confidence intervals around quantiles for the weighted samples? In one of the previous posts, I have already shown how to estimate quantiles on weighted samples. In this post, I will show how to estimate quantile confidence intervals for weighted samples.
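As a small illustration of the weighting idea and of Kish’s effective sample size mentioned in the update note, consider the Python sketch below. The exponential decay law, the half-life parameter, and the function names are my assumptions for this example; the sketch does not reproduce the confidence interval procedure itself.

```python
def exponential_weights(n, half_life):
    """Weights for n measurements ordered from oldest to newest; the newest gets 1."""
    return [0.5 ** ((n - 1 - i) / half_life) for i in range(n)]

def kish_effective_sample_size(weights):
    """Kish's effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

weights = exponential_weights(3, half_life=1)   # [0.25, 0.5, 1.0]
ess = kish_effective_sample_size(weights)       # 7/3, i.e. fewer than 3 "full" values
print(weights, ess)
print(kish_effective_sample_size([1, 1, 1, 1]))  # equal weights give the plain size, 4.0
```

The faster the weights decay, the smaller the effective sample size becomes, which is one way to explain why heavily discounted history yields wider intervals.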

Quantile absolute deviation: estimating statistical dispersion around quantiles

Tue, 01 Dec 2020 00:00:00 +0000

There are many different metrics for statistical dispersion. The most famous one is the standard deviation. The standard deviation is the most popular way to describe the spread around the mean when you work with normally distributed data. However, if you work with non-normal distributions, this metric may be misleading.

In the world of non-parametric distributions, the most common measure of central tendency is the median. For the median, you can describe dispersion using the median absolute deviation around the median (MAD). It works great if the median is the only summary statistic that you care about. However, if you work with multimodal distributions (they can be detected using the lowland multimodality detector), you may be interested in other quantiles as well. So, it makes sense to learn how to describe dispersion around the given quantile. Which metric should we choose?

Recently, I came up with a great solution to this problem. We can generalize the median absolute deviation into the quantile absolute deviation (QAD) around the given quantile based on the Harrell-Davis quantile estimator. I will show how to calculate it, how to interpret it, and how to get insights about distribution properties from images like this one:

P² quantile estimator: estimating the median without storing values

Tue, 24 Nov 2020 00:00:00 +0000

Update: the estimator accuracy could be improved using a bunch of patches.

Imagine that you are implementing performance telemetry in your application. There is an operation that is executed millions of times, and you want to get its “average” duration. It’s not a good idea to use the arithmetic mean because the obtained value can be easily spoiled by outliers. It’s much better to use the median which is one of the most robust ways to describe the average.

The straightforward median estimation approach requires storing all the values. In our case, it’s a bad idea to keep all the values because it will significantly increase the memory footprint. Such telemetry is harmful because it may become a new bottleneck instead of monitoring the actual performance.

Another way to get the median value is to use a sequential quantile estimator (also known as an online quantile estimator or a streaming quantile estimator). This is an algorithm that allows calculating the median value (or any other quantile value) using a fixed amount of memory. Of course, it provides only an approximation of the real median value, but it’s usually enough for typical telemetry use cases.

In this post, I will show one of the simplest sequential quantile estimators that is called the P² quantile estimator (or the Piecewise-Parabolic quantile estimator).

Plain-text summary notation for multimodal distributions

Tue, 17 Nov 2020 00:00:00 +0000

Let’s say you collected a lot of data and want to explore the underlying distributions of collected samples. If you have only a few distributions, the best way to do that is to look at the density plots (expressed via histograms, kernel density estimations, or quantile-respectful density estimations). However, it’s not always possible.

Suppose you have to process dozens, hundreds, or even thousands of distributions. In that case, it may be extremely time-consuming to manually check visualizations of each distribution. If you analyze distributions from the command line or send notifications about suspicious samples, it may be impossible to embed images in the reports. In these cases, there is a need to present a distribution using plain text.

One way to do that is plain text histograms. Unfortunately, this kind of visualization may occupy a lot of space. In complicated cases, you may need 20 or 30 lines for a single distribution.

Another way is to present classic summary statistics like mean or median, standard deviation or median absolute deviation, quantiles, skewness, and kurtosis. There is another problem here: without experience, it’s hard to reconstruct the true distribution shape based on these values. Even if you are an experienced researcher, the statistical metrics may become misleading in the case of multimodal distributions. Multimodality is one of the most severe challenges in distribution analysis because it distorts basic summary statistics. It’s important to not only find such distributions but also have a way to present brief information about multimodality effects.

So, how can we condense the underlying distribution shape of a given sample to a short text line? I didn’t manage to find an approach that works fine in my cases, so I came up with my own notation. Most of the interpretation problems in my experiments arise from multimodality and outliers, so I decided to focus on these two things and specifically highlight them. Let’s consider this plot:

I suggest describing it like this:

{1.00, 2.00} + [7.16; 13.12]_100 + {19.00} + [27.69; 32.34]_100 + {37.00..39.00}_3

Let me explain the suggested notation in detail.

Intermodal outliers

Tue, 10 Nov 2020 00:00:00 +0000

Outlier analysis is a typical step in distribution exploration. Usually, we work with the “lower outliers” (extremely low values) and the “upper outliers” (extremely high values). However, outliers are not always extreme values. In the general case, an outlier is a value that significantly differs from other values in the same sample. In the case of multimodal distribution, we can also consider outliers in the middle of the distribution. Let’s call such outliers that we found between modes the “intermodal outliers.”

Look at the above density plot. It’s a bimodal distribution that is formed as a combination of two unimodal distributions. Each of the unimodal distributions may have its own lower and upper outliers. When we merge them, the upper outliers of the first distribution and the lower outliers of the second distribution stop being lower or upper outliers. However, if these values don’t belong to the modes, they still are a subject of interest. In this post, I will show you how to detect such intermodal outliers and how they can be used to form a better distribution description.

Lowland multimodality detection

Tue, 03 Nov 2020 00:00:00 +0000

Multimodality is an essential feature of a distribution, which may cause a lot of trouble during automatic analysis. One of the best ways to work with such distributions is to detect all the modes in advance based on the given samples. Unfortunately, this problem is much harder than it looks.

I tried many different approaches for multimodality detection, but none of them was good enough. During the past several years, my approach of choice was the mvalue-based modal test by Brendan Gregg. It works nicely in simple cases, but I was constantly stumbling over noisy samples where this algorithm doesn’t produce reliable results. Also, it has some limitations that make it inapplicable to some corner cases.

So, I needed a better approach. Here are my main requirements:

It should detect the exact mode locations and ranges
It should provide reliable results even on noisy samples
It should be able to detect multimodality even when some modes are extremely close to each other
It should work out of the box without tricky parameter tuning for each specific distribution

I failed to find such an algorithm anywhere, so I came up with my own! The current working title is “the lowland multimodality detector.” It takes an estimation of the probability density function (PDF) and tries to find “lowlands” (areas that are much lower than neighboring peaks). Next, it splits the plot by these lowlands and detects modes between them. For the PDF estimation, it uses the quantile-respectful density estimation based on the Harrell-Davis quantile estimator (QRDE-HD) (see also harrell1982). Let me explain how it works in detail.

Quantile-respectful density estimation based on the Harrell-Davis quantile estimator

Tue, 27 Oct 2020 00:00:00 +0000

The idea of this post was born when I was working on a presentation for my recent DotNext talk. It had a slide with a density plot like this:

Here we can see a density plot based on a sample with highlighted decile locations that split the plot into 10 equal parts. Before the conference, the talk was reviewed by @VladimirSitnikv. He raised a reasonable concern: it doesn’t look like all the density plot segments are equal and contain exactly 10% of the whole plot. And he was right!

However, I didn’t make any miscalculations. I generated a real sample with 61 elements. Next, I built a density plot with the kernel density estimation (KDE) using the Sheather & Jones method and the normal kernel. Next, I calculated decile values using the Harrell-Davis quantile estimator. Although both the density plot and the decile values are calculated correctly and are consistent with the sample, they are not consistent with each other! Indeed, such a density plot is just an estimation of the underlying distribution. It has its own decile values, which are not equal to the sample decile values regardless of the used quantile estimator. This problem is common for different kinds of visualizations that present density and quantiles at the same time (e.g., violin plots).

It leads us to a question: how should we present the shape of our data together with quantile values without confusing inconsistency in the final image? Today I will present a good solution: we should use the quantile-respectful density estimation based on the Harrell-Davis quantile estimator! I know the title is a bit long, but it’s not as complicated as it sounds. In this post, I will show how to build such plots. Also, I will compare them to the classic histograms and kernel density estimations. As a bonus, I will demonstrate how awesome these plots are for multimodality detection.

Misleading histograms

Tue, 20 Oct 2020 00:00:00 +0000

Below you see two histograms. What could you say about them?

Most likely, you say that the first histogram is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both histograms are based on the same sample:

20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13

Thus, the only difference between histograms is the offset!
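The offset effect can be reproduced without any plotting. The Python sketch below bins the sample above with one bin width and two different offsets. The bin width of 5 and the offsets 15 and 17.5 are my own choices for illustration; they are not necessarily the values behind the original figures.

```python
from collections import Counter

sample = [
    20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
    29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
    40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
    49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13,
]

def histogram_counts(values, bin_width, offset):
    """Counts per bin, where bin k covers [offset + k * width, offset + (k + 1) * width)."""
    counts = Counter(int((v - offset) // bin_width) for v in values)
    return [counts.get(k, 0) for k in range(min(counts), max(counts) + 1)]

# Bin edges at 15, 20, 25, ...: every cluster is split between two neighboring bins.
split_clusters = histogram_counts(sample, bin_width=5, offset=15)
# Bin edges at 17.5, 22.5, ...: every cluster falls into its own bin.
whole_clusters = histogram_counts(sample, bin_width=5, offset=17.5)
print(split_clusters)
print(whole_clusters)
```

With the first offset, all bins are occupied and the four clusters are hidden; with the second one, every other bin is empty and the four groups become obvious.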

Visualization is a simple way to understand the shape of your data. Unfortunately, this way may easily become a slippery slope. In the previous post, I have shown how density plots may deceive you when the bandwidth is poorly chosen. Today, we talk about histograms and why you can’t trust them in the general case.

The importance of kernel density estimation bandwidth

Tue, 13 Oct 2020 00:00:00 +0000

Below you see two kernel density estimations. What could you say about them?

Most likely, you say that the first plot is based on a uniform distribution, and the second one is based on a multimodal distribution with four modes. Although this is not obvious from the plots, both density plots are based on the same sample:

21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03

The only difference between plots is in bandwidth selection!
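The same effect can be checked numerically. The Python sketch below builds a Gaussian kernel density estimation by hand and counts local maxima on a grid for two bandwidths. The bandwidth values 1.0 and 10.0, the grid step, and the function names are assumptions made for this illustration; they are not the output of any particular bandwidth selector.

```python
from math import exp, pi, sqrt

sample = [
    21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
    31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
    39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
    50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03,
]

def gaussian_kde(values, bandwidth):
    """Return a density function based on the normal kernel with a fixed bandwidth."""
    norm = 1.0 / (len(values) * bandwidth * sqrt(2 * pi))
    def density(x):
        return norm * sum(exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
    return density

def count_modes(values, bandwidth, step=0.1):
    """Count strict local maxima of the KDE evaluated on a regular grid."""
    density = gaussian_kde(values, bandwidth)
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    n = int((hi - lo) / step)
    ys = [density(lo + i * step) for i in range(n + 1)]
    return sum(1 for i in range(1, n) if ys[i - 1] < ys[i] > ys[i + 1])

narrow = count_modes(sample, bandwidth=1.0)   # the four clusters stay separated
wide = count_modes(sample, bandwidth=10.0)    # the clusters merge into a single bump
print(narrow, wide)
```

A narrow bandwidth may also produce additional small bumps inside a cluster, which is the opposite failure mode of oversmoothing.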

Bandwidth selection is crucial when you are trying to visualize your distributions. Unfortunately, most people just call a regular function to build a density plot and don’t think about how the bandwidth will be chosen. As a result, the plot may present data in the wrong way, which may lead to incorrect conclusions. Let’s discuss bandwidth selection in detail and figure out how to improve the correctness of your density plots. In this post, we will cover the following topics:

Kernel density estimation
How bandwidth selection affects plot smoothness
Which bandwidth selectors can we use
Which bandwidth selectors should we use
Insidious default bandwidth selectors in statistical packages

The median absolute deviation value of the Gumbel distribution

Tue, 06 Oct 2020 00:00:00 +0000

The Gumbel distribution is not only a useful model in the extreme value theory, but it’s also a nice example of a slightly right-skewed distribution (skewness $\approx 1.14$). Here is its density plot:

In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests. I also prefer the median absolute deviation (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of non-parametric distributions. Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution. I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself. In this post, you will find a short derivation and the result (spoiler: the exact value is 0.767049251325708 * β). The general approach of the MAD derivation is common for most distributions, so it can be easily reused.
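The stated constant is easy to double-check numerically. The Python sketch below assumes the standard Gumbel distribution (location 0, scale $\beta = 1$) with the CDF $F(x) = e^{-e^{-x}}$ and looks for the half-width $d$ of the interval around the median that contains exactly half of the probability mass. Bisection is used here only as a sanity check; it is not the analytical derivation from the post.

```python
from math import exp, log

def gumbel_cdf(x):
    """CDF of the standard Gumbel distribution (location 0, scale 1)."""
    return exp(-exp(-x))

def gumbel_mad():
    """Solve F(m + d) - F(m - d) = 0.5 for d by bisection, where m is the median."""
    m = -log(log(2.0))  # median of the standard Gumbel distribution
    lo, hi = 0.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if gumbel_cdf(m + mid) - gumbel_cdf(m - mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(gumbel_mad())  # agrees with 0.767049251325708 to many digits
```

For an arbitrary scale parameter, the result is multiplied by $\beta$ because the MAD is scale-equivariant.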

Weighted quantile estimators

Tue, 29 Sep 2020 00:00:00 +0000

Update 2021-07-06: the approach was updated using the Kish’s effective sample size.

In this post, I will show how to calculate weighted quantile estimates and how to use them in practice.

Let’s start with a problem from real life. Imagine that you measure the total duration of a unit test executed daily on a CI server. Every day you get a single number that corresponds to the test duration from the latest revision for this day:

You collect a history of such measurements for 100 days. Now you want to describe the “actual” distribution of the performance measurements.

However, for the latest “actual” revision, you have only a single measurement, which is not enough to build a distribution. Also, you can’t build a distribution based on the last N measurements because they can contain change points that will spoil your results. So, what you really want to do is to use all the measurements, but older values should have a lower impact on the final distribution form.

Such a problem can be solved using the weighted quantiles! This powerful approach can be applied to any time series regardless of the domain area. In this post, we learn how to calculate and apply weighted quantiles.
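To show the effect of the weights before going into details, here is a deliberately simple Python sketch. It uses a step-function rule (the smallest value whose cumulative weight share reaches $p$) and exponentially decaying weights; both are assumptions for this illustration and are simpler than the estimators discussed in the post. The function name and the durations are hypothetical.

```python
def weighted_quantile(values, weights, p):
    """Smallest value whose cumulative weight reaches the share p of the total weight."""
    pairs = sorted(zip(values, weights))
    threshold = p * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return pairs[-1][0]

# Hypothetical daily durations, oldest first, with a change point before the last two days.
durations = [100, 100, 100, 100, 200, 200]
# The newest measurement gets weight 1; each older one gets half the weight of the next.
weights = [0.5 ** (len(durations) - 1 - i) for i in range(len(durations))]

recent_median = weighted_quantile(durations, weights, 0.5)       # 200
plain_median = weighted_quantile(durations, [1] * 6, 0.5)        # 100
print(recent_median, plain_median)
```

With equal weights, the old level still wins; with decaying weights, the estimate follows the latest revisions.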

Nonparametric Cohen's d-consistent effect size

Thu, 25 Jun 2020 00:00:00 +0000

Update: the second part of this post is available here.

The effect size is a common way to describe a difference between two distributions. When these distributions are normal, one of the most popular approaches to express the effect size is Cohen’s d. Unfortunately, it doesn’t work great for non-normal distributions.

In this post, I will show a robust Cohen’s d-consistent effect size formula for nonparametric distributions.

DoubleMAD outlier detector based on the Harrell-Davis quantile estimator

Mon, 22 Jun 2020 00:00:00 +0000

Outlier detection is an important step in data processing. Unfortunately, if the distribution is not normal (e.g., right-skewed and heavy-tailed), it’s hard to choose a robust outlier detection algorithm that will not be affected by tricky distribution properties. During the last several years, I tried many different approaches, but I was not satisfied with their results. Finally, I found an algorithm to which I have (almost) no complaints. It’s based on the double median absolute deviation and the Harrell-Davis quantile estimator. In this post, I will show how it works and why it’s better than some other approaches.

How ListSeparator Depends on Runtime and Operating System

Wed, 20 May 2020 00:00:00 +0000

This blog post was originally posted on JetBrains .NET blog.

In the two previous blog posts from this series, we discussed how socket errors and sorting orders depend on the runtime and operating systems. For some, it may be obvious that some things are indeed specific to the operating system or the runtime, but often these issues come as a surprise and are only discovered when running our code on different systems. An interesting example that may bite us at runtime is using ListSeparator in our code. It should give us a common separator for list elements in a string. But is it really common? Let’s start our investigation by printing ListSeparator for the Russian language:

Console.WriteLine(new CultureInfo("ru-ru").TextInfo.ListSeparator);

On Windows, you will get the same result for .NET Framework, .NET Core, and Mono: the ListSeparator is ; (a semicolon). You will also get a semicolon on Mono+Unix. However, on .NET Core+Unix, you will get a non-breaking space.

How Sorting Order Depends on Runtime and Operating System

Wed, 13 May 2020 00:00:00 +0000

This blog post was originally posted on JetBrains .NET blog.

In Rider, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files: jquery-1.4.1.js, jquery-1.4.1.min.js, jquery-1.4.1-vsdoc.js. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:

jquery-1.4.1.js
jquery-1.4.1.min.js
jquery-1.4.1-vsdoc.js
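For comparison, a plain code-point (ordinal) comparison orders these names differently, because the hyphen (U+002D) precedes the dot (U+002E). The Python sketch below only illustrates this ordinal order; it says nothing about which comparer a particular .NET runtime picks.

```python
files = ["jquery-1.4.1.js", "jquery-1.4.1.min.js", "jquery-1.4.1-vsdoc.js"]

# Python compares strings by code points, so '-' (45) sorts before '.' (46).
ordinal = sorted(files)
print(ordinal)
# ['jquery-1.4.1-vsdoc.js', 'jquery-1.4.1.js', 'jquery-1.4.1.min.js']
```

So the order shown above cannot come from an ordinal comparison, which hints that culture-sensitive collation rules are involved.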

How Socket Error Codes Depend on Runtime and Operating System

Mon, 27 Apr 2020 00:00:00 +0000

This blog post was originally posted on JetBrains .NET blog.

Rider consists of several processes that send messages to each other via sockets. To ensure the reliability of the whole application, it’s important to properly handle all the socket errors. In our codebase, we had the following code which was adopted from Mono Debugger Libs and helps us communicate with debugger processes:

protected virtual bool ShouldRetryConnection (Exception ex, int attemptNumber)
{
    var sx = ex as SocketException;
    if (sx != null) {
        if (sx.ErrorCode == 10061) //connection refused
            return true;
    }
    return false;
}

In the case of a failed connection because of a “ConnectionRefused” error, we are retrying the connection attempt. It works fine with .NET Framework and Mono. However, once we migrated to .NET Core, this method no longer correctly detects the “connection refused” situation on Linux and macOS. If we open the SocketException documentation, we will learn that this class has three different properties with error codes:

SocketError SocketErrorCode: Gets the error code that is associated with this exception.
int ErrorCode: Gets the error code that is associated with this exception.
int NativeErrorCode: Gets the Win32 error code associated with this exception.

What's the difference between these properties? Should we expect different values on different runtimes or different operating systems? Which one should we use in production? Why do we have problems with ShouldRetryConnection on .NET Core? Let's figure it all out!

.NET Core performance revolution in Rider 2020.1

Tue, 14 Apr 2020 00:00:00 +0000

This blog post was originally posted on JetBrains .NET blog.

Many Rider users may know that the IDE has two main processes: frontend (Java-application based on the IntelliJ platform) and backend (.NET-application based on ReSharper). Since the first release of Rider, we’ve used Mono as the backend runtime on Linux and macOS. A few years ago, we decided to migrate to .NET Core. After resolving hundreds of technical challenges, we are finally ready to present the .NET Core edition of Rider!

In this blog post, we want to share the results of some benchmarks that compare the Mono-powered and the .NET Core-powered editions of Rider. You may find this interesting if you are also thinking about migrating to .NET Core, or if you just want a high-level overview of the improvements to Rider in terms of performance and footprint, following the migration. (Spoiler: they’re huge!)

Introducing perfolizer

Wed, 04 Mar 2020 00:00:00 +0000

Over the last 7 years, I’ve been maintaining BenchmarkDotNet; it’s a library that helps you to transform methods into benchmarks, track their performance, and share reproducible measurement experiments. Today, BenchmarkDotNet is the most popular .NET library for benchmarking; it has been adopted by 3500+ projects, including .NET Core.

While it has tons of benchmarking features that allow getting reliable and accurate measurements, it has a limited set of features for performance analysis, and that’s a problem for many developers. Lately, I started to get a lot of emails in which people ask me “OK, I benchmarked my application and got tons of numbers. What should I do next?” It’s an excellent question that requires special tools. So, I decided to start another project that focuses specifically on performance analysis.

Meet perfolizer — a toolkit for performance analysis! The source code is available on GitHub under the MIT license.

Distribution comparison via the shift and ratio functions

Fri, 11 Oct 2019 00:00:00 +0000

When we compare two distributions, it’s not always enough to detect a statistically significant difference between them. In many cases, we also want to evaluate the magnitude of this difference. Let’s look at the following image:

On the left side, we can see a timeline plot with 2000 points (in the middle of this plot, the distribution changed significantly). On the right side, we can see density plots for the left and the right halves of the timeline plot (before and after the change). It’s a pretty simple case: the difference between the distributions can be expressed via the difference between the mean values.

Now let’s look at a more tricky case:

Here we have a bimodal distribution; after the change, the left mode “moved right.” Now it’s much harder to evaluate the difference between the distributions because the mean and the median values have barely changed: the right mode has a bigger impact on these metrics than the left mode.

And here is a much more tricky case:

Here we also have a bimodal distribution; after the change, both modes moved: the left mode “moved right” and the right mode “moved left.” How should we describe the difference between these distributions now?
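One way to describe such changes is to compare the two distributions quantile by quantile, which is the idea behind the shift function from the title. Below is a minimal sketch, assuming a plain linear-interpolation quantile estimator (the full post may rely on a different estimator); the names ShiftFunctionSketch, Quantile, and Shift are hypothetical.

```csharp
using System;
using System.Linq;

public static class ShiftFunctionSketch
{
    // Linear-interpolation quantile of an already sorted sample.
    // This estimator is an assumption made for the sketch; other estimators exist.
    public static double Quantile(double[] sorted, double p)
    {
        double h = (sorted.Length - 1) * p;
        int lo = (int)Math.Floor(h);
        int hi = Math.Min(lo + 1, sorted.Length - 1);
        return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
    }

    // For each probability, returns quantile(after) - quantile(before).
    public static double[] Shift(double[] before, double[] after, double[] probabilities)
    {
        var a = before.OrderBy(v => v).ToArray();
        var b = after.OrderBy(v => v).ToArray();
        return probabilities.Select(p => Quantile(b, p) - Quantile(a, p)).ToArray();
    }
}
```

With such a function, a change where only the left mode moves shows up as non-zero shifts for the lower quantiles and near-zero shifts for the upper ones, even when the mean and the median barely move.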

Normality is a myth

Wed, 09 Oct 2019 00:00:00 +0000

In many statistical papers, you can find the following phrase: “assuming that we have a normal distribution.” You have probably seen plots of the normal distribution density function in statistics textbooks; it looks like this:

The normal distribution is a pretty user-friendly mental model when we are trying to interpret the statistical metrics like mean and standard deviation. However, it may also be an insidious and misleading model when your distribution is not normal. There is a great sentence in the “Testing for normality” paper by R.C. Geary, 1947 (the quote was found here):

Normality is a myth; there never was, and never will be, a normal distribution.

I 100% agree with this statement. At least, if you are working with performance distributions (that are based on the multiple iterations of your benchmarks that measure the performance metrics of your applications), you should forget about normality. This is what a typical performance distribution looks like (I built the picture below based on a real benchmark that measures the load time of assemblies when we open the Orchard solution in Rider on Linux):

Implementation of an efficient algorithm for changepoint detection: ED-PELT

Mon, 07 Oct 2019 00:00:00 +0000

Changepoint detection is an important task that has a lot of applications. For example, I use it to detect changes in the Rider performance test suite. It’s very important to detect not only performance degradations, but any kind of performance change (e.g., the variance may increase, or a unimodal distribution may be split into several modes). You can see examples of such changes in the following picture (we change the color when a changepoint is detected):

Unfortunately, it’s pretty hard to write a reliable and fast algorithm for changepoint detection. Recently, I found a cool paper haynes2016 that describes the ED-PELT algorithm. It has O(N*log(N)) complexity and pretty good detection accuracy. The reference implementation can be used via the changepoint.np R package. However, I can’t use R on our build server, so I decided to write my own C# implementation.

A story about slow NuGet package browsing

Tue, 08 May 2018 00:00:00 +0000

In Rider, we have integration tests which interact with api.nuget.org. Also, we have an internal service which monitors the performance of these tests. Two days ago, I noticed that some of these tests sometimes run for too long. For example, nuget_NuGetTest_shouldUpgradeVersionForDotNetCore usually takes around 10 sec. However, in some cases, it takes around 110 sec, 210 sec, or 310 sec:

It looks very suspicious and increases the whole test suite duration. Also, our dashboard with performance degradations now contains only such tests, so some real degradations (which are introduced by the changes in our codebase) can go unnoticed. So, my colleagues and I decided to investigate it.

Cross-runtime .NET disassembly with BenchmarkDotNet

Tue, 10 Apr 2018 00:00:00 +0000

BenchmarkDotNet is a cool tool for benchmarking. It has a lot of useful features that help you with performance investigations. However, you can use these features even if you are not actually going to benchmark something. One of these features is DisassemblyDiagnoser. It shows you a disassembly listing of your code for all required runtimes. In this post, I will show you how to get disassembly listing for .NET Framework, .NET Core, and Mono with one click! You can do it with a very small code snippet like this:

[DryCoreJob, DryMonoJob, DryClrJob(Platform.X86)]
[DisassemblyDiagnoser]
public class IntroDisasm
{
 [Benchmark]
 public double Sum()
 {
 double res = 0;
 for (int i = 0; i < 64; i++)
 res += i;
 return res;
 }
}

BenchmarkDotNet v0.10.14

Mon, 09 Apr 2018 00:00:00 +0000

BenchmarkDotNet v0.10.14 has been released! This release includes:

Per-method parameterization (Read more)
Console histograms and multimodal distribution detection (Read more)
Many improvements for Mono disassembly support on Windows (A blog post is coming soon)
Many bugfixes

In the v0.10.14 scope, 8 issues were resolved and 11 pull requests were merged. This release includes 47 commits by 8 contributors.

BenchmarkDotNet v0.10.13

Fri, 02 Mar 2018 00:00:00 +0000

BenchmarkDotNet v0.10.13 has been released! This release includes:

Mono Support for DisassemblyDiagnoser: Now you can easily get an assembly listing not only on .NET Framework/.NET Core, but also on Mono. It works on Linux, macOS, and Windows (Windows requires installed cygwin with obj and as). (See #541)
Support ANY CoreFX and CoreCLR builds: BenchmarkDotNet allows the users to run their benchmarks against ANY CoreCLR and CoreFX builds. You can compare your local build vs MyGet feed or Debug vs Release or one version vs another. (See #651)
C# 7.2 support (See #643)
.NET 4.7.1 support (See 28aa94)
Support Visual Basic project files (.vbproj) targeting .NET Core (See #626)
DisassemblyDiagnoser now supports generic types (See #640)
Now it’s possible to benchmark both Mono and .NET Core from the same app (See #653)
Many bug fixes (See details below)

Analyzing distribution of Mono GC collections

Tue, 20 Feb 2018 00:00:00 +0000

Sometimes I want to understand the GC performance impact on an application quickly. I know that there are many powerful diagnostic tools and approaches, but I’m a fan of the “right tool for the job” idea. In simple cases, I prefer simple noninvasive approaches which provide a quick way to get an overview of the current situation (if everything is terrible, I always can switch to an advanced approach). Today I want to share with you my favorite way to quickly get statistics of GC pauses in Mono and generate nice plots like this:

BenchmarkDotNet v0.10.12

Mon, 15 Jan 2018 00:00:00 +0000

BenchmarkDotNet v0.10.12 has been released! This release includes:

Improved DisassemblyDiagnoser: BenchmarkDotNet contains an embedded disassembler so that it can print assembly code for all benchmarks; it’s not easy, but the disassembler evolves in every release.
Improved MemoryDiagnoser: it has a better precision level, and it takes less time to evaluate memory allocations in a benchmark.
New TailCallDiagnoser: now you get notifications when JIT applies the tail call optimizations to your methods.
Better environment info: when you share performance results, it’s very important to share information about your environment. The library generates the environment summary for you by default. Now it contains information about the number of physical CPUs, physical cores, and logical cores. If you run a benchmark on a virtual machine, you will get the name of the hypervisor (e.g., Hyper-V, VMware, or VirtualBox).
Better summary table: one of the greatest features of BenchmarkDotNet is the summary table. It shows all important information about results in a compact and understandable form. Now it has better customization options: you can display relative performance of different environments (e.g., compare .NET Framework and .NET Core) and group benchmarks by categories.
New GC settings: now we support NoAffinitize, HeapAffinitizeMask, HeapCount.
Other minor improvements and bug fixes

BenchmarkDotNet v0.10.10

Fri, 03 Nov 2017 00:00:00 +0000

BenchmarkDotNet v0.10.10 has been released! This release includes many new features like Disassembly Diagnoser, ParamsSources, .NET Core x86 support, Environment variables, and more!

Reflecting on performance testing

Tue, 01 Aug 2017 00:00:00 +0000

Performance is an important feature for many projects. Unfortunately, it’s an all too common situation when a developer accidentally spoils the performance adding some new code. After a series of such incidents, people often start to think about performance regression testing.

As developers, we write unit tests all the time. These tests check that our business logic works as designed and that new features don’t break existing code. It looks like a good idea to write some perf tests as well, which will verify that we don’t have any performance regressions.

Turns out this is harder than it sounds. A lot of developers don’t write perf tests at all. Some teams write perf tests, but almost all of them use their own infrastructure for analysis (which is not a bad thing in general because it’s usually designed for specific projects and requirements). There are a lot of books about test-driven development (TDD), but there are no books about performance-driven development (PDD). There are well-known libraries for unit testing (like xUnit/NUnit/MSTest for .NET), but there are almost no libraries for performance regression testing. Yeah, of course, there are some libraries which you can use. But there are no well-known, universally recognized libraries, approaches, and tools. Ask your colleagues about it: some of them will give you different answers, and the rest will start Googling it.

There is no common understanding of what performance testing should look like. This situation exists because it’s really hard to develop a solution which solves all problems for all kinds of projects. However, it doesn’t mean that we shouldn’t try. We should try, and we should share our experience and discuss best practices.

Measuring Performance Improvements in .NET Core with BenchmarkDotNet (Part 1)

Fri, 09 Jun 2017 00:00:00 +0000

A few days ago Stephen Toub published a great post at the Microsoft .NET Blog: Performance Improvements in .NET Core. He showed some significant performance changes in .NET Core 2.0 Preview 1 (compared with .NET Framework 4.7). .NET Core uses RyuJIT for generating assembly code. When I first tried RyuJIT (e.g., CTP2, CTP5, 2014), I wasn’t excited about it: the preview versions had some bugs, and it worked slowly on my applications. However, the idea of a rethought and open-source JIT compiler was a huge step forward and an investment in the future. RyuJIT has been developed very actively in recent years: not only by Microsoft but also with the help of the community. I’m still not happy about the generated assembly code in some methods, but I have to admit that RyuJIT (as a part of .NET Core) works pretty well today: it shows a good performance level not only on artificial benchmarks but also on real user code. Also, there are a lot of changes not only in dotnet/coreclr (the .NET Core runtime), but also in dotnet/corefx (the .NET Core foundational libraries). It’s very nice to watch how the community helps to optimize widely used classes which have not changed for years.

Now let’s talk about benchmarks. For the demonstration, Stephen wrote a set of handwritten benchmarks. A few people (in comments and on HackerNews) asked about BenchmarkDotNet regarding these samples (as a better tool for performance measurements). So, I decided to try all these benchmarks on BenchmarkDotNet.

In this post, we will discuss how BenchmarkDotNet can help in such performance investigations, which benchmarking approaches are better to use (and when), and how we can improve these measurements.

BenchmarkDotNet v0.10.7

Mon, 05 Jun 2017 00:00:00 +0000

BenchmarkDotNet v0.10.7 has been released. In this post, I will briefly cover the following features:

LINQPad support
Filters and categories
Updated Setup/Cleanup attributes
Better Value Types support
Building Sources on Linux

65535 interfaces ought to be enough for anybody

Tue, 14 Feb 2017 00:00:00 +0000

It was a bright, sunny morning. There were no signs of trouble. I came to work, opened Slack, and received many messages from my coworkers about failed tests.

After a few hours of investigation, the situation became clear:

I’m responsible for the unit tests subsystem in Rider, and only tests from this subsystem were failing.
I didn’t commit anything to the subsystem for a week because I worked with a local branch. Other developers also didn’t touch this code.
The unit tests subsystem is completely independent. It’s hard to imagine a situation when only the corresponding tests would fail, thousands of other tests pass, and there are no changes in the source code.
git blame helped to find the “bad commit”: it didn’t include anything suspicious, only a few additional classes in other subsystems.
Only tests on Linux and macOS were red. On Windows, everything was ok.
Stacktraces in failed tests were completely random. We had a new stack trace in each test from different subsystems. There was no connection between these stack traces, unit tests source code, and the changes in the “bad commit.” There was no clue where we should look for a problem.

So, what was special about this “bad commit”? Spoiler: after these changes, we sometimes have more than 65535 interface implementations at runtime.

A bug story about named mutex on Mono

Mon, 13 Feb 2017 00:00:00 +0000

When you write some multithreading magic on .NET, you can use a cool synchronization primitive called Mutex. You can also make it named (and share the mutex between processes), which works perfectly on Windows:

var mutex = new Mutex(false, "Global\\MyNamedMutex");

However, today .NET is cross-platform, so this code should work on any operating system. What will happen if you use a named mutex on Linux or macOS with the help of Mono or CoreCLR? Is it possible to create some tricky bug based on this case? Of course, it is. Today I want to tell you a story about such a bug in Rider which was a headache for several weeks.

InvalidDataException in Process.GetProcesses

Fri, 10 Feb 2017 00:00:00 +0000

Consider the following program:

public static void Main(string[] args)
{
 try
 {
 Process.GetProcesses();
 }
 catch (Exception e)
 {
 Console.WriteLine(e);
 }
}

It seems that all exceptions should be caught. However, sometimes, I had the following exception on Linux with dotnet cli-1.0.0-preview2:

$ dotnet run
System.IO.InvalidDataException: Found invalid data while decoding.
 at System.IO.StringParser.ParseNextChar()
 at Interop.procfs.TryParseStatFile(String statFilePath, ParsedStat& result, ReusableTextReader reusableReader)
 at System.Diagnostics.ProcessManager.CreateProcessInfo(ParsedStat procFsStat, ReusableTextReader reusableReader)
 at System.Diagnostics.ProcessManager.CreateProcessInfo(Int32 pid, ReusableTextReader reusableReader)
 at System.Diagnostics.ProcessManager.GetProcessInfos(String machineName)
 at System.Diagnostics.Process.GetProcesses(String machineName)
 at System.Diagnostics.Process.GetProcesses()
 at DotNetCoreConsoleApplication.Program.Main(String[] args) in /home/akinshin/Program.cs:line 12

How is that possible?

Why is NuGet search in Rider so fast?

Wed, 08 Feb 2017 00:00:00 +0000

I’m the guy who develops the NuGet manager in Rider. It’s not ready yet, there are some bugs here and there, but it already works pretty well. The feature which I am most proud of is smart and fast search:

Today I want to share with you some technical details about how it was implemented.

NuGet2 and a DirectorySeparatorChar bug

Mon, 06 Feb 2017 00:00:00 +0000

In Rider, we care a lot about performance. I like to improve the application responsiveness and do interesting optimizations all the time. Rider is already well-optimized, and it’s often hard to make significant performance improvements, so usually I do micro-optimizations which do not have a very big impact on the whole application. However, sometimes it’s possible to improve the speed of a feature 100 times with just a few lines of code.

Rider is based on ReSharper, so we have a lot of cool features out of the box. One of these features is Solution-Wide Analysis which lets you constantly keep track of issues in your solution. Sometimes, solution-wide analysis takes a lot of time to run because there are many files which should be analyzed. Of course, it works super fast on small projects.

Let’s talk about a performance bug (#RIDER-3742) that we recently had.

Repro: Open Rider, create a new “ASP .NET MVC Application”, enable solution-wide analysis.
Expected: The analysis should take 1 second.
Actual: The analysis takes 1 second on Windows and 2 minutes on Linux and macOS.

Performance exercise: Division

Mon, 26 Dec 2016 00:00:00 +0000

In the previous post, we discussed the performance space of the minimum function which was implemented via a simple ternary operator and with the help of bit magic. Now we continue to talk about performance and bit hacks. In particular, we will divide a positive number by three:

uint Div3Simple(uint n) => n / 3;
uint Div3BitHacks(uint n) => (uint)((n * (ulong)0xAAAAAAAB) >> 33);

As usual, it’s hard to say which method is faster in advance because the performance depends on the environment. Here are some interesting results:

	Simple	BitHacks
LegacyJIT-x86	≈8.3ns	≈2.6ns
LegacyJIT-x64	≈2.6ns	≈1.7ns
RyuJIT-x64	≈6.9ns	≈1.5ns
Mono4.6.2-x86	≈8.5ns	≈14.4ns
Mono4.6.2-x64	≈8.3ns	≈2.8ns
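Before comparing timings, it is worth checking that both methods return the same values. The sketch below samples the uint range and looks for the first input where the two implementations disagree; the class name Div3Check, the helper FirstMismatch, and the sampling step are hypothetical choices made for illustration.

```csharp
using System;

public static class Div3Check
{
    public static uint Div3Simple(uint n) => n / 3;

    // 0xAAAAAAAB is (2^33 + 1) / 3, so multiplying by it and shifting right
    // by 33 bits emulates the division; the product always fits into a ulong.
    public static uint Div3BitHacks(uint n) => (uint)((n * (ulong)0xAAAAAAAB) >> 33);

    // Returns the first sampled input where the two methods disagree, or -1 if none is found.
    public static long FirstMismatch(uint step)
    {
        for (ulong n = 0; n <= uint.MaxValue; n += step)
        {
            if (Div3Simple((uint)n) != Div3BitHacks((uint)n))
                return (long)n;
        }
        return -1;
    }
}
```

A call such as Div3Check.FirstMismatch(65537) walks through roughly 65 thousand inputs, which is enough for a quick sanity check without scanning all 2^32 values.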

Performance exercise: Minimum

Tue, 20 Dec 2016 00:00:00 +0000

Performance is tricky, especially if you are working with very fast operations. In today’s benchmarking exercise, we will try to measure the performance of two simple methods which calculate the minimum of two numbers. Sounds easy? Ok, let’s do it, here are our guinea pigs for today:

int MinTernary(int x, int y) => x < y ? x : y;
int MinBitHacks(int x, int y) => x & ((x - y) >> 31) | y & (~(x - y) >> 31);

And here are some results:

	Random		Const
	Ternary	BitHacks	Ternary	BitHacks
LegacyJIT-x86	≈643µs	≈227µs	≈160µs	≈226µs
LegacyJIT-x64	≈450µs	≈123µs	≈68µs	≈123µs
RyuJIT-x64	≈594µs	≈241µs	≈180µs	≈241µs
Mono-x64	≈203µs	≈283µs	≈204µs	≈282µs

What’s going on here? Let’s discuss it in detail.
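One detail to keep in mind while reading these numbers: the two methods agree only when x - y does not overflow. The sketch below reproduces both methods (wrapped in a hypothetical MinCheck class, with unchecked added explicitly so that the overflow behavior does not depend on compiler settings) to make this easy to verify.

```csharp
using System;

public static class MinCheck
{
    public static int MinTernary(int x, int y) => x < y ? x : y;

    // (x - y) >> 31 is all ones when x - y is negative and zero otherwise,
    // so exactly one of the two masks selects its operand.
    public static int MinBitHacks(int x, int y) =>
        unchecked(x & ((x - y) >> 31) | y & (~(x - y) >> 31));
}
```

For ordinary inputs such as (3, 7) or (-5, 2), both methods return the same value. For (int.MinValue, 1), the subtraction wraps around to a positive number, so MinBitHacks returns 1 while MinTernary returns int.MinValue.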

Stopwatch under the hood

Fri, 09 Sep 2016 00:00:00 +0000

Update: You can find an updated and significantly improved version of this post in my book “Pro .NET Benchmarking”.

In the previous post, we discussed DateTime. This structure can be used in situations when you don’t need a good level of precision. If you want to do high-precision time measurements, you need a better tool because DateTime has a small resolution and a big latency. Also, time is tricky, you can create wonderful bugs if you don’t understand how it works (see Falsehoods programmers believe about time and More falsehoods programmers believe about time).

In this post, we will briefly talk about the Stopwatch class:

Which kind of hardware timers could be a base for Stopwatch
High precision timestamp API on Windows and Linux
Latency and Resolution of Stopwatch in different environments
Common pitfalls: which kind of problems could we get trying to measure small time intervals

If you are not a .NET developer, you can also find a lot of useful information in this post: mainly we will discuss low-level details of high-resolution timestamping (probably your favorite language also uses the same API). As usual, you can also find useful links for further reading.
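To get a first feeling for the numbers involved, the following sketch reads the tick frequency reported by Stopwatch and estimates the smallest step that can be observed between two consecutive timestamps. The class name StopwatchProbe and its members are hypothetical, and the measured step is a rough estimate that mixes resolution and call latency rather than a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;

public static class StopwatchProbe
{
    // Duration of one Stopwatch tick in nanoseconds, derived from the reported frequency.
    public static double NanosecondsPerTick => 1e9 / Stopwatch.Frequency;

    // Smallest positive difference (in ticks) seen between consecutive timestamps.
    // Returns long.MaxValue if no positive difference was observed.
    public static long MinObservedStepTicks(int attempts)
    {
        long min = long.MaxValue;
        long previous = Stopwatch.GetTimestamp();
        for (int i = 0; i < attempts; i++)
        {
            long current = Stopwatch.GetTimestamp();
            long delta = current - previous;
            if (delta > 0 && delta < min)
                min = delta;
            previous = current;
        }
        return min;
    }
}
```

Printing Stopwatch.IsHighResolution, StopwatchProbe.NanosecondsPerTick, and StopwatchProbe.MinObservedStepTicks(100000) on different machines is a quick way to see how much these values depend on the environment.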

DateTime under the hood

Fri, 19 Aug 2016 00:00:00 +0000

Update: You can find an updated and significantly improved version of this post in my book “Pro .NET Benchmarking”.

DateTime is a widely used .NET type. A lot of developers use it all the time, but not all of them really know how it works. In this post, I discuss DateTime.UtcNow: how it’s implemented, what the latency and the resolution of DateTime are on Windows and Linux, how the resolution can be changed, and how it can affect your application. This post is an overview, so you probably will not see super detailed explanations of some topics, but you will find a lot of useful links for further reading.

LegacyJIT-x86 and first method call

Mon, 04 Apr 2016 00:00:00 +0000

Today I will tell you about one of my favorite benchmarks (this method doesn’t return a useful value, we need it only as an example):

[Benchmark]
public string Sum()
{
 double a = 1, b = 1;
 var sw = new Stopwatch();
 for (int i = 0; i < 10001; i++)
 a = a + b;
 return string.Format("{0}{1}", a, sw.ElapsedMilliseconds);
}

An interesting fact: if you call Stopwatch.GetTimestamp() before the first call of the Sum method, Sum becomes several times faster (works only with LegacyJIT-x86).

Visual Studio and ProjectTypeGuids.cs

Sat, 27 Feb 2016 00:00:00 +0000

This is a story about how I spent a few hours trying to open a project in Visual Studio. The other day, I was going to do some work. I pulled the latest commits from a repo, opened Visual Studio, and prepared to start coding. However, one of the projects in my solution failed to open with a strange message:

error : The operation could not be completed.

In the Solution Explorer, I had “load failed” as a project status and the following message instead of the file tree: “The project requires user input. Reload the project for more information.” Hmm, ok, I reloaded the project and got a few more errors:

error : The operation could not be completed.
error : The operation could not be completed.

Blittable types

Thu, 26 Nov 2015 00:00:00 +0000

Challenge of the day: what will the following code display?

[StructLayout(LayoutKind.Explicit)]
public struct UInt128
{
 [FieldOffset(0)]
 public ulong Value1;
 [FieldOffset(8)]
 public ulong Value2;
}
[StructLayout(LayoutKind.Sequential)]
public struct MyStruct
{
 public UInt128 UInt128;
 public char Char;
}
class Program
{
 public static unsafe void Main()
 {
 var myStruct = new MyStruct();
 var baseAddress = (int)&myStruct;
 var uInt128Adress = (int)&myStruct.UInt128;
 Console.WriteLine(uInt128Adress - baseAddress);
 Console.WriteLine(Marshal.OffsetOf(typeof(MyStruct), "UInt128"));
 }
}

A hint: two zeros or any other two identical values are wrong answers in the general case. The following table shows the console output on different runtimes:

	MS.NET-x86	MS.NET-x64	Mono
uInt128Adress - baseAddress	4	8	0
Marshal.OffsetOf(typeof(MyStruct), "UInt128")	0	0	0

If you want to know why it happens, you probably should learn some useful information about blittable types.

RyuJIT RC and constant folding

Tue, 12 May 2015 00:00:00 +0000

Update: The below results are valid for the release version of RyuJIT in .NET Framework 4.6 without updates.

The challenge of the day: which method is faster?

public double Sqrt13()
{
 return Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + Math.Sqrt(4) + Math.Sqrt(5) + 
 Math.Sqrt(6) + Math.Sqrt(7) + Math.Sqrt(8) + Math.Sqrt(9) + Math.Sqrt(10) + 
 Math.Sqrt(11) + Math.Sqrt(12) + Math.Sqrt(13);
}
public double Sqrt14()
{
 return Math.Sqrt(1) + Math.Sqrt(2) + Math.Sqrt(3) + Math.Sqrt(4) + Math.Sqrt(5) + 
 Math.Sqrt(6) + Math.Sqrt(7) + Math.Sqrt(8) + Math.Sqrt(9) + Math.Sqrt(10) + 
 Math.Sqrt(11) + Math.Sqrt(12) + Math.Sqrt(13) + Math.Sqrt(14);
}

I have measured the performance of the methods with the help of BenchmarkDotNet for RyuJIT RC (a part of .NET Framework 4.6 RC) and received the following results:

// BenchmarkDotNet=v0.7.4.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4702MQ CPU @ 2.20GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.0, Arch=64-bit [RyuJIT]
Common: Type=Math_DoubleSqrtAvx Mode=Throughput Platform=X64 Jit=RyuJit .NET=Current 

 Method | AvrTime | StdDev | op/s |
------- |--------- |---------- |------------- |
 Sqrt13 | 55.40 ns | 0.571 ns | 18050993.06 |
 Sqrt14 | 1.43 ns | 0.0224 ns | 697125029.18 |

How so? If I add one more Math.Sqrt to the expression, the method starts to work 40 times faster! Let’s examine the situation.

Unrolling of small loops in different JIT versions

Mon, 02 Mar 2015 00:00:00 +0000

Challenge of the day: what will the following code display?

struct Point
{
 public int X;
 public int Y;
}
static void Print(Point p)
{
 Console.WriteLine(p.X + " " + p.Y);
}
static void Main()
{
 var p = new Point();
 for (p.X = 0; p.X < 2; p.X++)
 Print(p);
}

The right answer: it depends. There is a bug in CLR2 JIT-x86 which spoils this wonderful program. This story is about an optimization called unrolling of small loops. This is a very interesting topic; let’s discuss it in detail.

RyuJIT CTP5 and loop unrolling

Sun, 01 Mar 2015 00:00:00 +0000

RyuJIT will be available soon. It is a next-generation JIT compiler for .NET applications. Microsoft likes to tell us about the benefits of using SIMD and reducing JIT compilation time. But what about the basic code optimizations that a compiler usually applies? Today we talk about the loop unrolling (unwinding) optimization. In general, in this type of code optimization, the code

for (int i = 0; i < 1024; i++)
 Foo(i);

is transformed into

for (int i = 0; i < 1024; i += 4)
{
 Foo(i);
 Foo(i + 1);
 Foo(i + 2);
 Foo(i + 3);
}

Such an approach can significantly increase the performance of your code. So, what about loop unrolling in .NET?

JIT version determining in runtime

Sat, 28 Feb 2015 00:00:00 +0000

Sometimes I want to know which JIT compiler version is used in my little C# experiments. Clearly, it is possible to determine the version in advance based on the environment. However, sometimes I want to know it at runtime to execute code specific to the current JIT compiler. More formally, I want to get a value of the following enum:

public enum JitVersion
{
 Mono, MsX86, MsX64, RyuJit
}

It is easy to detect Mono by the existence of the Mono.Runtime class. Otherwise, we can assume that we are working with a Microsoft JIT implementation. It is easy to detect JIT-x86 with the help of IntPtr.Size == 4. The challenge is to distinguish JIT-x64 from RyuJIT. Next, I will show how you can do it with the help of the bug from my previous post.
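The two easy checks described above can be sketched as follows. The class name RuntimeProbe and its members are hypothetical, and the sketch deliberately stops before the hard part: it does not try to tell JIT-x64 and RyuJIT apart.

```csharp
using System;

public static class RuntimeProbe
{
    // Mono exposes a Mono.Runtime type; its absence suggests a Microsoft runtime.
    public static bool IsMono => Type.GetType("Mono.Runtime") != null;

    // A 4-byte pointer means that the process runs as 32-bit code.
    public static bool Is32BitProcess => IntPtr.Size == 4;

    // Coarse classification only: the x64 case is left undecided here.
    public static string Describe()
    {
        if (IsMono) return "Mono";
        if (Is32BitProcess) return "MsX86";
        return "MsX64 or RyuJit";
    }
}
```

Type.GetType returns null for a type name that cannot be resolved, so the Mono check does not throw on other runtimes.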

A bug story about JIT-x64

Fri, 27 Feb 2015 00:00:00 +0000

Can you say, what will the following code display for step=1?

public void Foo(int step)
{
 for (int i = 0; i < step; i++)
 {
 bar = i + 10;
 for (int j = 0; j < 2 * step; j += step)
 Console.WriteLine(j + 10);
 }
}

If you are thinking about specific numbers, you are wrong. The right answer: it depends. As the post title suggests, the program can behave strangely on x64.

A story about JIT-x86 inlining and starg

Thu, 26 Feb 2015 00:00:00 +0000

Sometimes you can learn a lot by reading the .NET source code. Let’s open the source code of a Decimal constructor from .NET Reference Source (mscorlib/system/decimal.cs,158):

// Constructs a Decimal from an integer value.
//
public Decimal(int value) {
 // JIT today can't inline methods that contains "starg" opcode.
 // For more details, see DevDiv Bugs 81184: x86 JIT CQ: Removing the inline striction of "starg".
 int value_copy = value;
 if (value_copy >= 0) {
 flags = 0;
 }
 else {
 flags = SignMask;
 value_copy = -value_copy;
 }
 lo = value_copy;
 mid = 0;
 hi = 0;
}

The comment states that JIT-x86 can’t apply the inlining optimization for a method that contains the starg IL-opcode. Curious, isn’t it?

About UTF-8 conversions in Mono

Mon, 10 Nov 2014 00:00:00 +0000

This post is a logical continuation of Jon Skeet’s blog post “When is a string not a string?”. Jon showed very interesting things about the behavior of ill-formed Unicode strings in .NET. I wondered how similar examples would work on Mono, and I got very interesting results.

Experiment 1: Compilation

Let’s take Jon’s code with a small modification: we will just add a null check for text in DumpString:

using System;
using System.ComponentModel;
using System.Text;
using System.Linq;
[Description(Value)]
class Test
{
 const string Value = "X\ud800Y";
 static void Main()
 {
 var description = (DescriptionAttribute)typeof(Test).
 GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
 DumpString("Attribute", description.Description);
 DumpString("Constant", Value);
 }
 static void DumpString(string name, string text)
 {
 Console.Write("{0}: ", name);
 if (text != null)
 {
 var utf16 = text.Select(c => ((uint) c).ToString("x4"));
 Console.WriteLine(string.Join(" ", utf16));
 }
 else
 Console.WriteLine("null");
 }
}

Happy Monday!

Mon, 11 Aug 2014 00:00:00 +0000

Today I will tell you a story about one tricky bug. The bug is tricky because it doesn’t allow me to debug my application on Mondays. I’m serious right now: the debug mode doesn’t work every Monday. Furthermore, the bug literally tells me: “Happy Monday!”.

So, the story. It was a wonderful Sunday evening, no signs of trouble. We planned to release a new version of our software (a minor one, but it includes some useful features). Midnight on the clock. Suddenly, I came up with the idea that we have a minor bug that should be fixed. It requires a few lines of code and 10 minutes to do it. And I decided to write the needed logic before going to sleep. I open Visual Studio, launch the build, and wait. But something goes wrong, because I get the following error:

Error connecting to the pipe server.

Hmm. It is a strange error.

To Refactor Or Not To Refactor?

Sat, 19 Jul 2014 00:00:00 +0000

I like refactoring. No, I love refactoring. No, not even like this. I awfully love refactoring.

I hate bad code and bad architecture. I feel quite creepy when I design a new feature and the nearby class contains an absolute mess. I just can’t look at the sad-looking variables. Sometimes before falling asleep I close my eyes and imagine what could be improved in the project. Sometimes I wake up at 3:00AM and go to my computer to improve something. I want to have not just code, but a masterpiece that is pleasant to look at, that is pleasant to work with at any stage of the project.

If you share my feelings at least a little bit, we have something to talk about. The thing is that over time something inside me began to hint that it’s a bad idea to refactor all code, everywhere and all the time. Don’t get me wrong: code should be good (even better when it’s ideal), but in real life it’s not reasonable to improve code constantly. I formed some rules about the timeliness of refactoring. If I am itching to improve something, I look at these rules and think “Is this the moment when I need to refactor the code?” So, let’s talk about when refactoring is necessary and when it’s inappropriate.

Strange behavior of FindElementsInHostCoordinates in WinRT

Tue, 29 Apr 2014 00:00:00 +0000

Silverlight features a splendid method: VisualTreeHelper.FindElementsInHostCoordinates. It performs a hit test: given a point or a rectangle, it finds all objects in a visual subtree that intersect this point or rectangle. Formally, the same method, VisualTreeHelper.FindElementsInHostCoordinates, is available in WinRT. It looks the same, but there is a little nuance: it works differently in different versions of the platform. So, let’s see what’s going on.

About System.Drawing.Color and operator ==

Fri, 21 Feb 2014 00:00:00 +0000

The == operator, which allows easy comparison of objects, is overloaded for many standard structures in .NET. Unfortunately, not every developer knows what is actually compared when this wonderful operator is used. This brief blog post shows the comparison logic using System.Drawing.Color as an example. What do you think the following code will print?

var redName = Color.Red;
var redArgb = Color.FromArgb(255, 255, 0, 0);
Console.WriteLine(redName == redArgb);
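
As a minimal sketch of the difference (assuming the standard System.Drawing.Color implementation, where == compares more than the ARGB channels, including state such as whether the color was created as a named color), the channel values alone can be compared via ToArgb(). The names ColorComparisonSketch and SameArgb are hypothetical and introduced only for this illustration:

```csharp
using System;
using System.Drawing;

static class ColorComparisonSketch
{
    // Compares only the ARGB channel values, ignoring how the color was created.
    public static bool SameArgb(Color a, Color b)
    {
        return a.ToArgb() == b.ToArgb();
    }

    static void Main()
    {
        var redName = Color.Red;
        var redArgb = Color.FromArgb(255, 255, 0, 0);

        Console.WriteLine(redName == redArgb);         // False
        Console.WriteLine(SameArgb(redName, redArgb)); // True
    }
}
```

If only the visible color matters, comparing the ToArgb() values is usually the safer choice.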

Setting up build configuration in .NET

Sat, 08 Feb 2014 00:00:00 +0000

When you create a new project in Visual Studio, you get two default build configurations: Debug and Release. That is enough for most small projects, but sometimes you need to extend them with additional configurations. It’s fine if you need to add just a couple of new settings, but what if there are dozens of them? And what if your solution contains 20 projects that all need these configurations set up? In this case, it becomes quite difficult to manage and modify build parameters.

In this article, we will review a way to simplify this process by shortening the description of the build configurations.

Jon Skeet's Quiz

Sun, 03 Nov 2013 00:00:00 +0000

Jon Skeet was once asked to give three questions to check how well you know C#. He asked the following questions:

Q1. What constructor call can you write such that this prints True (at least on the Microsoft .NET implementation)?

object x = new /* fill in code here */;
object y = new /* fill in code here */;
Console.WriteLine(x == y);

Note that it’s just a constructor call, and you can’t change the type of the variables.

Q2. How can you make this code compile such that it calls three different method overloads?

void Foo()
{
    EvilMethod<string>();
    EvilMethod<int>();
    EvilMethod<int?>();
}

Q3. With a local variable (so no changing the variable value cunningly), how can you make this code fail on the second line?

string text = x.ToString(); // No exception
Type type = x.GetType(); // Bang!

These questions seemed interesting to me, so I decided to discuss the solutions.

Perfect code and real projects

Wed, 28 Aug 2013 00:00:00 +0000

I’ve got a problem: I am a perfectionist. I like perfect code. Writing it is not only the correct way to develop applications but also a sign of real proficiency. I enjoy reading a good listing no less than reading a good book. Designing the architecture of a big project is no simpler than designing the architecture of a big building, and when the work is done well, the result is no less beautiful. I am sometimes fascinated by how elegantly patterns are intertwined in a perfect software system. I am delighted by the attention to detail when every method is so simple and understandable that it could serve as a classic sample of perfect code.

Unfortunately, this splendor is ruined by stern reality and real projects. In a production project, users don’t care how beautiful your code is or how wonderful your architecture is; they care about having a properly working product. I still think that you should always strive to write good code, but without getting stuck on this idea.

After reading various holy-war discussions about the correct approaches to writing code, I noticed a trend: everyone tries to apply these approaches not to programming in general, but to their personal development experience and their own projects. Many developers don’t understand that a good practice is not an absolute rule that must be followed in 100% of scenarios. It is just advice on what to do in most cases. You can find a dozen scenarios where the practice won’t work at all. That doesn’t mean the approach is bad; it’s just being used in the wrong environment.

There is another problem: some developers are not as good as they think. I often see the following situation: such a developer picks up an idea from a big article about perfect code (without getting deep into the details), starts to use it everywhere, and the code becomes even worse.

To Add Comments or Not to Add?

Wed, 28 Aug 2013 00:00:00 +0000

A really good comment is the one you managed to avoid. (c) Uncle Bob

Lately, I’ve been feeling really tired of heated discussions about whether it’s necessary to add comments to code. As a rule, on one side there are self-confident juniors with an indisputable statement like “Why not comment it? It will be unreadable without comments!” On the other side are experienced seniors, who understand that if it’s possible to go without comments, then “You better, damn it, do it this way!” Probably, many developers have had a craving for comments since their student days, when professors made them comment every line of code “to make the student better understand it”. Real projects shouldn’t contain a lot of comments that only spoil the code. I’m not arguing for avoiding comments altogether, but if you managed to write code that doesn’t need comments, you can consider it a small victory. I would like to refer you to some good books that helped form my position. I like and respect these authors and completely share their opinion.

Unexpected area to collect garbage in .NET

Thu, 08 Aug 2013 00:00:00 +0000

The .NET Framework provides an intelligent garbage collector that saves us the trouble of manual memory management. In 95% of cases, you can forget about memory and related issues. But the remaining 5% involve specific aspects such as unmanaged resources, very large objects, and so on. It’s better to know how garbage is collected; otherwise, you can get surprises.

Do you think the GC is able to collect an object before its last method has completed? It turns out it can. For this, the application has to run in release mode without a debugger attached; in this case, the JIT compiler performs optimizations that make this situation possible. Of course, the JIT compiler does it only when the remaining method body doesn’t contain references to the object or its fields. It seems like a very harmless optimization, but it can lead to problems if you work with unmanaged resources: the object can be collected (and its finalizer run) before the operation on the unmanaged resource is finished. Most likely, that will result in an application crash.
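
One common way to guard against this is GC.KeepAlive, which keeps the object reachable up to the point of the call. The following is only a minimal sketch under stated assumptions: the class NativeResourceSketch and its members are hypothetical, and the finalizer merely sets a flag instead of releasing a real unmanaged handle.

```csharp
using System;

// Hypothetical stand-in for a wrapper around an unmanaged resource.
sealed class NativeResourceSketch
{
    // Set by the finalizer so we can observe whether it has run.
    public static bool FinalizerRan;

    ~NativeResourceSketch()
    {
        // Real code would release the unmanaged handle here.
        FinalizerRan = true;
    }

    // Returns true if the finalizer ran while this method was still executing.
    public bool LongOperation()
    {
        // Imagine unmanaged work here that no longer touches 'this'.
        GC.Collect();
        GC.WaitForPendingFinalizers();
        bool finalizedDuringCall = FinalizerRan;

        // Keeps 'this' reachable up to this point, so the finalizer
        // cannot run before the work above has finished.
        GC.KeepAlive(this);
        return finalizedDuringCall;
    }
}

static class GcSketchProgram
{
    static void Main()
    {
        Console.WriteLine(new NativeResourceSketch().LongOperation()); // False
    }
}
```

Without the GC.KeepAlive call, an optimized build running without a debugger may report that the finalizer ran during the call; the exact behavior depends on the runtime and the JIT.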

Unobviousness in use of C# closures

Wed, 07 Aug 2013 00:00:00 +0000

C# gives us the ability to use closures. This is a powerful tool that allows anonymous methods and lambda functions to capture unbound variables from their lexical scope. Many programmers in the .NET world like using closures very much, but only a few of them understand how they really work. Let’s start with a simple sample:

public void Run()
{
    int e = 1;
    Foo(x => x + e);
}

Nothing complicated happens here: we just capture a local variable e in a lambda that is passed to some Foo method. Let’s see how the compiler expands this construction.

public void Run()
{
    DisplayClass c = new DisplayClass();
    c.e = 1;
    Foo(c.Action);
}
private sealed class DisplayClass
{
    public int e;
    public int Action(int x)
    {
        return x + e;
    }
}
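
One consequence of this expansion is that the lambda captures the variable itself, not its value at the moment of creation: both the method and the lambda work with the same field of the generated class. A minimal sketch (the names ClosureSketch, Demo, and addE are hypothetical, introduced only for this illustration):

```csharp
using System;

static class ClosureSketch
{
    public static int Demo()
    {
        int e = 1;
        Func<int, int> addE = x => x + e; // captures the variable e, not the value 1
        e = 10;                           // writes to the same storage the lambda reads
        return addE(1);
    }

    static void Main()
    {
        Console.WriteLine(Demo()); // 11
    }
}
```

The result is 11 rather than 2 because the assignment e = 10 happens before the lambda is invoked.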

Wrapping C# class for use in COM

Mon, 03 Jun 2013 00:00:00 +0000

Suppose we have a C# class that does something useful, for example:

public class Calculator
{
    public int Sum(int a, int b)
    {
        return a + b;
    }
}

Let’s create a COM interface for this class to make its functionality available from other environments. At the end, we will see how this class is used from Delphi.
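
A rough sketch of what such a wrapper might look like, using the standard attributes from System.Runtime.InteropServices. This is an assumption about the general shape, not necessarily the approach taken later in the post: the ICalculator interface name and both GUIDs are placeholders made up for illustration, and registration of the assembly is not shown.

```csharp
using System;
using System.Runtime.InteropServices;

// Placeholder GUID invented for this sketch; generate your own for real code.
[ComVisible(true)]
[Guid("6F1C2B7E-3D4A-4E5B-9A1C-0D2E3F4A5B6C")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface ICalculator
{
    int Sum(int a, int b);
}

// Placeholder GUID invented for this sketch; generate your own for real code.
[ComVisible(true)]
[Guid("A3B4C5D6-E7F8-4901-8234-56789ABCDEF0")]
[ClassInterface(ClassInterfaceType.None)] // expose only the explicit interface
public class Calculator : ICalculator
{
    public int Sum(int a, int b)
    {
        return a + b;
    }
}

static class ComSketchProgram
{
    static void Main()
    {
        // Managed-side check; a COM client would obtain ICalculator instead.
        ICalculator calculator = new Calculator();
        Console.WriteLine(calculator.Sum(2, 3)); // 5
    }
}
```

Declaring an explicit interface and setting ClassInterfaceType.None keeps the COM-visible surface limited to the members listed in the interface.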