Misleading skewness



Skewness is a commonly used measure of the asymmetry of a probability distribution. A typical skewness interpretation comes down to an image like this:


It looks extremely simple: from the sign of the skewness, we get an idea of the distribution shape and the arrangement of the mean and the median. Unfortunately, it doesn’t always work as expected. A skewness estimation can be a highly misleading metric (even more misleading than the standard deviation). In this post, I discuss four sources of this misleadingness:

  • “Skewness” is a generic term; it has multiple definitions. When a skewness value is presented, you can’t always guess the underlying equation without additional details.
  • Skewness is “designed” for unimodal distributions; it’s meaningless in the case of multimodality.
  • Most default skewness definitions are not robust: a single outlier could completely distort the skewness value.
  • We can’t draw conclusions about the locations of the mean and the median based on the skewness sign alone.
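
The first and last points are easy to demonstrate numerically. Here is a small Python sketch with a made-up sample on which two common skewness definitions disagree in sign (the sample and function names are illustrative):

```python
import statistics

def moment_skewness(xs):
    # Pearson's moment coefficient of skewness: m3 / m2^(3/2)
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def nonparametric_skew(xs):
    # Nonparametric skew: (mean - median) / standard deviation
    mean = sum(xs) / len(xs)
    return (mean - statistics.median(xs)) / statistics.pstdev(xs)

# A made-up sample where the two definitions disagree in sign:
# the single large value dominates the third moment, so the moment
# skewness is positive, yet the mean (5.625) lies below the median (6).
sample = [1, 1, 6, 6, 6, 6, 6, 13]
```

For this sample, the moment skewness is positive while the nonparametric skew is negative, so a reported "skewness" value tells you nothing until you know which definition produced it.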

Read more


Greenwald-Khanna quantile estimator



The Greenwald-Khanna quantile estimator is a classic sequential quantile estimator which has the following features:

  • It allows estimating quantiles with respect to the given precision \(\epsilon\).
  • It requires \(O(\frac{1}{\epsilon} \log(\epsilon N))\) memory in the worst case.
  • It doesn’t require knowledge of the total number of elements in the sequence and the positions of the requested quantiles.

In this post, I briefly explain the basic idea of the underlying data structure, and share a copy-pastable C# implementation. At the end of the post, I discuss some important implementation decisions that are unclear from the original paper, but heavily affect the estimator accuracy.
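
The post’s reference implementation is in C#; to give a feeling for the data structure, here is a deliberately simplified Python sketch. The compression schedule and the query rule follow my reading of the paper, so treat the details as assumptions rather than the reference implementation:

```python
import math

class GKQuantileEstimator:
    # Simplified Greenwald-Khanna summary: a sorted list of tuples
    # [value, g, delta], where g is the rank gap to the previous tuple
    # and delta bounds the rank uncertainty of this tuple.
    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.tuples = []

    def add(self, v):
        # Periodically merge tuples to keep the summary small
        if self.n > 0 and self.n % int(1.0 / self.eps) == 0:
            self._compress()
        i = 0  # linear scan; a production version would use binary search
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        if i == 0 or i == len(self.tuples):
            delta = 0  # a new minimum or maximum is known exactly
        else:
            delta = math.floor(2 * self.eps * self.n)
        self.tuples.insert(i, [v, 1, delta])
        self.n += 1

    def _compress(self):
        bound = 2 * self.eps * self.n
        i = len(self.tuples) - 2
        while i >= 1:  # never merge away the minimum (index 0)
            v1, g1, d1 = self.tuples[i]
            v2, g2, d2 = self.tuples[i + 1]
            if g1 + g2 + d2 <= bound:
                self.tuples[i + 1] = [v2, g1 + g2, d2]
                del self.tuples[i]
            i -= 1

    def quantile(self, q):
        rank = max(1, math.ceil(q * self.n))
        margin = self.eps * self.n
        r = 0  # minimal rank of the tuples seen so far
        for i, (v, g, d) in enumerate(self.tuples):
            if r + g + d > rank + margin:
                return self.tuples[max(i - 1, 0)][0]
            r += g
        return self.tuples[-1][0]
```

The invariant maintained by `_compress` (every tuple satisfies \(g + \Delta \leq 2 \epsilon n\)) is what guarantees that any returned value has rank within \(\epsilon n\) of the requested one.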


Read more


P² quantile estimator rounding issue



Update: the estimator accuracy could be improved using a bunch of patches.

The P² quantile estimator is a sequential estimator that uses \(O(1)\) memory. Thus, for a given sequence of numbers, it allows estimating quantiles without storing the observed values. I have already written a blog post about this approach and added its implementation to perfolizer. Recently, I got a bug report that revealed a flaw in the original paper. In this post, I’m going to briefly discuss this issue and the corresponding fix.


Read more


Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width



This post aggregates research from several blog posts that I published this year. It presents an overview of the Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width. A corresponding preprint is available on arXiv: arXiv:2111.11776 [stat.ME].

Traditional quantile estimators that are based on one or two order statistics are a common way to estimate distribution quantiles from a given sample. These estimators are robust, but their statistical efficiency is not always good enough. A more efficient alternative is the Harrell-Davis quantile estimator, which uses a weighted sum of all order statistics. While this approach provides more accurate estimations for light-tailed distributions, it’s not robust. To customize the trade-off between statistical efficiency and robustness, we could consider a trimmed modification of the Harrell-Davis quantile estimator. In this approach, we discard the order statistics with low weights according to the highest density interval of the Beta distribution.
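
As a concrete illustration, here is a minimal Python sketch of the classic (untrimmed) Harrell-Davis estimator. It approximates the Beta CDF with Simpson’s rule instead of calling a statistical library, so it’s an illustration of the definition rather than production code:

```python
import math

def beta_pdf(x, a, b):
    # Density of the Beta(a, b) distribution
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_pdf = ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
               + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_pdf)

def beta_mass(a, b, lo, hi, m=200):
    # Beta(a, b) probability mass of [lo, hi] via composite Simpson's rule
    h = (hi - lo) / m
    s = beta_pdf(lo, a, b) + beta_pdf(hi, a, b)
    for j in range(1, m):
        s += (4 if j % 2 else 2) * beta_pdf(lo + j * h, a, b)
    return s * h / 3

def harrell_davis(sample, p):
    # Weighted sum of all order statistics; the weight of x_(i) is the
    # Beta((n+1)p, (n+1)(1-p)) mass of the interval [(i-1)/n, i/n]
    x = sorted(sample)
    n = len(x)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    return sum(beta_mass(a, b, (i - 1) / n, i / n) * x[i - 1]
               for i in range(1, n + 1))
```

Since every order statistic gets a nonzero weight, a single extreme outlier always shifts the estimate, which is exactly the robustness problem that trimming addresses.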


Read more


Optimal window of the trimmed Harrell-Davis quantile estimator, Part 2: Trying Planck-taper window



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In the previous post, I discussed the problem of non-smooth quantile-respectful density estimation (QRDE) which is generated by the trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width. I assumed that non-smoothness was caused by a non-smooth rectangular window which was used to build the truncated beta distribution. In this post, we are going to try another option: the Planck-taper window.
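
The Planck-taper window itself is easy to compute. Below is a small Python sketch following the standard definition, where \(\epsilon\) is the fraction of the window occupied by each taper; the exact boundary conventions here are my assumptions:

```python
import math

def planck_taper(N, eps):
    # Planck-taper window of length N: zero at the edges, flat (1.0)
    # in the middle, with smooth tapers of ~eps*N samples on both sides.
    t = eps * (N - 1)  # taper length in samples
    w = [0.0] * N

    def taper(k):
        # Rising-edge value for 0 < k < t; z goes from +inf (k -> 0)
        # to -inf (k -> t), so w goes smoothly from 0 to 1
        z = t * (1.0 / k + 1.0 / (k - t))
        return 0.0 if z > 700 else 1.0 / (1.0 + math.exp(z))  # avoid overflow

    for i in range(N):
        m = (N - 1) - i  # mirrored index for the right taper
        if i == 0 or m == 0:
            w[i] = 0.0
        elif i < t:
            w[i] = taper(i)
        elif m < t:
            w[i] = taper(m)
        else:
            w[i] = 1.0
    return w
```

Unlike the rectangular window, every derivative of this window vanishes at the taper boundaries, which is the property that makes it a candidate for fixing the smoothness problem.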



Read more


Optimal window of the trimmed Harrell-Davis quantile estimator, Part 1: Problems with the rectangular window



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In the previous post, we obtained a version of the trimmed Harrell-Davis quantile estimator that provides a nice trade-off between robustness and statistical efficiency of quantile estimations. Unfortunately, it has a severe drawback: if we build a quantile-respectful density estimation based on the suggested estimator, we don’t get a smooth density function, as we do in the case of the classic Harrell-Davis quantile estimator:


In this blog post series, we are going to find a way to improve the trimmed Harrell-Davis quantile estimator so that it gives a smooth density function and keeps its advantages in terms of robustness and statistical efficiency.


Read more


Beta distribution highest density interval of the given width



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In one of the previous posts, I discussed the idea of the trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width. Since the Harrell-Davis quantile estimator uses the Beta distribution, we should be able to find the Beta distribution highest density interval of the given width. In this post, I will show how to do this.
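
A minimal Python sketch of the idea, assuming a unimodal Beta density (a, b > 1): because the density is unimodal, the mass of a fixed-width interval \([l, l+w]\) is a unimodal function of \(l\), so a simple ternary search over \(l\) finds the highest density interval. The Simpson-rule integration is an illustrative stand-in for a proper Beta CDF:

```python
import math

def beta_pdf(x, a, b):
    # Density of the Beta(a, b) distribution
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_pdf = ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
               + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_pdf)

def beta_mass(a, b, lo, hi, m=200):
    # Beta(a, b) probability mass of [lo, hi] via Simpson's rule
    h = (hi - lo) / m
    s = beta_pdf(lo, a, b) + beta_pdf(hi, a, b)
    for j in range(1, m):
        s += (4 if j % 2 else 2) * beta_pdf(lo + j * h, a, b)
    return s * h / 3

def beta_hdi_of_width(a, b, w):
    # Ternary search for the left border l maximizing the mass of
    # [l, l + w]; assumes a unimodal density (a, b > 1)
    lo, hi = 0.0, 1.0 - w
    for _ in range(100):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if beta_mass(a, b, m1, m1 + w) < beta_mass(a, b, m2, m2 + w):
            lo = m1
        else:
            hi = m2
    left = (lo + hi) / 2
    return left, left + w
```

For a symmetric Beta(5, 5) and width 0.3, the search converges to the centered interval [0.35, 0.65]; for skewed shapes, the interval slides toward the mode.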


Read more


Quantile estimators based on k order statistics, Part 8: Winsorized Harrell-Davis quantile estimator



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In the previous post, we discussed the trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of size \(\sqrt{n}/n\). This quantile estimator showed a decent level of statistical efficiency. However, the research wouldn’t be complete without a comparison with the winsorized modification. Let’s fix that!


Read more


Quantile estimators based on k order statistics, Part 7: Optimal threshold for the trimmed Harrell-Davis quantile estimator



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In the previous post, we obtained a nice quantile estimator. To be specific, we considered a trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of the given size. The interval size is a parameter that controls the trade-off between statistical efficiency and robustness. While it’s nice to have the ability to control this trade-off, there is also a need for a default value that could be used as a starting point when we have neither estimator breakdown point requirements nor prior knowledge about the distribution properties.

After a series of unsuccessful attempts, it seems that I have found an acceptable solution: we should build the new estimator based on the highest density interval of size \(\sqrt{n}/n\). In this post, I’m going to briefly explain the idea behind the suggested estimator and share some numerical simulations that compare the proposed estimator and the classic Harrell-Davis quantile estimator.
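
Putting the pieces together, the suggested estimator can be sketched in Python as follows. The Beta integrals are approximated with Simpson’s rule and the HDI search uses ternary search, so this is an illustrative sketch of the idea rather than the reference implementation:

```python
import math

def beta_pdf(x, a, b):
    # Density of the Beta(a, b) distribution
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_pdf = ((a - 1) * math.log(x) + (b - 1) * math.log(1 - x)
               + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_pdf)

def beta_mass(a, b, lo, hi, m=200):
    # Beta(a, b) probability mass of [lo, hi] via Simpson's rule
    if hi <= lo:
        return 0.0
    h = (hi - lo) / m
    s = beta_pdf(lo, a, b) + beta_pdf(hi, a, b)
    for j in range(1, m):
        s += (4 if j % 2 else 2) * beta_pdf(lo + j * h, a, b)
    return s * h / 3

def trimmed_harrell_davis(sample, p):
    # Trimmed Harrell-Davis estimator: keep only the order statistics
    # covered by the Beta((n+1)p, (n+1)(1-p)) highest density interval
    # of width sqrt(n)/n, then renormalize the remaining weights
    x = sorted(sample)
    n = len(x)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    width = math.sqrt(n) / n
    # Ternary search for the HDI left border (assumes a unimodal density)
    lo, hi = 0.0, 1.0 - width
    for _ in range(80):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if beta_mass(a, b, m1, m1 + width) < beta_mass(a, b, m2, m2 + width):
            lo = m1
        else:
            hi = m2
    left = (lo + hi) / 2
    right = left + width
    # Weighted sum over the order statistics inside [left, right]
    total, est = 0.0, 0.0
    for i in range(1, n + 1):
        w = beta_mass(a, b, max((i - 1) / n, left), min(i / n, right))
        total += w
        est += w * x[i - 1]
    return est / total
```

Only the order statistics whose Beta-weight intervals intersect the HDI contribute, so roughly \(\sqrt{n}\) sample values participate, which is what gives the estimator its breakdown point.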


Read more


Quantile estimators based on k order statistics, Part 6: Continuous trimmed Harrell-Davis quantile estimator



Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator. A preprint with final results is available on arXiv: arXiv:2111.11776 [stat.ME].

In my previous post, I tried the idea of using the trimmed modification of the Harrell-Davis quantile estimator based on the highest density interval of the given width. The width was defined so that it covers exactly k order statistics (the width equals \((k-1)/n\)). I was pretty satisfied with the result and decided to continue evolving this approach. While “k order statistics” is a good mental model that describes the trimmed interval, the approach doesn’t actually require an integer k. In fact, we can use any real number as the trimming percentage.

In this post, we are going to perform numerical simulations that check the statistical efficiency of the trimmed Harrell-Davis quantile estimator with different trimming percentages.


Read more