## Avoiding over-trimming with the trimmed Harrell-Davis quantile estimator

Previously, I already discussed the trimmed modification of the Harrell-Davis quantile estimator several times. I performed several numerical simulations that compare the statistical efficiency of this estimator with the efficiency of the classic Harrell-Davis quantile estimator (HDQE) and its winsorized modification; I showed how we can improve the efficiency using custom trimming strategies and how to choose a good trimming threshold value.

In the heavy-tailed cases, the trimmed HDQE provides better estimations than the classic HDQE because of its higher breakdown point. However, in the light-tailed cases, we could get efficiency that is worse than the baseline Hyndman-Fan Type 7 (HF7) quantile estimator. In many cases, such an effect arises because of the over-trimming effect. If the trimming percentage is too high or if the evaluated quantile is too far from the median, the trimming strategy based on the highest-density interval may lead to an estimation that is based on single order statistics. In this case, we get an efficiency level similar to the Hyndman-Fan Type 1-3 quantile estimators (which are also based on single order statistics). In the light-tailed case, such a result is less preferable than Hyndman-Fan Type 4-9 quantile estimators (which are based on two subsequent order statistics).

In order to improve the situation, we could introduce the lower bound for the number of order statistics that contribute to the final quantile estimations. In this post, I look at some numerical simulations that compare trimmed HDQEs with different lower bounds.

## Optimal threshold of the trimmed Harrell-Davis quantile estimator

The traditional quantile estimators (which are based on 1 or 2 order statistics) have great robustness. However, the statistical efficiency of these estimators is not so great. The Harrell-Davis quantile estimator has much better efficiency (at least in the light-tailed case), but it’s not robust (because it calculates a weighted sum of all sample values). I already wrote a post about trimmed Harrell-Davis quantile estimator: this approach suggest dropping some of the low-weight sample values to improve robustness (keeping good statistical efficiency). I also perform a numerical simulations that compare efficiency of the original Harrell-Davis quantile estimator against its trimmed and winsorized modifications. It’s time to discuss how to choose the optimal trimming threshold and how it affects the estimator efficiency.

## Estimating quantile confidence intervals: Maritz-Jarrett vs. jackknife

When it comes to estimating quantiles of the given sample, my estimator of choice is the Harrell-Davis quantile estimator (to be more specific, its trimmed version). If I need to get a confidence interval for the obtained quantiles, I use the Maritz-Jarrett method because it provides a decent coverage percentage. Both approaches work pretty nicely together.

However, in the original paper by Harrell and Davis (1982), the authors suggest using the jackknife variance estimator in order to get the confidence intervals. The obvious question here is which approach better: the Maritz-Jarrett method or the jackknife estimator? In this post, I perform a numerical simulation that compares both techniques using different distributions.

## Using Kish's effective sample size with weighted quantiles

In my previous posts, I described how to calculate weighted quantiles and their confidence intervals using the Harrell-Davis quantile estimator. This powerful technique allows applying quantile exponential smoothing and dispersion exponential smoothing for time series in order to get its moving properties.

When we work with weighted samples, we need a way to calculate the effective samples size. Previously, I used the sum of all weights normalized by the maximum weight. In most cases, it worked OK.

Recently, Ben Jann pointed out that it would be better to use the Kish’s formula to calculate the effective sample size. In this post, you find the formula and a few numerical simulations that illustrate the actual impact of the underlying sample size formula.

## Partial binning compression of performance series

Let’s start with a problem from real life. Imagine we have thousands of application components that should be initialized. We care about the total initialization time of the whole application, so we want to automatically track the slowest components using a continuous integration (CI) system. The easiest way to do it is to measure the initialization time of each component in each CI build and save all the measurements to a database. Unfortunately, if the total number of components is huge, the overall artifact size may be quite extensive. Thus, this approach may introduce an unwanted negative impact on the database size and data processing time.

However, we don’t actually need all the measurements. We want to track only the slowest components. Typically, it’s possible to introduce a reasonable threshold that defines such components. For example, we can say that all components that are initialized in less than 1ms are “fast enough,” so there is no need to know the exact initialization time for them. Since these time values are insignificant, we can just omit all the measurements below the given thresholds. This allows to significantly reduce the data traffic without losing any important information.

The suggested trick can be named partial binning compression. Indeed, we introduce a single bin (perform binning) and omit all the values inside this bin (perform compression). On the other hand, we don’t build an honest histogram since we keep all the raw values outside the given bin (the binning is partial).

Let’s discuss a few aspects of using partial binning compression.

## Calculating gamma effect size for samples with zero median absolute deviation

In previous posts, I discussed the gamma effect size which is a Cohen’s d-consistent nonparametric and robust measure of the effect size. Also, I discussed various ways to customize this metric and adjust it to different kinds of business requirements. In this post, I want to briefly cover one more corner case that requires special adjustments. We are going to discuss the situation when the median absolute deviation is zero.

## Discrete performance distributions

When we collect software performance measurements, we get a bunch of time intervals. Typically, we tend to interpret time values as continuous values. However, the obtained values are actually discrete due to the limited resolution of our measurement tool. In simple cases, we can treat these discrete values as continuous and get meaningful results. Unfortunately, discretization may produce strange phenomena like pseudo-multimodality or zero dispersion. If we want to set up a reliable system that automatically analyzes such distributions, we should be aware of such problems so we could correctly handle them.

In this post, I want to share a few of discretization problems in real-life performance data sets (based on the Rider performance tests).

## Customization of the nonparametric Cohen's d-consistent effect size

One year ago, I publish a post called Nonparametric Cohen's d-consistent effect size. During this year, I got a lot of internal and external feedback from my own statistical experiments and people who tried to use the suggested approach. It seems that the nonparametric version of Cohen’s d works much better with real-life not-so-normal data. While the classic Cohen’s d based on the non-robust arithmetic mean and the non-robust standard deviation can be easily corrupted by a single outlier, my approach is much more resistant to unexpected extreme values. Also, it allows exploring the difference between specific quantiles of considered samples, which can be useful in the non-parametric case.

However, I wasn’t satisfied with the results of all of my experiments. While I still like the basic idea (replace the mean with the median; replace the standard deviation with the median absolute deviation), it turned out that the final results heavily depend on the used quantile estimator. To be more specific, the original Harrell-Davis quantile estimator is not always optimal; in most cases, it’s better to replace it with its trimmed modification. However, the particular choice of the quantile estimators depends on the situation. Also, the consistency constant for the median absolute deviation should be adjusted according to the current sample size and the used quantile estimator. Of course, it also can be replaced by other dispersion estimators that can be used as consistent estimators of the standard deviation.

In this post, I want to get a brief overview of possible customizations of the suggested metrics.

## Robust alternative to statistical efficiency

Statistical efficiency is a common measure of the quality of an estimator. Typically, it’s expressed via the mean square error ($$\operatorname{MSE}$$). For the given estimator $$T$$ and the true parameter value $$\theta$$, the $$\operatorname{MSE}$$ can be expressed as follows:

$\operatorname{MSE}(T) = \operatorname{E}[(T-\theta)^2]$

In numerical simulations, the $$\operatorname{MSE}$$ can’t be used as a robust metric because its breakdown point is zero (a corruption of a single measurement leads to a corrupted result). Typically, it’s not a problem for light-tailed distributions. Unfortunately, in the heavy-tailed case, the $$\operatorname{MSE}$$ becomes an unreliable and unreproducible metric because it can be easily spoiled by a single outlier.

I suggest an alternative way to compare statistical estimators. Instead of using non-robust $$\operatorname{MSE}$$, we can use robust quantile estimations of the absolute error distribution. In this post, I want to share numerical simulations that show a problem of irreproducible $$\operatorname{MSE}$$ values and how they can be replaced by reproducible quantile values.

## Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategies

Let’s say we want to estimate the median based on a small sample (3 $$\leq n \leq 7$$) from a right-skewed heavy-tailed distribution with high statistical efficiency.

The traditional median estimator is the most robust estimator, but it’s not the most efficient one. Typically, the Harrell-Davis quantile estimator provides better efficiency, but it’s not robust (its breakdown point is zero), so it may have worse efficiency in the given case. The winsorized and trimmed modifications of the Harrell-Davis quantile estimator provide a good trade-off between efficiency and robustness, but they require a proper winsorizing/trimming rule. A reasonable choice of such a rule for medium-size samples is based on the highest density interval of the Beta function (as described here). Unfortunately, this approach may be suboptimal for small samples. E.g., if we use the 99% highest density interval to estimate the median, it starts to trim sample values only for $$n \geq 8$$.

In this post, we are going to discuss custom winsorizing/trimming strategies for special cases of the quantile estimation problem.