## Thoughts about outlier removal and climate change

When it comes to outlier removal, people typically jump straight to discussing various techniques for removing outliers from a given sample. However, such a discussion should start with the question, “Why do we want to remove outliers?” Outliers may carry essential information about the underlying distribution, so we do not always want to discard them. If we blindly remove all of them, we may miss important insights. Before choosing the best outlier detector, we should understand the nature of the outliers: what kind of values do we recognize as outliers, and what useful information can they provide? If our estimators are non-robust and can be affected by these outliers, we may also consider replacing them with robust estimators that do not have this problem. Meanwhile, additional analysis of extreme values can provide useful insights for anomaly detection and tail approximation.
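As a tiny illustration of the robust-estimator point (a sketch of my own, not an example from the post): a single extreme value drags the mean far away, while the median barely notices it.

```python
import numpy as np

# A well-behaved sample plus one extreme value
sample = np.array([48.0, 49.0, 50.0, 51.0, 52.0, 500.0])

# The mean (non-robust) is heavily affected by the outlier...
print(np.mean(sample))    # 125.0
# ...while the median (robust) stays close to the bulk of the data
print(np.median(sample))  # 50.5
```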

To illustrate the danger of blind outlier removal, I would like to share a fragment from the book “Our Changing Climate” (1991) by R. Kandel:

Read more

## Nonparametric effect size: Cohen's d vs. Glass's delta

In the previous posts, I discussed the idea of nonparametric effect size measures consistent with Cohen’s d under normality. However, Cohen’s d is not always the best effect size measure, even in the normal case.

In this post, we briefly discuss a case study in which a nonparametric version of Glass’s delta is preferable to the previously suggested Cohen’s d-consistent measure.
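For reference, the classic parametric definitions of the two measures can be sketched as follows (a NumPy sketch; the nonparametric versions discussed in the post replace the means and standard deviations with robust counterparts):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference of means over the pooled standard deviation."""
    nx, ny = len(x), len(y)
    sx2, sy2 = np.var(x, ddof=1), np.var(y, ddof=1)
    pooled_sd = np.sqrt(((nx - 1) * sx2 + (ny - 1) * sy2) / (nx + ny - 2))
    return (np.mean(y) - np.mean(x)) / pooled_sd

def glass_delta(x, y):
    """Glass's delta: difference of means over the standard deviation
    of the control group only (x is treated as the control group)."""
    return (np.mean(y) - np.mean(x)) / np.std(x, ddof=1)
```

Cohen’s d pools the dispersion of both groups, while Glass’s delta normalizes by the control group alone, which matters when the treatment changes not only the location but also the dispersion.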

Read more

## Trinal statistical thresholds

When we design a test for practical significance that compares two samples, we should somehow express the threshold. The most popular options are the shift, the ratio, and the effect size. Unfortunately, if we have little information about the underlying distributions, it’s hard to build a reliable test based on a single threshold alone, and it’s almost impossible to define a generic threshold that fits all situations. After struggling with many different thresholding approaches, I came up with the idea of a trinal threshold that combines three individual thresholds: one for the shift, one for the ratio, and one for the effect size.

In this post, I show some examples in which a single threshold is not enough.
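A minimal sketch of such a trinal check (the estimator choices here, medians and a pooled MAD for the effect size, are my own illustrative assumptions, not a prescription from the post):

```python
import numpy as np

def practically_significant(x, y, min_shift, min_ratio, min_effect_size):
    """Trinal threshold: report practical significance only when the shift,
    the ratio, and the effect size all exceed their individual thresholds."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = np.median(x), np.median(y)
    shift = my - mx
    ratio = my / mx  # assumes a positive-valued metric
    # Pooled MAD as a robust scale estimate for the effect size
    pooled_mad = np.median(np.abs(np.concatenate([x - mx, y - my])))
    effect_size = shift / pooled_mad if pooled_mad > 0 else np.inf
    return bool(shift > min_shift and ratio > min_ratio and
                effect_size > min_effect_size)
```

Requiring all three conditions at once makes the test conservative: a huge relative change on a tiny absolute scale (or vice versa) no longer triggers a false alarm on its own.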

Read more

## Trimmed Hodges-Lehmann location estimator, Part 2: Gaussian efficiency

In the previous post, we introduced the trimmed Hodges-Lehmann location estimator. For a sample \(\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}\), it is defined as follows:

\[\operatorname{THL}(\mathbf{x}, k) = \underset{k < i < j \leq n - k}{\operatorname{median}}\biggl(\frac{x_{(i)} + x_{(j)}}{2}\biggr). \]

We also derived the exact expression for its asymptotic and finite-sample breakdown point values. In this post, we explore its Gaussian efficiency.
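A direct implementation of the definition can be sketched as follows (indices are 1-based in the formula and 0-based in the code):

```python
import numpy as np

def thl(x, k):
    """Trimmed Hodges-Lehmann location estimator: the median of pairwise
    averages of order statistics with 1-based indices k < i < j <= n - k."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    pairs = [(xs[i] + xs[j]) / 2
             for i in range(k, n - k)        # 0-based i in [k, n-k-1]
             for j in range(i + 1, n - k)]   # 0-based j in [i+1, n-k-1]
    return np.median(pairs)
```

With \(k = 0\), this reduces to the classic Hodges-Lehmann estimator over all pairs with \(i < j\); a positive \(k\) simply discards the \(k\) smallest and \(k\) largest order statistics before forming the pairwise averages.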

Read more

## Trimmed Hodges-Lehmann location estimator, Part 1: breakdown point

For a sample \(\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}\), the Hodges-Lehmann location estimator is defined as follows:

\[\operatorname{HL}(\mathbf{x}) = \underset{i < j}{\operatorname{median}}\biggl(\frac{x_i + x_j}{2}\biggr). \]

Its asymptotic Gaussian efficiency is \(\approx 96\%\), while its asymptotic breakdown point is \(\approx 29\%\). This makes the Hodges-Lehmann location estimator a decent robust alternative to the mean.
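The definition above can be sketched in a few lines of NumPy:

```python
import numpy as np

def hodges_lehmann(x):
    """Hodges-Lehmann location estimator: the median of all
    pairwise averages (x_i + x_j) / 2 with i < j."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x), k=1)  # all index pairs with i < j
    return np.median((x[i] + x[j]) / 2)
```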

While the Gaussian efficiency is quite impressive (almost as efficient as the mean), the breakdown point is not as great as in the case of the median (which has a breakdown point of \(50\%\)). Could we change this trade-off a little bit and make this estimator more robust, sacrificing a small portion of efficiency? Yes, we can!

In this post, I want to present the idea of the trimmed Hodges-Lehmann location estimator and provide the exact equation for its breakdown point.

Read more

## Median of the shifts vs. shift of the medians, Part 2: Gaussian efficiency

In the previous post, we discussed the difference between the shift of the medians and the Hodges-Lehmann location shift estimator. In this post, we conduct a simple numerical simulation to evaluate the Gaussian efficiency of these two estimators.
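The kind of simulation the post describes can be sketched as follows (the sample size, the iteration count, and the difference-of-means baseline are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n, iterations = 30, 2000

sm_estimates, hl_estimates = [], []
for _ in range(iterations):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    sm_estimates.append(np.median(y) - np.median(x))         # shift of the medians
    hl_estimates.append(np.median(y[:, None] - x[None, :]))  # Hodges-Lehmann shift

# Gaussian efficiency relative to the difference of means, whose
# variance for two independent N(0,1) samples of size n is 2/n
baseline_var = 2 / n
efficiency_sm = baseline_var / np.var(sm_estimates)
efficiency_hl = baseline_var / np.var(hl_estimates)
```

On normal data, the Hodges-Lehmann shift should come out noticeably more efficient than the shift of the medians, mirroring the one-sample efficiencies of the underlying estimators.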

Read more

## Median of the shifts vs. shift of the medians, Part 1

Let us say that we have two samples
\(x = \{ x_1, x_2, \ldots, x_n \}\),
\(y = \{ y_1, y_2, \ldots, y_m \}\),
and we want to estimate the shift of locations between them.
In the case of the normal distribution, this task is quite simple
and has a lot of straightforward solutions.
However, in the nonparametric case, the location shift is an ambiguous metric
which heavily depends on the chosen estimator.
In the context of this post, we consider two approaches that may look similar.
The first one is the **s**hift of the **m**edians:

\[\newcommand{\DSM}{\Delta_{\operatorname{SM}}} \DSM = \operatorname{median}(y) - \operatorname{median}(x). \]

The second one is the median of all pairwise shifts,
also known as the **H**odges-**L**ehmann location shift estimator:

\[\newcommand{\DHL}{\Delta_{\operatorname{HL}}} \DHL = \underset{i, j}{\operatorname{median}}(y_j - x_i). \]

In the case of normal distributions, these estimators are consistent with each other. However, this post will show an example of multimodal distributions that lead to opposite signs of \(\DSM\) and \(\DHL\).
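Both estimators take only a few lines of NumPy, and a toy pair of bimodal samples (my own construction, not the example from the post) is already enough to produce opposite signs:

```python
import numpy as np

def shift_of_medians(x, y):
    """Difference of the two sample medians."""
    return np.median(y) - np.median(x)

def hl_shift(x, y):
    """Hodges-Lehmann shift: median of all pairwise differences y_j - x_i."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.median(y[:, None] - x[None, :])

x = [0.0, 0.0, 10.0, 10.0, 10.0]
y = [1.0, 1.0, 1.0, 11.0, 11.0]
print(shift_of_medians(x, y))  # -9.0
print(hl_shift(x, y))          # 1.0
```

Here the medians land in different modes of the two samples, so the shift of the medians is large and negative, while the typical pairwise shift is small and positive.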

Read more

## Resistance to the low-density regions: the Hodges-Lehmann location estimator

In the previous posts, I discussed the concept of a resistance function that shows the sensitivity of the given estimator to the low-density regions. I already showed how this function behaves for the mean, the sample median, and the Harrell-Davis median. In this post, I explore this function for the Hodges-Lehmann location estimator.

Read more

## Kernel density estimation boundary correction: reflection (ggplot2 v3.4.0)

Kernel density estimation (KDE) is a popular way to approximate a distribution based on the given data.
However, it has several flaws.
One of the most significant flaws is that it extends the support of the distribution.
It is pretty unfortunate: even if we know the actual range of supported values,
KDE provides non-zero density values for the regions where no values exist.
It is obviously an inaccurate estimation.
The procedure of adjusting the KDE values according to the given boundaries is known as *boundary correction*.
As usual, there are plenty of available boundary correction strategies.

One such strategy was implemented in the
v3.4.0 update of
ggplot2 (a popular R package for plotting)
thanks to pull request #4013.
At the moment, it supports a single boundary correction strategy called *reflection*.
In this post, we discuss this approach and see how it works in practice.
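The reflection idea itself is easy to sketch outside of ggplot2 (a Python/SciPy sketch of the general technique, not of the ggplot2 implementation): the density mass that leaks below the boundary is mirrored back inside the support.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_with_reflection(sample, grid, lower):
    """KDE with reflection boundary correction at a lower bound:
    add the plain KDE evaluated at the mirrored points 2*lower - t,
    then zero out the density below the boundary."""
    kde = gaussian_kde(sample)
    density = kde(grid) + kde(2 * lower - grid)
    density[grid < lower] = 0.0
    return density

rng = np.random.default_rng(42)
sample = rng.exponential(scale=1.0, size=500)  # support is [0, +inf)
grid = np.linspace(0.0, 10.0, 1001)
density = kde_with_reflection(sample, grid, lower=0.0)
```

Unlike the plain KDE, the corrected density assigns no mass to negative values, and the folded-back mass keeps the total integral at one.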

Read more

## Sheather & Jones vs. unbiased cross-validation

In the post about the importance of kernel density estimation bandwidth, I reviewed several bandwidth selectors and showed their impact on the KDE. The classic selectors like Scott’s rule of thumb or Silverman’s rule of thumb are designed for the normal distribution and perform poorly in nonparametric cases. One of the most significant caveats is that they can mask multimodality. The same problem is also relevant to the biased cross-validation method. Among all the bandwidth selectors available in R, only Sheather & Jones and unbiased cross-validation provide reliable results in the multimodal case. However, I always advocate using the Sheather & Jones method rather than the unbiased cross-validation approach.

In this post, I will show the drawbacks of the unbiased cross-validation method and what kind of problems we can get if we use it as a KDE bandwidth selector.
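SciPy does not ship the Sheather & Jones selector, but the masking phenomenon itself is easy to reproduce (a sketch with a deliberately oversized bandwidth factor standing in for an oversmoothing rule of thumb):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)
# A clearly bimodal sample: two normal components centered at 0 and 4
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
grid = np.linspace(-4, 8, 2001)

def count_modes(density):
    """Number of strict local maxima on the evaluation grid."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

modes_silverman = count_modes(gaussian_kde(sample, bw_method="silverman")(grid))
modes_oversmoothed = count_modes(gaussian_kde(sample, bw_method=1.0)(grid))
```

With an adequate bandwidth the KDE shows both modes; the oversmoothed version merges them into a single hump, which is exactly the failure mode a good bandwidth selector should avoid.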

Read more