## Sporadic noise problem in change point detection

We consider the problem of change point detection at the end of a time series. Suppose we systematically monitor readings of an indicator, and we want to react to noticeable changes in the measured values as fast as possible. When there are no changes in the underlying distribution, any alerts about detected change points should be considered false positives. Typically, in such problems, we adopt the i.i.d. assumption, which states that in the absence of change points, all the measurements are independent and identically distributed. Such an assumption significantly simplifies the mathematical model, but unfortunately, it is rarely fully satisfied in real life. If we want to build a reliable change point detection system, it is important to be aware of possible real-life artifacts that introduce deviations from the declared model. In this post, I discuss the problem of sporadic noise.

Read more

## Resistance to the low-density regions: the Hodges-Lehmann location estimator based on the Harrell-Davis quantile estimator

Previously, I have discussed the resistance to low-density regions of various estimators, including the Hodges-Lehmann location estimator (\(\operatorname{HL}\)). In general, \(\operatorname{HL}\) is a great estimator with high statistical efficiency and a decent breakdown point. Unfortunately, it has low resistance to the low-density regions around the \(29^\textrm{th}\) and \(71^\textrm{st}\) percentiles, which may cause trouble in the case of multimodal distributions. I am trying to find a modification of \(\operatorname{HL}\) that performs almost the same as the original \(\operatorname{HL}\) but has increased resistance. One of the ideas I had was using the Harrell-Davis quantile estimator instead of the sample median to evaluate \(\operatorname{HL}\). Regrettably, this idea did not turn out to be successful: such an estimator has a resistance function similar to that of the original \(\operatorname{HL}\). I believe that it is important to share negative results, and therefore this post contains a set of plots that illustrate the results of the relevant numerical simulations.
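
For readers who want to experiment with the idea, here is a minimal pure-Python sketch of the discussed modification: the Harrell-Davis estimator applied at \(p = 0.5\) to the set of pairwise averages. To stay dependency-free, the beta CDF is approximated with a crude midpoint-rule integration rather than a library routine, so treat this as an illustration, not production code.

```python
from math import exp, lgamma

def beta_cdf(x, a, b, steps=4000):
    """Regularized incomplete beta function I_x(a, b) via midpoint-rule
    integration (crude but dependency-free; adequate for a, b > 1 as used below)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_c = lgamma(a + b) - lgamma(a) - lgamma(b)
    h = x / steps
    s = sum(((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
            for i in range(steps))
    return min(1.0, exp(log_c) * h * s)

def harrell_davis_median(xs):
    """Harrell-Davis estimate of the median: a Beta-weighted sum of order statistics."""
    xs = sorted(xs)
    n = len(xs)
    a = b = (n + 1) * 0.5  # Beta parameters for p = 0.5
    cdf = [beta_cdf(i / n, a, b) for i in range(n + 1)]
    return sum((cdf[i + 1] - cdf[i]) * xs[i] for i in range(n))

def hl_hd(xs):
    """HL-style estimator: Harrell-Davis median over all pairwise averages (i <= j)."""
    n = len(xs)
    walsh = [(xs[i] + xs[j]) / 2 for i in range(n) for j in range(i, n)]
    return harrell_davis_median(walsh)
```

For a symmetric sample, the estimate coincides with the center of symmetry up to integration error, e.g., `hl_hd([1, 2, 3, 4, 5])` is (numerically) \(3\).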

Read more

## Median vs. Hodges-Lehmann: comparing efficiency under heavy-tailedness

In the previous post, I shared some thoughts on how to evaluate the statistical efficiency of estimators under heavy-tailed distributions. In this post, I apply the described ideas to compare the actual efficiency values of the mean, the sample median, and the Hodges-Lehmann location estimator under various distributions.

Read more

## Thoughts about robustness and efficiency

Statistical efficiency is an essential characteristic that has to be taken into account when we choose between different estimators. When the underlying distribution is normal, or at least light-tailed, evaluating the statistical efficiency typically is not so hard. However, when the underlying distribution is heavy-tailed, problems appear. The statistical efficiency is usually expressed via the mean squared error or via the variance, which are not robust. Therefore, heavy-tailedness may lead to distorted or even infinite efficiency values, which is quite impractical. So, how do we compare the efficiency of estimators under a heavy-tailed distribution?

Let’s say we want to compare the efficiency of the mean and the median. Under the normal distribution (so-called Gaussian efficiency), this task is trivial: we build the sampling distribution of the mean and the sampling distribution of the median, estimate the variance of each of them, and then take the ratio of these variances. However, if we are interested in the median, we are probably expecting some outliers. Most of the significant real-life outliers come from heavy-tailed distributions. Therefore, Gaussian efficiency is not the most interesting metric: it makes sense to evaluate the efficiency of the considered estimators under various heavy-tailed distributions.

Unfortunately, the variance is not a robust measure and is too sensitive to tails: if the sampling distribution is also non-normal or even heavy-tailed, the meaningfulness of the true variance value decreases. It seems reasonable to consider alternative robust measures of dispersion. Which one should we choose? Maybe the Median Absolute Deviation (MAD)? Well, the asymptotic Gaussian efficiency of MAD is only ~37%, and here we have the same problem: should we trust the Gaussian efficiency under heavy-tailedness? Therefore, we should first evaluate the efficiency of dispersion estimators. But we can’t do that without a previously chosen dispersion estimator! And could we truly express the actual relative efficiency of two estimators under tricky asymmetric multimodal heavy-tailed distributions using a single number?
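
To make the "trivial" Gaussian case concrete, here is a small simulation sketch (pure Python; the sample size and replication count are my arbitrary choices) that estimates the efficiency of the sample median relative to the mean as the ratio of the variances of their sampling distributions under normality.

```python
import random
import statistics

def gaussian_efficiency_of_median(n=10, replications=20_000, seed=42):
    """Estimate the Gaussian efficiency of the sample median relative to the mean:
    Var(sampling distribution of the mean) / Var(sampling distribution of the median)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(replications):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(xs))
        medians.append(statistics.median(xs))
    return statistics.pvariance(means) / statistics.pvariance(medians)
```

For small \(n\), the estimate lands somewhat above the classic asymptotic value \(2/\pi \approx 0.64\); the gap between finite-sample and asymptotic efficiency is exactly the kind of detail such simulations reveal.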

Read more

## Finite-sample Gaussian efficiency: Quantile absolute deviation vs. Rousseeuw-Croux scale estimators

In this post, we discuss the finite-sample Gaussian efficiency of various robust dispersion estimators. The classic standard deviation has the highest possible Gaussian efficiency of \(100\%\), but it is not robust: a single outlier can completely destroy the estimation. A typical robust alternative to the standard deviation is the Median Absolute Deviation (\(\operatorname{MAD}\)). While the \(\operatorname{MAD}\) is highly robust (its breakdown point is \(50\%\)), it is not efficient: its asymptotic Gaussian efficiency is only \(37\%\). Common alternatives to the \(\operatorname{MAD}\) are the Rousseeuw-Croux \(S_n\) and \(Q_n\) scale estimators, which provide higher efficiency while keeping the breakdown point of \(50\%\). In one of my recent preprints, I introduced the concept of the Quantile Absolute Deviation (\(\operatorname{QAD}\)) and its special cases: the Standard Quantile Absolute Deviation (\(\operatorname{SQAD}\)) and the Optimal Quantile Absolute Deviation (\(\operatorname{OQAD}\)). Let us review the finite-sample and asymptotic Gaussian efficiency values of these estimators.
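
As an illustration of what "finite-sample Gaussian efficiency" means operationally, here is a simplified simulation sketch of my own (not the code behind the post): it estimates the efficiency of \(\operatorname{MAD}\) relative to the standard deviation via the ratio of the standardized variances of the two sampling distributions.

```python
import random
import statistics

def mad(xs):
    """Median absolute deviation around the median (unscaled)."""
    m = statistics.median(xs)
    return statistics.median(abs(x - m) for x in xs)

def mad_gaussian_efficiency(n=100, replications=5000, seed=1):
    """Finite-sample Gaussian efficiency of MAD relative to the standard deviation,
    using the standardized variance Var(T) / E(T)^2 to make scale estimators comparable."""
    rng = random.Random(seed)
    sds, mads = [], []
    for _ in range(replications):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        sds.append(statistics.stdev(xs))
        mads.append(mad(xs))
    def std_var(vals):  # standardized variance: Var(T) / E(T)^2
        return statistics.pvariance(vals) / statistics.fmean(vals) ** 2
    return std_var(sds) / std_var(mads)
```

For moderate sample sizes, the result is close to the asymptotic \(37\%\) mentioned above.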

Read more

## Mann-Whitney U test and heteroscedasticity

The Mann-Whitney U test is a good nonparametric test, which mostly targets changes in location. However, it doesn’t properly support all types of differences between two distributions. Specifically, it poorly handles changes in variance. In this post, I briefly discuss its behavior in reaction to scaling a distribution without introducing location changes.
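
The described behavior is easy to reproduce in a quick simulation. The sketch below (a normal approximation of the U statistic, no tie handling; all parameter values are my arbitrary choices) estimates the rejection rate when the second sample differs only in scale: the rate stays close to the nominal \(5\%\) level instead of growing with the scale ratio.

```python
import random
from math import sqrt

def mann_whitney_rejects(xs, ys, z_crit=1.96):
    """Two-sided Mann-Whitney U test via the normal approximation (assumes no ties)."""
    m, n = len(xs), len(ys)
    u = sum(1 for x in xs for y in ys if x < y)
    mean = m * n / 2
    sd = sqrt(m * n * (m + n + 1) / 12)
    return abs((u - mean) / sd) > z_crit

def rejection_rate(scale, m=30, n=30, replications=2000, seed=7):
    """Share of rejections when the second sample is only scaled (no location shift)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ys = [rng.gauss(0.0, scale) for _ in range(n)]
        hits += mann_whitney_rejects(xs, ys)
    return hits / replications
```

Even with a threefold scale difference, `rejection_rate(3.0)` remains low: the test has essentially no power against pure scale alternatives with matching medians.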

Read more

## Exploring the power curve of the Ansari-Bradley test

The Ansari-Bradley test is a popular rank-based nonparametric test for a difference in scale/dispersion parameters. In this post, we explore its power curve in a numerical simulation.
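
As a companion sketch (a simplified setup of my own, not the simulation from the post), the code below estimates points of such a power curve with a permutation version of the Ansari-Bradley test: rank the pooled sample, score each observation as \(\min(\textrm{rank},\, N + 1 - \textrm{rank})\), and compare the sum of the first sample's scores against its permutation distribution.

```python
import random

def ansari_bradley_stat(xs, ys):
    """Sum of Ansari-Bradley scores min(rank, N + 1 - rank) over the first sample
    (ranks are taken in the pooled sample; ties are assumed absent)."""
    pooled = sorted(xs + ys)
    n_total = len(pooled)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    return sum(min(rank[x], n_total + 1 - rank[x]) for x in xs)

def ab_power(scale, m=25, n=25, reps=200, perms=200, alpha=0.05, seed=3):
    """Monte Carlo estimate of the rejection rate when the second sample is scaled."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(m)]
        ys = [rng.gauss(0.0, scale) for _ in range(n)]
        observed = ansari_bradley_stat(xs, ys)
        pooled = xs + ys
        le = ge = 0
        for _ in range(perms):
            rng.shuffle(pooled)
            s = ansari_bradley_stat(pooled[:m], pooled[m:])
            le += s <= observed
            ge += s >= observed
        p_value = 2 * min(le, ge) / perms  # crude two-sided permutation p-value
        rejections += p_value < alpha
    return rejections / reps
```

Evaluating `ab_power` over a grid of scale ratios yields a rough power curve; at `scale=1.0` the rejection rate stays near \(\alpha\), and it grows as the scale ratio moves away from one.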

Read more

## Exploring the power curve of the Lepage test

Previously, I discussed the Cucconi test. In this post, I continue the topic of nonparametric tests and check out the Lepage test.

Read more

## Weighted Hodges-Lehmann location estimator and mixture distributions

The classic non-weighted Hodges-Lehmann location estimator of a sample \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) is defined as follows:

\[\operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right), \]

where \(\operatorname{median}\) is the sample median. Previously, we have defined a weighted version of the Hodges-Lehmann location estimator as follows:

\[\operatorname{WHL}(\mathbf{x}, \mathbf{w}) = \underset{1 \leq i \leq j \leq n}{\operatorname{wmedian}} \left(\frac{x_i + x_j}{2},\; w_i \cdot w_j \right), \]

where \(\mathbf{w} = (w_1, w_2, \ldots, w_n)\) is the vector of weights and \(\operatorname{wmedian}\) is the weighted median. For simplicity, within the scope of the current post, the Hyndman-Fan Type 7 quantile estimator is used as the base for the weighted median.

In this post, we consider a numerical simulation in which we compare the sampling distributions of \(\operatorname{HL}\) and \(\operatorname{WHL}\) in the case of a mixture distribution.
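
To make the definitions concrete, here is a pure-Python sketch of both estimators. For brevity, the weighted median below is a plain "cumulative weight" median rather than the Hyndman-Fan Type 7 based weighted quantile used in the post, so its results may differ slightly from the post's simulation.

```python
import statistics

def weighted_median(values, weights):
    """Simplified weighted median: sort by value and return the first value whose
    cumulative weight reaches half of the total weight.
    (Not the Hyndman-Fan Type 7 based weighted quantile from the post.)"""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= total / 2:
            return v

def hodges_lehmann(xs):
    """Classic HL: median of all pairwise averages (i <= j)."""
    n = len(xs)
    walsh = [(xs[i] + xs[j]) / 2 for i in range(n) for j in range(i, n)]
    return statistics.median(walsh)

def weighted_hodges_lehmann(xs, ws):
    """WHL: weighted median of pairwise averages with weights w_i * w_j."""
    n = len(xs)
    pairs = [((xs[i] + xs[j]) / 2, ws[i] * ws[j])
             for i in range(n) for j in range(i, n)]
    return weighted_median([v for v, _ in pairs], [w for _, w in pairs])
```

With equal weights, \(\operatorname{WHL}\) reduces to (approximately) the classic \(\operatorname{HL}\); the interesting behavior appears once the weights down-weight one of the mixture components.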

Read more

## Carling’s Modification of Tukey’s Fences

Let us consider the classic problem of outlier detection in a one-dimensional sample. One of the most popular approaches is Tukey’s fences, which define the following range:

\[[Q_1 - k(Q_3 - Q_1);\; Q_3 + k(Q_3 - Q_1)], \]

where \(Q_1\) and \(Q_3\) are the first and the third quartiles of the given sample.

All the values outside the given range are classified as outliers. The typical values of \(k\) are \(1.5\) for “usual outliers” and \(3.0\) for “far out” outliers. In the classic Tukey’s fences approach, \(k\) is a predefined constant. However, there are alternative approaches that define \(k\) dynamically based on the given sample. One of the possible variations of Tukey’s fences is Carling’s modification, which defines \(k\) as follows:

\[k = \frac{17.63n - 23.64}{7.74n - 3.71}, \]

where \(n\) is the sample size.

In this post, we compare the classic Tukey’s fences with \(k=1.5\) and \(k=3.0\) against Carling’s modification.
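
Both variants are straightforward to implement. Here is a minimal sketch, using Python's `statistics.quantiles(..., method='inclusive')` (the Type 7 style quartiles) rather than whatever quantile estimator the post's simulation uses:

```python
import statistics

def tukey_fences(xs, k):
    """Decision range [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(xs, n=4, method='inclusive')
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def carling_k(n):
    """Carling's sample-size-dependent multiplier."""
    return (17.63 * n - 23.64) / (7.74 * n - 3.71)

def outliers(xs, k):
    """All values outside the fences are classified as outliers."""
    lo, hi = tukey_fences(xs, k)
    return [x for x in xs if x < lo or x > hi]

data = list(range(1, 21)) + [100]
print(outliers(data, 1.5))                   # → [100], classic "usual outliers" fence
print(outliers(data, carling_k(len(data))))  # → [100], Carling's dynamic k ≈ 2.18
```

Note that for \(n = 21\), Carling's \(k \approx 2.18\) sits between the two classic constants, and as \(n \to \infty\) it approaches \(17.63 / 7.74 \approx 2.28\).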

Read more