Fence-based outlier detectors, Part 2

In the previous post, I discussed different fence-based outlier detectors. In this post, I show some examples of these detectors with different parameters.


Fence-based outlier detectors, Part 1

In previous posts, I discussed the properties of Tukey’s fences and the asymmetric decile-based outlier detector (Part 1, Part 2). In this post, I discuss a generalization of fence-based outlier detectors.


Publication announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'

Since the beginning of the previous year, I have been working on building a quantile estimator that provides an optimal trade-off between statistical efficiency and robustness. At the end of the year, I published the corresponding preprint where I presented a description of such an estimator: arXiv:2111.11776 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-thdqe.

Finally, the paper was published in Communications in Statistics - Simulation and Computation. You can cite it as follows:

  • Andrey Akinshin (2022) Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width, Communications in Statistics - Simulation and Computation, DOI: 10.1080/03610918.2022.2050396


Asymmetric decile-based outlier detector, Part 2

In the previous post, I suggested an asymmetric decile-based outlier detector as an alternative to Tukey’s fences. In this post, we run some numerical simulations to check out the suggested outlier detector in action.


Asymmetric decile-based outlier detector, Part 1

In the previous post, I covered some problems with the outlier detector based on Tukey’s fences. Mainly, I discussed the probability of observing outliers using Tukey’s fences with different factors under different distributions. However, it’s not the only problem with this approach.

Since Tukey’s fences are based on quartiles, under multimodal distributions, we could get a situation in which 50% of all sample elements are marked as outliers. Also, Tukey’s fences are designed for symmetric distributions, so we could get strange results with asymmetric distributions.

In this post, I want to suggest an asymmetric outlier detector based on deciles that mitigates these problems.


Probability of observing outliers using Tukey's fences

Tukey’s fences is one of the most popular simple outlier detectors for one-dimensional number arrays. This approach assumes that for a given sample, we calculate the first and third quartiles (\(Q_1\) and \(Q_3\)), and mark all the sample elements outside the interval

\[[Q_1 - k (Q_3 - Q_1),\, Q_3 + k (Q_3 - Q_1)] \]

as outliers. The typical recommendation for \(k\) is \(1.5\) for “regular” outliers and \(3.0\) for “far outliers”. Here is a box plot example for a sample taken from the standard normal distribution (sample size is \(1000\)):

As we can see, 11 elements were marked as outliers (shown as dots). Is it an expected result or not? The answer depends on your goals. There is no single definition of an outlier. In fact, the chosen outlier detector provides a unique outlier definition.
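The fence rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the plot: the function names `tukey_fences` and `find_outliers` are hypothetical, and the `"inclusive"` quantile method is an assumption (different quartile estimators give slightly different fences).

```python
import statistics


def tukey_fences(sample, k=1.5):
    """Return the (lower, upper) Tukey's fences for the given sample."""
    # The quartile values depend on the quantile method; "inclusive" is an assumption.
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(sample, k=1.5):
    """Return the sample elements that lie outside Tukey's fences."""
    lower, upper = tukey_fences(sample, k)
    return [x for x in sample if x < lower or x > upper]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(sample))   # fences for "regular" outliers (k = 1.5)
print(find_outliers(sample))  # elements outside these fences
```

Passing `k=3.0` switches the same functions to the “far outliers” rule.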

In my applications, I typically consider outliers as rare events that should be investigated. When I detect too many outliers, all such reports become useless noise. For example, on the above image, I wouldn’t treat any of the sample elements as outliers. However, if we add \(10.0\) to this sample, this new element is an obvious outlier (and it will be the only one):

Thus, an important property of an outlier detector is the “false positive rate”: the percentage of samples with detected outliers which I wouldn’t treat as outliers. In this post, I perform numerical simulations that show the probability of observing outliers using Tukey’s fences with different \(k\) values.
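A rough sketch of such a simulation, assuming samples from the standard normal distribution, could look as follows. The function names, the sample size, the number of iterations, and the `"inclusive"` quantile method are all illustrative assumptions, not the settings used in the post.

```python
import random
import statistics


def has_outliers(sample, k):
    """Check whether at least one element lies outside Tukey's fences."""
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return any(x < lower or x > upper for x in sample)


def outlier_probability(k, sample_size, iterations=1000, seed=42):
    """Estimate the share of standard normal samples with at least one detected outlier."""
    rng = random.Random(seed)  # fixed seed, so repeated runs give the same estimate
    hits = 0
    for _ in range(iterations):
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        if has_outliers(sample, k):
            hits += 1
    return hits / iterations


for k in (1.5, 2.0, 2.5, 3.0):
    print(k, outlier_probability(k, sample_size=100))
```

Because every `k` is evaluated on the same seeded samples, the estimated probability cannot increase as `k` grows: wider fences can only mark fewer elements.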


Gamma effect size powered by the middle non-zero quantile absolute deviation

In previous posts, I covered the concept of the gamma effect size. It’s a nonparametric effect size which is consistent with Cohen’s d under the normal distribution. However, the original definition has a drawback: this statistic becomes zero if half of the sample elements are equal to each other. Last time, I suggested a workaround for this problem: we can replace the median absolute deviation with the quantile absolute deviation. Unfortunately, this trick requires parameter tuning: we should choose a proper quantile position to make this approach work. Today I want to suggest a strategy that provides a way to make a generic choice: we can use the middle non-zero quantile absolute deviation.


Middle non-zero quantile absolute deviation

The median absolute deviation (\(\operatorname{MAD}\)) around the median is a popular robust measure of statistical dispersion. Unfortunately, if we work with discrete distributions, we could get zero \(\operatorname{MAD}\) values. This causes problems if we use \(\operatorname{MAD}\) as a denominator. The same issue is relevant to some other quantile-based measures of dispersion like the interquartile range (\(\operatorname{IQR}\)).
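A tiny example shows how tied values drive the \(\operatorname{MAD}\) to zero. The helper `mad` below is a hypothetical, unscaled version (no consistency factor), and both samples are made up for illustration.

```python
import statistics


def mad(sample):
    """Unscaled median absolute deviation around the median."""
    center = statistics.median(sample)
    return statistics.median([abs(x - center) for x in sample])


# More than half of the elements are tied, so the median of the deviations is zero.
discrete_sample = [0, 0, 0, 0, 1, 2, 5]
# No tied values, so the deviations around the median are mostly non-zero.
continuous_sample = [1.2, 2.5, 3.1, 4.8, 6.0]

print(mad(discrete_sample))
print(mad(continuous_sample))
```

Dividing by the first value would fail, which is exactly the denominator problem mentioned above.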

This problem could be solved using the quantile absolute deviation around the median. However, it’s not always clear how to choose the right quantile to estimate. In this post, I’m going to suggest an approach to this choice that is consistent with the classic \(\operatorname{MAD}\) under continuous distributions (and samples without tied values).


Unbiased median absolute deviation based on the trimmed Harrell-Davis quantile estimator

The median absolute deviation (\(\operatorname{MAD}\)) is a robust measure of scale. For a sample \(x = \{ x_1, x_2, \ldots, x_n \}\), it’s defined as follows:

\[\operatorname{MAD}_n = C_n \cdot \operatorname{median}(|x - \operatorname{median}(x)|) \]

where \(\operatorname{median}\) is a median estimator and \(C_n\) is a scale factor. Using the right scale factor, we can use \(\operatorname{MAD}\) as a consistent estimator of the standard deviation under the normal distribution. For huge samples, we can use the asymptotic value of \(C_n\) which is

\[C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056. \]

For small samples, we should use adjusted values \(C_n\) which depend on the sample size. However, \(C_n\) depends not only on the sample size but also on the median estimator. I have already covered how to obtain these values for the traditional median estimator and the Harrell-Davis median estimator. It’s time to get the \(C_n\) values for the trimmed Harrell-Davis median estimator.
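As a minimal sketch of the definition above, here is the \(\operatorname{MAD}\) with the asymptotic factor \(C_\infty\) and the traditional sample median. It does not use the trimmed Harrell-Davis estimator or the small-sample \(C_n\) values discussed in the post, and the names `C_INF` and `mad_asymptotic` are hypothetical.

```python
import random
import statistics

# Asymptotic scale factor: 1 / Phi^{-1}(3/4), where Phi is the standard normal CDF.
C_INF = 1.0 / statistics.NormalDist().inv_cdf(0.75)


def mad_asymptotic(sample):
    """MAD scaled by the asymptotic factor; uses the traditional sample median."""
    center = statistics.median(sample)
    return C_INF * statistics.median([abs(x - center) for x in sample])


print(C_INF)

# For a large normal sample, the scaled MAD should be close to the true standard deviation.
rng = random.Random(1)
large_sample = [rng.gauss(0.0, 2.0) for _ in range(10_000)]
print(mad_asymptotic(large_sample))
```

For small samples, this asymptotic factor is only an approximation, which is why the adjusted \(C_n\) values are needed.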


Median absolute deviation vs. Shamos estimator

There are multiple ways to estimate statistical dispersion. The standard deviation is the most popular one, but it’s not robust: a single outlier could heavily corrupt the results. Fortunately, we have robust measures of dispersion like the median absolute deviation and the Shamos estimator. In this post, we perform numerical simulations and compare these two estimators on different distributions and sample sizes.
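To make the robustness claim concrete, here is a small sketch that compares the three measures on a normal sample contaminated with one extreme value. It assumes the usual textbook definitions: the Shamos estimator as the scaled median of all pairwise absolute differences, and asymptotic consistency factors for the normal distribution. The function names, the sample, and the use of asymptotic (rather than small-sample) factors are illustrative assumptions.

```python
import itertools
import math
import random
import statistics

PHI_INV_075 = statistics.NormalDist().inv_cdf(0.75)
MAD_FACTOR = 1.0 / PHI_INV_075                        # asymptotic factor for MAD
SHAMOS_FACTOR = 1.0 / (math.sqrt(2.0) * PHI_INV_075)  # asymptotic factor for Shamos


def mad(sample):
    """Scaled median absolute deviation around the median."""
    center = statistics.median(sample)
    return MAD_FACTOR * statistics.median([abs(x - center) for x in sample])


def shamos(sample):
    """Scaled median of all pairwise absolute differences |x_i - x_j|, i < j."""
    diffs = [abs(a - b) for a, b in itertools.combinations(sample, 2)]
    return SHAMOS_FACTOR * statistics.median(diffs)


rng = random.Random(1)
contaminated = [rng.gauss(0.0, 1.0) for _ in range(200)] + [1000.0]

print("stdev: ", statistics.stdev(contaminated))  # inflated by the single outlier
print("MAD:   ", mad(contaminated))               # stays close to 1
print("Shamos:", shamos(contaminated))            # stays close to 1
```

Note that this naive Shamos implementation enumerates all pairs, so it takes quadratic time and memory in the sample size.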
