Preprint announcement: 'Finite-sample Rousseeuw-Croux scale estimators'


Recently, I published a preprint of a paper ‘Finite-sample Rousseeuw-Croux scale estimators’. It’s based on a series of my research notes.

The paper preprint is available on arXiv: arXiv:2209.12268 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-frc. You can cite it as follows:

  • Andrey Akinshin (2022) “Finite-sample Rousseeuw-Croux scale estimators” arXiv:2209.12268

Abstract:

The Rousseeuw-Croux $S_n$, $Q_n$ scale estimators and the median absolute deviation $\operatorname{MAD}_n$ can be used as consistent estimators for the standard deviation under normality. All of them are highly robust: the breakdown point of all three estimators is $50\%$. However, $S_n$ and $Q_n$ are much more efficient than $\operatorname{MAD}_n$: their asymptotic Gaussian efficiency values are $58\%$ and $82\%$ respectively compared to $37\%$ for $\operatorname{MAD}_n$. Although these values look impressive, they are only asymptotic values. The actual Gaussian efficiency of $S_n$ and $Q_n$ for small sample sizes is noticeably lower than in the asymptotic case.

The original work by Rousseeuw and Croux (1993) provides only rough approximations of the finite-sample bias-correction factors for $S_n,\, Q_n$ and brief notes on their finite-sample efficiency values. In this paper, we perform extensive Monte-Carlo simulations in order to obtain refined values of the finite-sample properties of the Rousseeuw-Croux scale estimators. We present accurate values of the bias-correction factors and Gaussian efficiency for small samples ($n \leq 100$) and prediction equations for samples of larger sizes.
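
As a rough illustration of the Monte-Carlo approach, the sketch below estimates a finite-sample bias-correction factor for $\operatorname{MAD}_n$. It uses $\operatorname{MAD}_n$ only because it needs nothing beyond the median, whereas the paper targets $S_n$ and $Q_n$. The function names, the number of repetitions, and the seed are assumptions made for this example, not the setup used in the paper.

```python
import random
import statistics


def mad_raw(xs):
    """Unscaled median absolute deviation around the median."""
    m = statistics.median(xs)
    return statistics.median([abs(x - m) for x in xs])


def bias_correction_factor(n, repetitions=50_000, seed=1729):
    """Monte-Carlo estimate of C_n such that E[C_n * MAD_n] = 1 for N(0, 1) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += mad_raw(sample)
    # The factor is the reciprocal of the simulated mean of the unscaled estimator.
    return repetitions / total


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, round(bias_correction_factor(n, repetitions=20_000), 4))
```

For $n = 2$, the unscaled estimator equals $|x_1 - x_2| / 2$ with expectation $1/\sqrt{\pi}$, so the simulated factor should land close to $\sqrt{\pi} \approx 1.7725$, which gives a quick sanity check of the simulation.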


Sensitivity curve of the Harrell-Davis quantile estimator, Part 3


In the previous two posts (Part 1 and Part 2), I explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal, exponential, and Cauchy distributions. In this post, I build these sensitivity curves for some additional distributions.


Sensitivity curve of the Harrell-Davis quantile estimator, Part 2


In the previous post, I explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal distribution. In this post, I continue the same investigation on the exponential and Cauchy distributions.


Sensitivity curve of the Harrell-Davis quantile estimator, Part 1


The Harrell-Davis quantile estimator is an efficient replacement for the traditional quantile estimator, especially in the case of light-tailed distributions. Unfortunately, it is not robust: its breakdown point is zero. However, the breakdown point is not the only descriptor of robustness. While the breakdown point describes the portion of the distribution that should be replaced by arbitrarily large values to corrupt the estimation, it does not describe the actual impact of finite outliers. The arithmetic mean also has a breakdown point of zero, but the practical robustness of the mean is not the same as that of the Harrell-Davis quantile estimator. The Harrell-Davis quantile estimator is an L-estimator that assigns extremely low weights to sample elements near the tails (especially for reasonably large sample sizes). Therefore, the actual impact of potential outliers is not so noticeable. In this post, we use the standardized sensitivity curve to evaluate this impact.
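
To make the comparison with the mean concrete, here is a minimal sketch that evaluates a sensitivity curve for both estimators. It assumes the textbook definition $(n+1) \cdot (T(x_1, \ldots, x_n, x_0) - T(x_1, \ldots, x_n))$, which may differ in details from the standardized curve used in the post. The stylized normal sample, the Simpson-rule approximation of the beta CDF, and all function names are choices made for this example.

```python
import math
from statistics import NormalDist


def beta_cdf(x, a, b, steps=2000):
    """Regularized incomplete beta function I_x(a, b) by Simpson's rule.

    Assumes a > 1 and b > 1, so the density vanishes at both endpoints.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)

    def pdf(t):
        if t <= 0.0 or t >= 1.0:
            return 0.0
        return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return total * h / 3


def hd_quantile(xs, p):
    """Harrell-Davis estimate: a weighted sum of all order statistics."""
    n = len(xs)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    cdf = [beta_cdf(i / n, a, b) for i in range(n + 1)]
    return sum((cdf[i + 1] - cdf[i]) * x for i, x in enumerate(sorted(xs)))


def sensitivity(estimator, xs, x0):
    """Textbook sensitivity curve: (n + 1) * (T(x_1..x_n, x0) - T(x_1..x_n))."""
    n = len(xs)
    return (n + 1) * (estimator(xs + [x0]) - estimator(xs))


def mean(xs):
    return sum(xs) / len(xs)


# Stylized standard normal sample of size 20 (an assumption for this sketch).
n = 20
sample = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

for x0 in (1.0, 10.0, 100.0):
    sc_mean = sensitivity(mean, sample, x0)
    sc_hd = sensitivity(lambda xs: hd_quantile(xs, 0.5), sample, x0)
    print(f"x0={x0:6.1f}  mean: {sc_mean:8.3f}  HD median: {sc_hd:6.3f}")
```

The curve of the mean grows linearly with $x_0$, while the curve of the Harrell-Davis median stays nearly flat once $x_0$ is the largest element, because the weight of the extreme order statistic is tiny for this sample size.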


Weighted quantile estimators for exponential smoothing and mixture distributions


There are various ways to estimate quantiles of weighted samples. The choice of the most appropriate weighted quantile estimator depends not only on the estimator's own properties but also on the goal.

Let us consider two problems:

  1. Estimating quantiles of a weighted mixture distribution.
    In this problem, we have a weighted mixture distribution given by $F = \sum_{i=1}^m w_i F_i$. We collect samples $\mathbf{x_1}, \mathbf{x_2}, \ldots, \mathbf{x_m}$ from $F_1, F_2, \ldots, F_m$, and want to estimate the quantile function $F^{-1}$ of the mixture distribution based on the given samples.
  2. Quantile exponential smoothing.
    In this problem, we have a time series $\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$. We want to describe the distribution “at the end” of this time series. The latest series element $x_n$ is the most up-to-date one, but we cannot build a distribution based on a single element. Therefore, we have to consider more elements at the end of $\mathbf{x}$. However, if we take too many elements, we may distort the estimates with obsolete measurements. To resolve this problem, we can assign weights to all elements according to the exponential law and estimate weighted quantiles.

In both problems, the usage of weighted quantile estimators looks like a reasonable solution. However, in each problem, we have different expectations of the estimator behavior. In this post, we provide an example that illustrates the difference in these expectations.
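
The quantile exponential smoothing setup can be sketched as follows. This is a minimal illustration that inverts the weighted empirical CDF, which is only one of many possible weighted quantile estimators and not necessarily the one discussed in the post. The `half_life` parameterization of the exponential law and the function names are assumptions made for this example.

```python
def exponential_weights(n, half_life):
    """Weights that halve every `half_life` steps back from the newest element."""
    return [2.0 ** (-(n - 1 - i) / half_life) for i in range(n)]


def weighted_quantile(xs, ws, p):
    """Invert the weighted empirical CDF: smallest x whose cumulative weight reaches p."""
    total = sum(ws)
    cumulative = 0.0
    for x, w in sorted(zip(xs, ws)):
        cumulative += w / total
        if cumulative >= p:
            return x
    return max(xs)  # guards against rounding when p is close to 1


# A series whose level shifts from 0 to 10 near the end.
series = [0.0] * 15 + [10.0] * 5
uniform = [1.0] * len(series)
recent = exponential_weights(len(series), half_life=2)

print(weighted_quantile(series, uniform, 0.5))  # median over the whole history
print(weighted_quantile(series, recent, 0.5))   # median "at the end" of the series
```

With uniform weights the median still reports the old level of the series, while the exponential weights let the five most recent elements dominate, so the estimate follows the new level.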


The Huggins-Roy family of effective sample sizes


When we work with weighted samples, it’s essential to introduce adjustments for the sample size. Indeed, let’s consider the following two weighted samples:

$$ \mathbf{x}_1 = \{ x_1, x_2, \ldots, x_n \}, \quad \mathbf{w}_1 = \{ w_1, w_2, \ldots, w_n \}, $$ $$ \mathbf{x}_2 = \{ x_1, x_2, \ldots, x_n, x_{n+1} \}, \quad \mathbf{w}_2 = \{ w_1, w_2, \ldots, w_n, 0 \}. $$

Since the weight of $x_{n+1}$ in the second sample is zero, it’s natural to expect that both samples have the same set of properties. However, there is a major difference between $\mathbf{x}_1$ and $\mathbf{x}_2$: their sample sizes, which are $n$ and $n+1$. In order to eliminate this difference, we typically introduce the effective sample size (ESS), which is estimated based on the list of weights.

There are various ways to estimate the ESS. In this post, we briefly discuss the Huggins-Roy family of ESS.
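
A minimal sketch of the idea: Kish's ESS, $(\sum w_i)^2 / \sum w_i^2$, is a common choice, and the snippet also includes one parameterization of the Huggins-Roy family, $\operatorname{ESS}_\beta = \bigl(\sum \bar{w}_i^\beta\bigr)^{1/(1-\beta)}$ over normalized weights $\bar{w}_i$. This excerpt does not give the formula, so that parameterization is an assumption taken from the ESS literature and should be checked against the full post. The function names are also made up for this example.

```python
import math


def kish_ess(weights):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


def huggins_roy_ess(weights, beta):
    """ESS_beta = (sum of normalized w^beta)^(1 / (1 - beta)); assumed parameterization."""
    total = sum(weights)
    ws = [w / total for w in weights if w > 0]  # zero weights carry no information
    if beta == 1:  # limit case: exponential of the Shannon entropy
        return math.exp(-sum(w * math.log(w) for w in ws))
    if math.isinf(beta):  # limit case: reciprocal of the largest weight
        return 1.0 / max(ws)
    return sum(w ** beta for w in ws) ** (1.0 / (1.0 - beta))


w1 = [0.5, 0.3, 0.2]
w2 = w1 + [0.0]  # same sample with an extra zero-weight element
for beta in (0, 0.5, 1, 2, math.inf):
    print(beta, huggins_roy_ess(w1, beta), huggins_roy_ess(w2, beta))
```

Under this parameterization, $\beta = 2$ reproduces Kish's ESS, and appending a zero-weight element leaves every member of the family unchanged, which is the property motivated by the two samples above.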


Finite-sample bias correction factors for Rousseeuw-Croux scale estimators


The Rousseeuw-Croux scale estimators $S_n$ and $Q_n$ are efficient alternatives to the median absolute deviation ($\operatorname{MAD}_n$). While all three estimators have the same breakdown point of $50\%$, $S_n$ and $Q_n$ have higher statistical efficiency than $\operatorname{MAD}_n$. The asymptotic Gaussian efficiency values of $\operatorname{MAD}_n$, $S_n$, and $Q_n$ are $37\%$, $58\%$, and $82\%$ respectively.

Using scale constants, we can make $S_n$ and $Q_n$ consistent estimators for the standard deviation under normality. The asymptotic values of these constants are well-known. However, for finite samples, only approximate scale constants are known. In this post, we provide refined values of these constants with higher accuracy.
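
For reference, here is a naive quadratic-time sketch of the three estimators with their asymptotic scale constants. The definitions (low and high medians for $S_n$, the $k$-th smallest pairwise distance for $Q_n$) and the constants $1.4826$, $1.1926$, and $2.2219$ are written as I recall them from Rousseeuw and Croux (1993), so treat them as assumptions to verify. The refined finite-sample factors from the post are not included.

```python
import statistics
from itertools import combinations


def order_statistic(values, k):
    """k-th smallest value (1-based)."""
    return sorted(values)[k - 1]


def mad(xs, constant=1.4826):
    """Median absolute deviation around the median, scaled for normality."""
    m = statistics.median(xs)
    return constant * statistics.median([abs(x - m) for x in xs])


def sn(xs, constant=1.1926):
    """Naive S_n: low median over i of the high median over j of |x_i - x_j|."""
    n = len(xs)
    himed, lomed = n // 2 + 1, (n + 1) // 2
    inner = [order_statistic([abs(xi - xj) for xj in xs], himed) for xi in xs]
    return constant * order_statistic(inner, lomed)


def qn(xs, constant=2.2219):
    """Naive Q_n: the k-th smallest pairwise distance, k = C(h, 2), h = floor(n/2) + 1."""
    h = len(xs) // 2 + 1
    k = h * (h - 1) // 2
    distances = [abs(a - b) for a, b in combinations(xs, 2)]
    return constant * order_statistic(distances, k)


data = [1, 2, 3, 4, 5]
print(mad(data), sn(data), qn(data))
```

A finite-sample correction would be applied by passing a sample-size-dependent value as `constant` instead of the asymptotic default.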


Preprint announcement: 'Quantile absolute deviation'


I have just published a preprint of a paper ‘Quantile absolute deviation’. It’s based on a series of my research notes that I have been writing since December 2020.

The paper preprint is available on arXiv: arXiv:2208.13459 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-qad. You can cite it as follows:

  • Andrey Akinshin (2022) “Quantile absolute deviation” arXiv:2208.13459

Abstract:

The median absolute deviation (MAD) is a popular robust measure of statistical dispersion. However, when it is applied to non-parametric distributions (especially multimodal, discrete, or heavy-tailed), lots of statistical inference issues arise. Even when it is applied to distributions with slight deviations from normality and these issues are not relevant, the Gaussian efficiency of the MAD is only 37%, which is not always enough.

In this paper, we introduce the quantile absolute deviation (QAD) as a generalization of the MAD. This measure of dispersion provides a flexible approach to analyzing properties of non-parametric distributions. It also allows controlling the trade-off between robustness and statistical efficiency. We use the trimmed Harrell-Davis median estimator based on the highest density interval of the given width as a complementary median estimator that gives increased finite-sample Gaussian efficiency compared to the sample median and a breakdown point matched to the QAD.

As a rule of thumb, we suggest using two new measures of dispersion called the standard QAD and the optimal QAD. They give 54% and 65% of Gaussian efficiency with breakdown points of 32% and 14% respectively.


Standard trimmed Harrell-Davis median estimator


In one of the previous posts, I suggested a new measure of dispersion called the standard quantile absolute deviation around the median ($\operatorname{SQAD}$), which can be used as an alternative to the median absolute deviation ($\operatorname{MAD}$) as a consistent estimator for the standard deviation under normality. The Gaussian efficiency of $\operatorname{SQAD}$ is $54\%$ (compared to $37\%$ for MAD), and its breakdown point is $32\%$ (compared to $50\%$ for MAD). $\operatorname{SQAD}$ is a symmetric dispersion measure around the median: the interval $[\operatorname{Median} - \operatorname{SQAD}; \operatorname{Median} + \operatorname{SQAD}]$ covers $68\%$ of the distribution. In the case of the normal distribution, this corresponds to the interval $[\mu - \sigma; \mu + \sigma]$.

If we use $\operatorname{SQAD}$, we accept the breakdown point of $32\%$. This makes the sample median a non-optimal choice for the median estimator. Indeed, the sample median has high robustness (the breakdown point is $50\%$), but relatively poor Gaussian efficiency. If we use $\operatorname{SQAD}$, it doesn’t make sense to require a breakdown point of more than $32\%$. Therefore, we could trade the median robustness for efficiency and come up with a complementary measure of the median for $\operatorname{SQAD}$.

In this post, we introduce the standard trimmed Harrell-Davis median estimator, which shares the breakdown point with $\operatorname{SQAD}$ and provides better finite-sample efficiency compared to the sample median.


Optimal quantile absolute deviation


We consider the quantile absolute deviation around the median defined as follows:

$$ \newcommand{\E}{\mathbb{E}} \newcommand{\PR}{\mathbb{P}} \newcommand{\Q}{\operatorname{Q}} \newcommand{\OQAD}{\operatorname{OQAD}} \newcommand{\QAD}{\operatorname{QAD}} \newcommand{\median}{\operatorname{median}} \newcommand{\Exp}{\operatorname{Exp}} \newcommand{\SD}{\operatorname{SD}} \newcommand{\V}{\mathbb{V}} \QAD(X, p) = K_p \Q(|X - \median(X)|, p), $$

where $\Q$ is a quantile estimator, and $K_p$ is a scale constant which we use to make $\QAD(X, p)$ an asymptotically consistent estimator for the standard deviation under the normal distribution.

In this post, we obtain the exact values of $K_p$, derive the corresponding equation for the asymptotic Gaussian efficiency of $\QAD(X, p)$, and find the value of $p$ at which $\QAD(X, p)$ achieves the highest Gaussian efficiency.
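
The scale constant can be illustrated with a short sketch. For a standard normal $Z$, $\mathbb{P}(|Z| \leq q) = 2\Phi(q) - 1$, so the $p$-th quantile of $|Z|$ is $\Phi^{-1}((p+1)/2)$, and its reciprocal is the constant that makes the estimator consistent for the standard deviation. This follows from the definition above rather than from the derivation in the full post, and the function name is an assumption made for this example.

```python
from statistics import NormalDist


def qad_scale_constant(p):
    """K_p = 1 / Phi^{-1}((p + 1) / 2): reciprocal of the p-th quantile of |Z|, Z ~ N(0, 1)."""
    return 1.0 / NormalDist().inv_cdf((p + 1) / 2)


# p = 0.5 corresponds to the median absolute deviation.
print(qad_scale_constant(0.5))

# p = P(|Z| <= 1) is the coverage of the interval [mu - sigma; mu + sigma].
p_one_sigma = 2 * NormalDist().cdf(1) - 1
print(p_one_sigma, qad_scale_constant(p_one_sigma))
```

For $p = 0.5$ this recovers the familiar MAD constant of about $1.4826$, and for $p \approx 0.68$ the constant is $1$, so no scaling is needed.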
