Sheather & Jones vs. unbiased cross-validation



In the post about the importance of kernel density estimation bandwidth, I reviewed several bandwidth selectors and showed their impact on the KDE. The classic selectors like Scott’s rule of thumb or Silverman’s rule of thumb are designed for the normal distribution and perform poorly in non-parametric cases. One of the most significant caveats is that they can mask multimodality. The same problem is also relevant to the biased cross-validation method. Among all the bandwidth selectors available in R, only Sheather & Jones and unbiased cross-validation provide reliable results in the multimodal case. However, I advocate using the Sheather & Jones method rather than the unbiased cross-validation approach.

In this post, I will show the drawbacks of the unbiased cross-validation method and what kind of problems we can get if we use it as a KDE bandwidth selector.
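To illustrate how a normality-based selector can mask multimodality, here is a minimal NumPy sketch with a hand-rolled Gaussian KDE and Silverman’s rule of thumb; the small comparison bandwidth 0.3 is an arbitrary choice for contrast, not one of the selectors discussed in the post:

```python
import numpy as np

def make_kde(sample, h):
    """Build a Gaussian KDE with bandwidth h; returns a density function."""
    def density(x):
        x = np.atleast_1d(x)
        u = (x[:, None] - sample[None, :]) / h
        return np.exp(-0.5 * u**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))
    return density

rng = np.random.default_rng(42)
# A clearly bimodal sample: two well-separated normal components.
sample = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 1, 500)])

# Silverman's rule of thumb: h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5).
sd = sample.std(ddof=1)
iqr = np.subtract(*np.percentile(sample, [75, 25]))
h_silverman = 0.9 * min(sd, iqr / 1.34) * len(sample) ** (-0.2)

f_silverman = make_kde(sample, h_silverman)
f_small = make_kde(sample, 0.3)  # small enough to keep both modes visible

# Silverman noticeably over-smooths the valley between the two modes:
print(h_silverman, f_silverman(5.0)[0], f_small(5.0)[0])
```

The large sample standard deviation of the mixture inflates Silverman’s bandwidth, so the valley around \(x = 5\) gets a much higher estimated density than it should, hiding the bimodal structure.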


Read more


Resistance to the low-density regions: the Harrell-Davis median



In the previous post, we defined the resistance function, which shows the sensitivity of a given estimator to low-density regions. We also showed the resistance function plots for the mean and the sample median. In this post, we explore the corresponding plots for the Harrell-Davis median.
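For reference, the Harrell-Davis median is an L-estimator that averages all order statistics with Beta-distribution weights. A minimal SciPy sketch, assuming the standard definition with parameters \(a = b = (n+1)/2\) for the median:

```python
import numpy as np
from scipy.stats import beta

def harrell_davis(sample, p=0.5):
    """Harrell-Davis quantile estimator: a weighted sum of all order
    statistics with Beta((n+1)p, (n+1)(1-p)) weights."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    a, b = (n + 1) * p, (n + 1) * (1 - p)
    cdf = beta.cdf(np.arange(n + 1) / n, a, b)
    weights = np.diff(cdf)  # W_i = I_{i/n}(a, b) - I_{(i-1)/n}(a, b)
    return float(weights @ x)

print(harrell_davis([1, 2, 3, 4, 5]))  # ≈ 3 (symmetric sample)
```

For a symmetric sample, the symmetric weights reproduce the center exactly; on asymmetric samples the estimate generally differs from the classic sample median.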


Read more


Resistance to the low-density regions: the mean and the median



When we discuss resistant statistics, we typically assume resistance to extreme values. However, extreme values are not the only source of problems that can violate the usual assumptions about the expected metric distribution. Low-density regions, which often arise in multimodal distributions, can also corrupt the results of statistical analysis. In this post, I discuss this problem and introduce a measure of resistance to low-density regions.


Read more


Finite-sample Gaussian efficiency of the trimmed Harrell-Davis median estimator



In the previous post, we obtained the finite-sample Gaussian efficiency values of the sample median and the Harrell-Davis median. In this post, we extend these results and obtain the finite-sample Gaussian efficiency values of the trimmed Harrell-Davis median estimator based on the highest density interval of width \(1/\sqrt{n}\).


Read more


Finite-sample Gaussian efficiency of the Harrell-Davis median estimator



In this post, we explore finite-sample and asymptotic Gaussian efficiency values of the sample median and the Harrell-Davis median.
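Finite-sample Gaussian efficiency can be estimated by straightforward Monte-Carlo simulation: the ratio of the variance of the sample mean to the variance of the competing estimator over repeated draws from the standard normal distribution. A minimal sketch for the sample median (the repetition count and seed are arbitrary choices):

```python
import numpy as np

def gaussian_efficiency(estimator, n, repetitions=10_000, seed=42):
    """Monte-Carlo estimate of the finite-sample Gaussian efficiency of
    `estimator` relative to the sample mean for samples of size n."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((repetitions, n))
    var_mean = samples.mean(axis=1).var()
    var_est = np.apply_along_axis(estimator, 1, samples).var()
    return var_mean / var_est

eff = gaussian_efficiency(np.median, n=100)
print(eff)  # near the asymptotic value 2/pi ≈ 0.64
```

The same harness works for any estimator that maps a sample to a scalar, so the Harrell-Davis median can be plugged in without changing the simulation code.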


Read more


Weighted quantile estimation for a weighted mixture distribution



Let \(\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}\) be a sample of size \(n\). We assign non-negative weight coefficients \(w_i\) with a positive sum for all sample elements:

\[\mathbf{w} = \{ w_1, w_2, \ldots, w_n \}, \quad w_i \geq 0, \quad \sum_{i=1}^{n} w_i > 0. \]

For simplicity, we also consider normalized (standardized) weights \(\overline{\mathbf{w}}\):

\[\overline{\mathbf{w}} = \{ \overline{w}_1, \overline{w}_2, \ldots, \overline{w}_n \}, \quad \overline{w}_i = \frac{w_i}{\sum_{i=1}^{n} w_i}. \]

In the non-weighted case, we can consider a quantile estimator \(\operatorname{Q}(\mathbf{x}, p)\) that estimates the \(p^\textrm{th}\) quantile of the underlying distribution. We want to build a weighted quantile estimator \(\operatorname{Q}(\mathbf{x}, \mathbf{w}, p)\) so that we can estimate the quantiles of a weighted sample.

In this post, we consider a specific problem of estimating quantiles of a weighted mixture distribution.
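As one possible baseline (not necessarily the estimator developed in this post), a weighted quantile can be obtained by inverting the weighted ECDF built from the normalized weights \(\overline{w}_i\):

```python
import numpy as np

def weighted_quantile(x, w, p):
    """A minimal weighted quantile sketch: invert the step-function
    weighted ECDF built from normalized weights (smoother weighted
    estimators are possible)."""
    x, w = np.asarray(x, dtype=float), np.asarray(w, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    w = w / w.sum()         # normalized weights
    cdf = np.cumsum(w)      # weighted ECDF at the order statistics
    return x[np.searchsorted(cdf, p)]

# With equal weights this degenerates to an ordinary (Type 1) sample quantile:
print(weighted_quantile([1, 2, 3, 4], [1, 1, 1, 1], 0.5))  # 2.0
# Upweighting the right tail shifts the median:
print(weighted_quantile([1, 2, 3, 4], [1, 1, 1, 5], 0.5))  # 4.0
```

This step-function estimator inherits the drawbacks of the classic Type 1 quantile; the mixture-distribution setting of the post calls for a more careful construction.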


Read more


Preprint announcement: 'Finite-sample Rousseeuw-Croux scale estimators'



Recently, I published a preprint of a paper ‘Finite-sample Rousseeuw-Croux scale estimators’. It’s based on a series of my research notes.

The paper preprint is available on arXiv: arXiv:2209.12268 [stat.ME]. The paper source code is available on GitHub: AndreyAkinshin/paper-frc. You can cite it as follows:

  • Andrey Akinshin (2022) “Finite-sample Rousseeuw-Croux scale estimators” arXiv:2209.12268

Abstract:

The Rousseeuw-Croux \(S_n\), \(Q_n\) scale estimators and the median absolute deviation \(\operatorname{MAD}_n\) can be used as consistent estimators for the standard deviation under normality. All of them are highly robust: the breakdown point of all three estimators is \(50\%\). However, \(S_n\) and \(Q_n\) are much more efficient than \(\operatorname{MAD}_n\): their asymptotic Gaussian efficiency values are \(58\%\) and \(82\%\) respectively compared to \(37\%\) for \(\operatorname{MAD}_n\). Although these values look impressive, they are only asymptotic values. The actual Gaussian efficiency of \(S_n\) and \(Q_n\) for small sample sizes is noticeably lower than in the asymptotic case.

The original work by Rousseeuw and Croux (1993) provides only rough approximations of the finite-sample bias-correction factors for \(S_n,\, Q_n\) and brief notes on their finite-sample efficiency values. In this paper, we perform extensive Monte-Carlo simulations in order to obtain refined values of the finite-sample properties of the Rousseeuw-Croux scale estimators. We present accurate values of the bias-correction factors and Gaussian efficiency for small samples (\(n \leq 100\)) and prediction equations for samples of larger sizes.
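A naive sketch of \(\operatorname{MAD}_n\) and \(S_n\) using only the asymptotic consistency constants (\(\approx 1.4826\) and \(\approx 1.1926\)); the finite-sample bias-correction factors, which are the subject of the paper, are deliberately omitted:

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled by ~1.4826 so that it is a
    consistent estimator of the standard deviation under normality."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def rousseeuw_croux_sn(x):
    """Naive O(n^2) Rousseeuw-Croux S_n:
    c * lomed_i himed_j |x_i - x_j| with the asymptotic c ≈ 1.1926."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :])
    highmeds = np.sort(diffs, axis=1)[:, n // 2]         # high median per row
    return 1.1926 * np.sort(highmeds)[(n + 1) // 2 - 1]  # low median over rows

rng = np.random.default_rng(42)
sample = rng.standard_normal(1000)
print(mad(sample), rousseeuw_croux_sn(sample))  # both near 1 under normality
```

For small \(n\), both estimators are noticeably biased without the finite-sample correction factors, which is exactly the gap the paper addresses.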


Read more


Sensitivity curve of the Harrell-Davis quantile estimator, Part 3



In the previous posts (1, 2), I explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal distribution, the exponential distribution, and the Cauchy distribution. In this post, I build these sensitivity curves for some additional distributions.


Read more


Sensitivity curve of the Harrell-Davis quantile estimator, Part 2



In the previous post, I explored the sensitivity curves of the Harrell-Davis quantile estimator on the normal distribution. In this post, I continue the same investigation on the exponential and Cauchy distributions.


Read more


Sensitivity curve of the Harrell-Davis quantile estimator, Part 1



The Harrell-Davis quantile estimator is an efficient replacement for the traditional quantile estimator, especially in the case of light-tailed distributions. Unfortunately, it is not robust: its breakdown point is zero. However, the breakdown point is not the only descriptor of robustness. While the breakdown point describes the portion of the distribution that should be replaced by arbitrarily large values to corrupt the estimation, it does not describe the actual impact of finite outliers. The arithmetic mean also has a breakdown point of zero, but the practical robustness of the mean and the Harrell-Davis quantile estimator is not the same. The Harrell-Davis quantile estimator is an L-estimator that assigns extremely low weights to sample elements near the tails (especially for reasonably large sample sizes). Therefore, the actual impact of potential outliers is not so noticeable. In this post, we use the standardized sensitivity curve to evaluate this impact.
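The underlying (non-standardized) sensitivity curve can be sketched as follows; the standardized variant used in the post additionally rescales it, which is omitted here:

```python
import numpy as np

def sensitivity_curve(estimator, base_sample, x):
    """Sensitivity curve: the scaled change in the estimate when one
    additional observation x is appended to base_sample."""
    n = len(base_sample) + 1
    return n * (estimator(np.append(base_sample, x)) - estimator(base_sample))

base = np.arange(1.0, 11.0)  # 1..10; the probe x becomes the 11th observation
for x in (5.0, 1000.0):
    print(sensitivity_curve(np.mean, base, x),
          sensitivity_curve(np.median, base, x))
```

For the mean the curve is unbounded (it grows linearly in \(x\)), while for the median it stays bounded no matter how extreme the probe value is; the same harness applies directly to the Harrell-Davis quantile estimator.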


Read more