## Central limit theorem and log-normal distribution

It is inconvenient to work with samples from a distribution of unknown form. Therefore, researchers often switch to the sample mean and hope that, thanks to the central limit theorem, the distribution of the sample means is approximately normal. It is often said that if we consider samples of size $$n \geq 30$$, we can expect practically acceptable convergence to normality thanks to the Berry–Esseen theorem. Indeed, this statement holds reasonably well for many real data sets. However, we can expect this approach to work only for light-tailed distributions. In the case of heavy-tailed distributions, convergence to normality is so slow that we cannot rely on the normality assumption for the distribution of the sample means. In this post, I provide an illustration of this effect using the log-normal distribution.
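The effect described above can be sketched with a small simulation. This is a minimal illustration, not the experiment from the post: the sample size $$n = 30$$ comes from the text, while the log-normal parameters ($$\mu = 0$$, $$\sigma = 1$$), the number of replications, the seed, and the use of skewness as a normality diagnostic are assumptions made here. A normal distribution has zero skewness, so a clearly positive skewness of the simulated sample means indicates that they are still far from normal.

```python
import random


def skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skewness_of_sample_means(draw, n=30, replications=5000, seed=1729):
    """Simulate `replications` sample means of size `n` and return their skewness.

    `draw` takes a random.Random instance and returns a single observation.
    """
    rng = random.Random(seed)
    means = [sum(draw(rng) for _ in range(n)) / n for _ in range(replications)]
    return skewness(means)


# Light-tailed baseline: standard normal observations.
skew_normal = skewness_of_sample_means(lambda rng: rng.gauss(0.0, 1.0))
# Heavy-tailed case: log-normal observations (mu = 0, sigma = 1 are assumed values).
skew_lognormal = skewness_of_sample_means(lambda rng: rng.lognormvariate(0.0, 1.0))

print(f"skewness of sample means, normal:     {skew_normal:.3f}")
print(f"skewness of sample means, log-normal: {skew_lognormal:.3f}")
```

Increasing `sigma` in `lognormvariate` makes the tail heavier and, in this sketch, should make the departure from normality more pronounced for the same $$n$$.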

## Hodges-Lehmann Gaussian efficiency: location shift vs. shift of locations

Let us consider two samples $$\mathbf{x} = (x_1, x_2, \ldots, x_n)$$ and $$\mathbf{y} = (y_1, y_2, \ldots, y_m)$$. The one-sample Hodges-Lehmann location estimator is defined as the median of the Walsh (pairwise) averages:

$\operatorname{HL}(\mathbf{x}) = \underset{1 \leq i \leq j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right), \quad \operatorname{HL}(\mathbf{y}) = \underset{1 \leq i \leq j \leq m}{\operatorname{median}} \left(\frac{y_i + y_j}{2} \right).$

For these two samples, we can also define the shift between these two estimates:

$\Delta_{\operatorname{HL}}(\mathbf{x}, \mathbf{y}) = \operatorname{HL}(\mathbf{x}) - \operatorname{HL}(\mathbf{y}).$

The two-sample Hodges-Lehmann location shift estimator is defined as the median of pairwise differences:

$\operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right).$

Previously, I compared the location shift estimator with the difference of median estimators (1, 2). In this post, I compare the difference between two location estimates with the shift estimate in terms of Gaussian efficiency. Before I started this study, I expected that $$\operatorname{HL}$$ should be more efficient than $$\Delta_{\operatorname{HL}}$$. Let us find out if my intuition is correct or not!
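The three estimators defined above translate almost directly into code, and their Gaussian efficiency can be approximated by simulation. The following is a minimal sketch rather than the study itself: the function names, the sample sizes, the number of replications, the seed, and the choice of the difference of sample means as the efficiency baseline are all assumptions made for illustration. The pairwise computations are quadratic in the sample size, which is fine for small samples.

```python
import random
from statistics import median, variance


def hl(x):
    """One-sample Hodges-Lehmann estimator: median of the Walsh averages (i <= j)."""
    n = len(x)
    return median((x[i] + x[j]) / 2 for i in range(n) for j in range(i, n))


def hl_shift(x, y):
    """Two-sample Hodges-Lehmann shift estimator: median of pairwise differences."""
    return median(xi - yj for xi in x for yj in y)


def delta_hl(x, y):
    """Difference between the two one-sample Hodges-Lehmann estimates."""
    return hl(x) - hl(y)


def gaussian_efficiency(estimator, n, m, replications=2000, seed=42):
    """Monte Carlo estimate of Var(difference of means) / Var(estimator)
    for standard normal samples of sizes n and m."""
    rng = random.Random(seed)
    baseline, estimates = [], []
    for _ in range(replications):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(0.0, 1.0) for _ in range(m)]
        baseline.append(sum(x) / n - sum(y) / m)
        estimates.append(estimator(x, y))
    return variance(baseline) / variance(estimates)


# The same seed is used for both estimators, so they see identical samples.
eff_shift = gaussian_efficiency(hl_shift, n=10, m=10)
eff_delta = gaussian_efficiency(delta_hl, n=10, m=10)
print(f"efficiency of HL(x, y):       {eff_shift:.3f}")
print(f"efficiency of Delta_HL(x, y): {eff_delta:.3f}")
```

Because the output is a Monte Carlo estimate, small differences between the two numbers should be interpreted with care; more replications and several sample sizes would be needed for a reliable comparison.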

## Thoughts on automatic statistical methods and broken assumptions

In the early days of applied statistics, all statistical experiments were performed by hand. In a manual investigation, the investigator is responsible not only for interpreting the research results but also for validating the applicability of the statistical approaches used. Nowadays, more and more data processing is performed automatically on enormous data sets. Because of the sheer number of data samples, it is often almost impossible for a human to verify each output individually.

Unfortunately, since we typically have no full control over the input data, we cannot guarantee the assumptions required by classic statistical methods. These assumptions can be violated not only by real-life phenomena we were not aware of during the experiment design stage, but also by data corruption. In such corner cases, we may get misleading results, wrong automatic decisions, unacceptably high Type I/II error rates, or even a program crash because of a division by zero or another invalid operation. If we want to make an automatic analysis system reliable and trustworthy, the underlying mathematical procedures should correctly process malformed data.

The normality assumption is probably the most popular one. There are well-known methods of robust statistics that focus only on slight deviations from normality and the appearance of extreme outliers. However, this covers the violation of only one specific consequence of the normality assumption: light-tailedness. In practice, this sub-assumption is often interpreted as “the probability of observing extremely large outliers is negligible.” Meanwhile, there are other implicit derived sub-assumptions: continuity (we do not expect tied values in the input samples), symmetry (we do not expect highly skewed distributions), unimodality (we do not expect multiple modes), nondegeneracy (we do not expect all sample values to be equal), sample size sufficiency (we do not expect extremely small samples like single-element samples), and others.

## Ratio estimator based on the Hodges-Lehmann approach

For two samples $$\mathbf{x} = ( x_1, x_2, \ldots, x_n )$$ and $$\mathbf{y} = ( y_1, y_2, \ldots, y_m )$$, the Hodges-Lehmann location shift estimator is defined as follows:

$\operatorname{HL}(\mathbf{x}, \mathbf{y}) = \underset{1 \leq i \leq n,\,\, 1 \leq j \leq m}{\operatorname{median}} \left(x_i - y_j \right).$

Now, let us consider the problem of estimating the ratio of the location measures instead of the shift between them. While there are multiple approaches to providing such an estimation, one of the options that can be considered is based on the Hodges-Lehmann ideas.
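One natural analogue of the shift estimator, shown here purely as an illustration, replaces pairwise differences with pairwise ratios. This is an assumption made for the sketch and not necessarily the estimator the post arrives at; the name `hl_ratio` is hypothetical, and the sketch assumes strictly positive values in both samples so that every ratio is well defined.

```python
from statistics import median


def hl_ratio(x, y):
    """Median of pairwise ratios x_i / y_j (hypothetical name).

    Assumes strictly positive values in both samples; this restriction is
    a simplification of the sketch, not a statement about the original method.
    """
    if any(v <= 0 for v in x) or any(v <= 0 for v in y):
        raise ValueError("this sketch assumes strictly positive sample values")
    return median(xi / yj for xi in x for yj in y)


x = [2.0, 4.0, 8.0]
y = [1.0, 2.0, 4.0]
print(hl_ratio(x, y))
```

In this toy example every element of `x` is twice the corresponding element of `y`, so the median of the nine pairwise ratios is 2.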

## Weighted Mann-Whitney U test, Part 2

Previously, I suggested a weighted version of the Mann–Whitney $$U$$ test. The distribution of the weighted normalized $$U_\circ^\star$$ can be obtained via bootstrap. However, it is always nice if we can come up with an exact solution for the distribution of the statistic or at least provide reasonable approximations. In this post, we start exploring this distribution.

## Exploring the power curve of the Cucconi test

The Cucconi test is a nonparametric two-sample test that compares both location and scale. It is a classic example of the family of tests that perform such a comparison simultaneously instead of combining the results of a location test and a scale test. Intuitively, such an approach should fit unimodal distributions well. Moreover, it has the potential to outperform more generic nonparametric tests that do not rely on the unimodality assumption.

In this post, we briefly show the equations behind the Cucconi test and present a power curve that compares it with Student’s t-test and the Mann-Whitney U test under normality.
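For orientation, here is a minimal sketch of the Cucconi statistic as it is commonly stated in the literature, based on the ranks of the first sample in the pooled sample. The function name is hypothetical, and the sketch assumes no tied values. It computes only the statistic: turning it into a p-value requires a permutation or asymptotic reference distribution, which is not shown here.

```python
import math


def cucconi_statistic(x, y):
    """Cucconi statistic C for two samples without ties (hypothetical helper).

    Larger values indicate stronger evidence of a location and/or scale difference.
    """
    n, m = len(x), len(y)
    total = n + m
    pooled = sorted(list(x) + list(y))
    if len(set(pooled)) != total:
        raise ValueError("this sketch assumes no tied values")
    rank = {value: r for r, value in enumerate(pooled, start=1)}
    ranks_x = [rank[value] for value in x]

    # Centering and scaling constants for the sums of squared ranks.
    center = n * (total + 1) * (2 * total + 1)
    scale = math.sqrt(n * m * (total + 1) * (2 * total + 1) * (8 * total + 11) / 5)

    # u uses squared ranks; v uses squared reversed ranks.
    u = (6 * sum(r ** 2 for r in ranks_x) - center) / scale
    v = (6 * sum((total + 1 - r) ** 2 for r in ranks_x) - center) / scale

    # Correlation between u and v under the null hypothesis.
    rho = 2 * (total ** 2 - 4) / ((2 * total + 1) * (8 * total + 11)) - 1
    return (u ** 2 + v ** 2 - 2 * rho * u * v) / (2 * (1 - rho ** 2))


c_separated = cucconi_statistic([1, 2, 3, 4], [5, 6, 7, 8])
c_interleaved = cucconi_statistic([1, 3, 5, 7], [2, 4, 6, 8])
print(c_separated, c_interleaved)
```

In this toy example, the fully separated samples produce a larger statistic than the interleaved ones, which matches the intuition that the test reacts to a location difference.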

## Parametric, Nonparametric, Robust, and Defensive statistics

Recently, I started writing about defensive statistics. The methodology allows having parametric assumptions, but it adjusts statistical methods so that they continue working even in the case of huge deviations from the declared assumptions. This idea sounds quite similar to nonparametric and robust statistics. In this post, I briefly explain the differences between these statistical methodologies.

## Insidious implicit statistical assumptions

Recently, I was rereading “Robust Statistics: The Approach Based on Influence Functions” by Frank Hampel et al. and I found this quote about the difference between robust and nonparametric statistics (page 9):

> Robust statistics considers the effects of only approximate fulfillment of assumptions, while nonparametric statistics makes rather weak but nevertheless strict assumptions (such as continuity of distribution or independence).

This statement may sound obvious. Unfortunately, facts that are presumably obvious in general are not always so obvious at the moment. When a researcher works with specific types of distributions for a long time, the properties of these distributions may be transformed into implicit assumptions. This implicitness can be pretty dangerous. If an assumption is explicitly declared, it can become a starting point for a discussion on how to handle violations of this assumption. The implicit assumptions are hidden and therefore conceal potential issues in cases when the collected data do not meet our expectations.

A switch from parametric to nonparametric methods is sometimes perceived as a rejection of all assumptions. Such a perception can be hazardous. While the original parametric assumption is indeed dropped, many researchers continue to act as if the implicit consequences of this assumption were still valid.

Since normality is the most popular parametric assumption, I would like to briefly discuss connected implicit assumptions that are often perceived not as non-validated hypotheses, but as essential properties of the collected data.

## Four main books on robust statistics

Robust statistics is a practical and pragmatic branch of statistics. If you want to design reliable and trustworthy statistical procedures, the knowledge of robust statistics is essential. Unfortunately, it’s a challenging topic to learn.

In this post, I share my favorite books on robust statistics. I cannot pick a single favorite: each book is good in its own way, and all of them complement each other. I return to these books periodically to reinforce and expand my understanding of the topic.