## Weighted modification of the Hodges-Lehmann location estimator

The classic Hodges-Lehmann location estimator is a robust, non-parametric statistic used as a measure of the central tendency. For a sample $$\mathbf{x} = \{ x_1, x_2, \ldots, x_n \}$$, it is defined as follows:

$\operatorname{HL}(\mathbf{x}) = \underset{1 \leq i < j \leq n}{\operatorname{median}} \left(\frac{x_i + x_j}{2} \right).$
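The definition above translates directly into a few lines of code. The following is a minimal sketch of the classic (non-weighted) estimator, assuming a brute-force pass over all pairs with $$i < j$$; the function name `hodges_lehmann` is chosen here for illustration only.

```python
from itertools import combinations
from statistics import median


def hodges_lehmann(x):
    """Median of the pairwise averages (x_i + x_j) / 2 over all pairs i < j."""
    if len(x) < 2:
        raise ValueError("need at least two observations")
    # Brute force: n * (n - 1) / 2 pairwise averages, then their median.
    return median((a + b) / 2 for a, b in combinations(x, 2))


sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(hodges_lehmann(sample))  # 3.25
```

In this example, the arithmetic mean of the sample is 22 because of the single extreme value, while the Hodges-Lehmann estimate stays close to the bulk of the data. The quadratic number of pairs makes this sketch suitable only for small samples.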

This estimator works great for non-weighted samples (its asymptotic Gaussian efficiency is $$\approx 96\%$$, and its asymptotic breakdown point is $$\approx 29\%$$). However, in real-world applications, data points may have varying importance or relevance. For example, in finance, different stocks may have different market capitalizations, which can impact the overall performance of an index. In social science research, survey responses may be weighted based on demographic representation to ensure that the final results are more generalizable. In software performance measurements, the observations may be collected from different source code revisions, some of which may be obsolete. In these cases, the classic $$\operatorname{HL}$$-measure is not suitable, as it treats each data point equally.

We can overcome this problem by using weighted samples to obtain more accurate and meaningful central tendency estimates. Unfortunately, there is no well-established definition of the weighted Hodges-Lehmann location estimator. In this blog post, we introduce such a definition so that we can apply this estimator to weighted samples while keeping it compatible with the original version.

## Performance stability of GitHub Actions

Nowadays, GitHub Actions is one of the most popular free CI systems. It’s quite convenient to use it to run unit and integration tests. However, some developers try to use it to run benchmarks and performance tests. Unfortunately, default GitHub Actions build agents do not provide a consistent execution environment from the performance point of view. Therefore, performance measurements from different builds cannot be compared reliably. This makes it almost impossible to set up reliable performance tests based on the default GitHub Actions build agent pool.

So, it’s expected that the execution environments are not absolutely identical. But how bad is the situation? What’s the maximum difference between performance measurements from different builds? Is there a chance that we can play with thresholds and utilize GitHub Actions to detect at least major performance degradations? Let’s find out!

## p-value distribution of the Brunner–Munzel test in the finite case

In one of the previous posts, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.

## Comparing statistical power of the Mann-Whitney U test and the Brunner-Munzel test

In this post, we perform a short numerical simulation to compare the statistical power of the Mann-Whitney U test and the Brunner-Munzel test under normality for various sample sizes and significance levels.

## p-value distribution of the Mann–Whitney U test in the finite case

When we work with null hypothesis significance testing and the null hypothesis is true, the distribution of observed p-values is asymptotically uniform. However, the distribution shape is not always uniform in the finite case. For example, when we work with rank-based tests like the Mann–Whitney U test, the distribution of the p-values is discrete with a limited set of possible values. This should be taken into account when we design a testing procedure for small samples and choose the significance level.

Previously, we already discussed the minimum reasonable significance level of the Mann-Whitney U test for small samples. In this post, we explore the full distribution of the p-values for this case.
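To make the discreteness concrete, here is a small sketch that enumerates every attainable p-value for given sample sizes. It assumes the exact two-sided Mann–Whitney U test without ties, with the p-value defined as the probability of a U statistic at least as far from its null mean $$nm/2$$ as the observed one; other p-value conventions may give a slightly different set. The function name `mann_whitney_exact_p_values` is hypothetical.

```python
from itertools import combinations
from math import comb


def mann_whitney_exact_p_values(n, m):
    """Sorted distinct two-sided exact p-values for sample sizes n and m (no ties)."""
    total = comb(n + m, n)
    # Under the null hypothesis, every assignment of n ranks out of n + m
    # to the first sample is equally likely.
    counts = {}
    for ranks in combinations(range(1, n + m + 1), n):
        u = sum(ranks) - n * (n + 1) // 2
        counts[u] = counts.get(u, 0) + 1
    center = n * m / 2  # null mean of U
    p_values = set()
    for u in counts:
        dist = abs(u - center)
        # Number of assignments with U at least as extreme as u.
        tail = sum(c for v, c in counts.items() if abs(v - center) >= dist)
        p_values.add(tail / total)
    return sorted(p_values)


print(mann_whitney_exact_p_values(3, 3))  # [0.1, 0.2, 0.4, 0.7, 1.0]
```

For $$n = m = 3$$, only five p-values are possible, and the smallest one is $$0.1$$, so a significance level of $$0.05$$ can never lead to a rejection under these assumptions.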

## Corner case of the Brunner–Munzel test

The Brunner–Munzel test is a nonparametric significance test, which can be considered an alternative to the Mann–Whitney U test. However, the Brunner–Munzel test has a corner case that can cause some practical issues with applying this test to real data. In this post, I briefly discuss the test itself and the corresponding corner case.

## Examples of the Mann–Whitney U test misuse cases

The Mann–Whitney U test is one of the most popular nonparametric statistical tests. Its alternative hypothesis claims that one distribution is stochastically greater than the other. However, people often misuse this test and try to apply it to check if two nonparametric distributions are not identical or that there is a difference in distribution medians (while there are no additional assumptions on the shapes of the distributions). In this post, I show several cases in which the Mann–Whitney U test is not applicable for comparing two distributions.

## Types of finite-sample consistency with the standard deviation

Let us say we have a robust dispersion estimator $$\operatorname{T}(X)$$. If it is asymptotically consistent with the standard deviation, we can use such an estimator as a robust replacement for the standard deviation under normality. Thanks to asymptotic consistency, we can use the estimator “as is” for large samples. However, if the number of sample elements is small, we typically need finite-sample bias-correction factors to make the estimator unbiased. Here we should clearly understand what kind of consistency we need.

There are various ways to estimate the standard deviation. Let us consider a sample of random variables $$X = \{ X_1, X_2, \ldots, X_n \}$$. The most popular estimator of the standard deviation is given by

$s(X) = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X})^2}.$

Using this definition, we get an unbiased estimator of the population variance: for a population with unit variance, $$\mathbb{E}[s^2(X)] = 1$$. However, $$s(X)$$ is a biased estimator of the population standard deviation: $$\mathbb{E}[s(X)] \neq 1$$. Under normality with unit variance, $$\mathbb{E}[s(X)] = c_4(n)$$, so to obtain the corresponding unbiased estimator, we should use $$s(X) / c_4(n)$$, where $$c_4(n)$$ is a correction factor defined as follows:

$c_4(n) = \sqrt{\frac{2}{n-1}} \cdot \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}.$
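The factor is easy to evaluate numerically. The sketch below computes $$c_4(n)$$ from the formula above; it uses the log-gamma function instead of the gamma function as an implementation choice, assuming we want to avoid overflow for large $$n$$. The function name `c4` is chosen here for illustration.

```python
from math import exp, lgamma, sqrt


def c4(n):
    """Correction factor c_4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2)."""
    if n < 2:
        raise ValueError("c4 is defined for n >= 2")
    # The ratio of gamma functions is computed on the log scale,
    # so that large n does not overflow.
    return sqrt(2.0 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))


for n in (2, 5, 10, 100):
    print(n, round(c4(n), 4))
```

The factor is noticeably below 1 for small samples (for $$n = 2$$, it equals $$\sqrt{2/\pi} \approx 0.798$$) and approaches 1 as $$n$$ grows.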

When we define finite-sample bias-correction factors for a robust standard deviation replacement, we should choose which kind of consistency we need. In this post, I briefly explore available options.

## Thoughts about outlier removal and ozone holes

Imagine you work with some data and assume that the underlying distribution is approximately normal. In such cases, the data analysis typically involves non-robust statistics like the mean and the standard deviation. While these metrics are highly efficient under normality, they make the analysis procedure fragile: a single extreme value can corrupt all the results. You may not expect any significant outliers, but you can never be 100% sure. To avoid unexpected surprises and ensure the reliability of the results, it may be tempting to automatically exclude all outliers from the collected samples. While this approach is widely adopted, it conceals an essential part of the obtained data and can lead to fallacious conclusions.

Let me recite a classic story about ozone holes, which is typically used to illustrate the danger of blind outlier removal: