Performance stability of GitHub Actions
Nowadays, GitHub Actions is one of the most popular free CI systems. It’s quite convenient for running unit and integration tests. However, some developers also try to use it for benchmarks and performance tests. Unfortunately, the default GitHub Actions build agents do not provide a consistent execution environment from a performance point of view. Therefore, performance measurements from different builds cannot be compared directly. This makes it almost impossible to set up reliable performance tests based on the default GitHub Actions build agent pool.
So, it’s expected that the execution environments are not absolutely identical. But how bad is the situation? What’s the maximum difference between performance measurements from different builds? Is there a chance that we can play with thresholds and utilize GitHub Actions to detect at least major performance degradations? Let’s find out!
Read more
p-value distribution of the Brunner–Munzel test in the finite case
In one of the previous posts, I explored the distribution of observed p-values for the Mann–Whitney U test in the finite case when the null hypothesis is true. It is time to repeat the experiment for the Brunner–Munzel test.
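For readers who want to reproduce a similar experiment, here is a minimal simulation sketch; the sample size, the iteration count, and the use of SciPy's brunnermunzel are my own assumptions, not the exact setup from the post:

```python
# A minimal sketch: distribution of observed Brunner-Munzel p-values when the null
# hypothesis is true. The sample size and iteration count are illustrative only.
import numpy as np
from scipy.stats import brunnermunzel

rng = np.random.default_rng(1729)
n, iterations = 10, 10_000
p_values = np.array([
    brunnermunzel(rng.standard_normal(n), rng.standard_normal(n)).pvalue
    for _ in range(iterations)
])  # both samples come from the same distribution => the null hypothesis is true

# If the distribution were uniform, each decile would contain ~10% of the p-values.
hist, _ = np.histogram(p_values, bins=10, range=(0.0, 1.0))
print(hist / iterations)
```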
Read more
Comparing statistical power of the Mann-Whitney U test and the Brunner-Munzel test
In this post, we perform a short numerical simulation to compare the statistical power of the Mann-Whitney U test and the Brunner-Munzel test under normality for various sample sizes and significance levels.
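A sketch of such a simulation could look as follows; the shift, sample size, significance level, and iteration count below are illustrative assumptions rather than the exact configuration from the post:

```python
# A minimal sketch of a power simulation for the Mann-Whitney U test and the
# Brunner-Munzel test under normality. All parameters are illustrative assumptions.
import numpy as np
from scipy.stats import mannwhitneyu, brunnermunzel

rng = np.random.default_rng(1729)
n, shift, alpha, iterations = 15, 1.0, 0.05, 5_000
rejections_mw = rejections_bm = 0
for _ in range(iterations):
    x = rng.standard_normal(n)
    y = rng.standard_normal(n) + shift  # the alternative hypothesis is true
    rejections_mw += mannwhitneyu(x, y, alternative='two-sided').pvalue < alpha
    rejections_bm += brunnermunzel(x, y, alternative='two-sided').pvalue < alpha

print("Mann-Whitney U power: ", rejections_mw / iterations)
print("Brunner-Munzel power: ", rejections_bm / iterations)
```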
Read more
p-value distribution of the Mann–Whitney U test in the finite case
When we work with null hypothesis significance testing and the null hypothesis is true, the distribution of observed p-values is asymptotically uniform. However, the distribution shape is not always uniform in the finite case. For example, when we work with rank-based tests like the Mann–Whitney U test, the distribution of the p-values is discrete with a limited set of possible values. This should be taken into account when we design a testing procedure for small samples and choose the significance level.
In a previous post, we discussed the minimum reasonable significance level of the Mann–Whitney U test for small samples. In this post, we explore the full distribution of the p-values for this case.
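To see the discreteness directly, one can enumerate every p-value the exact test can produce for tiny samples. A minimal sketch (the sample sizes are illustrative, and the method='exact' argument requires a reasonably recent SciPy):

```python
# A minimal sketch: enumerating all p-values the exact Mann-Whitney U test can produce
# for two samples of size 4 (illustrative sample sizes).
from itertools import combinations
from scipy.stats import mannwhitneyu

n = m = 4
ranks = range(n + m)
p_values = set()
for x_ranks in combinations(ranks, n):
    x = list(x_ranks)
    y = [r for r in ranks if r not in x_ranks]
    p = mannwhitneyu(x, y, alternative='two-sided', method='exact').pvalue
    p_values.add(round(p, 6))

print(sorted(p_values))  # a small discrete set of values instead of a continuum
```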
Read more
Corner case of the Brunner–Munzel test
The Brunner–Munzel test is a nonparametric significance test, which can be considered an alternative to the Mann–Whitney U test. However, the Brunner–Munzel test has a corner case that can cause some practical issues with applying this test to real data. In this post, I briefly discuss the test itself and the corresponding corner case.
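One corner case that is easy to reproduce (and, as I understand it, the one discussed in the post) is a pair of completely separated samples. The helper function below is my own illustration, not code from the post:

```python
# A minimal sketch of the degenerate case: two completely separated samples.
# The within-sample variances of the "placements" used by the Brunner-Munzel statistic
# become zero, so computing the statistic requires division by zero.
import numpy as np
from scipy.stats import rankdata

def bm_placement_variances(x, y):
    nx = len(x)
    ranks_all = rankdata(np.concatenate([x, y]))  # ranks in the combined sample
    rx, ry = ranks_all[:nx], ranks_all[nx:]
    sx2 = np.var(rx - rankdata(x), ddof=1)        # variance of placements of x
    sy2 = np.var(ry - rankdata(y), ddof=1)        # variance of placements of y
    return sx2, sy2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 11.0, 12.0, 13.0])  # every y element exceeds every x element
print(bm_placement_variances(x, y))     # (0.0, 0.0): the denominator of the statistic vanishes
```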
Read more
Examples of the Mann–Whitney U test misuse cases
The Mann–Whitney U test is one of the most popular nonparametric statistical tests. Its alternative hypothesis claims that one distribution is stochastically greater than the other. However, people often misuse this test, trying to check whether two arbitrary distributions are not identical or whether their medians differ, without any additional assumptions about the shapes of the distributions. In this post, I show several cases in which the Mann–Whitney U test is not applicable for comparing two distributions.
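As a quick illustration of such a misuse (my own toy example, not necessarily one from the post): two clearly different distributions with equal medians, where neither is stochastically greater than the other, so the test has almost no power.

```python
# A minimal sketch (an illustrative example of my own): the Mann-Whitney U test applied to
# two clearly different normal distributions with equal medians but very different spreads.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1729)
alpha, iterations, rejections = 0.05, 5_000, 0
for _ in range(iterations):
    x = rng.normal(loc=0.0, scale=1.0, size=30)
    y = rng.normal(loc=0.0, scale=10.0, size=30)  # same median, much larger dispersion
    rejections += mannwhitneyu(x, y, alternative='two-sided').pvalue < alpha

# The rejection rate stays low even though the distributions are obviously not identical:
# the test is not designed to detect this kind of difference.
print(rejections / iterations)
```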
Read more
Types of finite-sample consistency with the standard deviation
Let us say we have a robust dispersion estimator \(\operatorname{T}(X)\). If it is asymptotically consistent with the standard deviation, we can use such an estimator as a robust replacement for the standard deviation under normality. Thanks to asymptotic consistency, we can use the estimator “as is” for large samples. However, if the number of sample elements is small, we typically need finite-sample bias-correction factors to make the estimator unbiased. Here we should clearly understand what kind of consistency we need.
There are various ways to estimate the standard deviation. Let us consider a sample of random variables \(X = \{ X_1, X_2, \ldots, X_n \}\). The most popular estimator of the standard deviation is given by
\[s(X) = \sqrt{\frac{1}{n - 1} \sum_{i=1}^n (X_i - \overline{X})^2}. \]
Using this definition, we get an unbiased estimator of the population variance: \(\mathbb{E}[s^2(X)] = \sigma^2\). However, it is a biased estimator of the population standard deviation: \(\mathbb{E}[s(X)] \neq \sigma\). To obtain the corresponding unbiased estimator, we should use \(s(X) / c_4(n)\), where \(c_4(n)\) is a correction factor defined as follows:
\[c_4(n) = \sqrt{\frac{2}{n-1}} \cdot \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)}. \]
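As a quick numerical sanity check of this correction, here is a minimal simulation sketch under standard normal data (the sample size and iteration count are illustrative):

```python
# A minimal sketch: s(X) is biased for sigma in small samples, while s(X) / c4(n)
# is approximately unbiased under normality. Parameters are illustrative only.
import numpy as np
from scipy.special import gamma

def c4(n):
    return np.sqrt(2.0 / (n - 1)) * gamma(n / 2.0) / gamma((n - 1) / 2.0)

rng = np.random.default_rng(1729)
n, sigma, iterations = 5, 1.0, 100_000
s = np.array([rng.normal(0.0, sigma, n).std(ddof=1) for _ in range(iterations)])
print("E[s]          ~", s.mean())            # noticeably below sigma for small n
print("c4(n)         =", c4(n))
print("E[s / c4(n)]  ~", (s / c4(n)).mean())  # close to sigma
```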
When we define finite-sample bias-correction factors for a robust standard deviation replacement, we should choose which kind of consistency we need. In this post, I briefly explore available options.
Read more
Thoughts about outlier removal and ozone holes
Imagine you work with some data and assume that the underlying distribution is approximately normal. In such cases, the data analysis typically involves non-robust statistics like the mean and the standard deviation. While these metrics are highly efficient under normality, they make the analysis procedure fragile: a single extreme value can corrupt all the results. You may not expect any significant outliers, but you can never be 100% sure. To avoid unexpected surprises and ensure the reliability of the results, it may be tempting to automatically exclude all outliers from the collected samples. While this approach is widely adopted, it conceals an essential part of the obtained data and can lead to fallacious conclusions.
Let me retell a classic story about ozone holes, which is typically used to illustrate the danger of blind outlier removal:
Read more
Nonparametric effect size: Cohen's d vs. Glass's delta
In the previous posts, I discussed the idea of nonparametric effect size measures consistent with Cohen’s d under normality. However, Cohen’s d is not always the best effect size measure, even in the normal case.
In this post, we briefly discuss a case study in which a nonparametric version of Glass’s delta is preferable to the previously suggested Cohen’s d-consistent measure.
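For reference, the classic (parametric) definitions differ only in the denominator. Here is a minimal sketch with my own illustrative numbers; the nonparametric counterparts discussed in the posts replace these components with robust estimators:

```python
# A minimal sketch of the classic definitions: Cohen's d normalizes by the pooled standard
# deviation, Glass's delta by the standard deviation of the control group only.
import numpy as np

def cohen_d(treatment, control):
    nt, nc = len(treatment), len(control)
    pooled_var = ((nt - 1) * np.var(treatment, ddof=1) +
                  (nc - 1) * np.var(control, ddof=1)) / (nt + nc - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

def glass_delta(treatment, control):
    return (np.mean(treatment) - np.mean(control)) / np.std(control, ddof=1)

rng = np.random.default_rng(1729)
control = rng.normal(0.0, 1.0, 100)
treatment = rng.normal(1.0, 3.0, 100)  # the treatment also changes the dispersion
print(cohen_d(treatment, control), glass_delta(treatment, control))
```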
Read more
Trinal statistical thresholds
When we design a test for practical significance that compares two samples, we should somehow express the threshold. The most popular options are the shift, the ratio, and the effect size. Unfortunately, if we have little information about the underlying distributions, it’s hard to get a reliable test based only on a single threshold. And it’s almost impossible to define a generic threshold that fits all situations. After struggling with a lot of different thresholding approaches, I came up with the idea of setting a trinal threshold that includes three individual thresholds: one for the shift, one for the ratio, and one for the effect size.
In this post, I show some examples in which a single threshold is not enough.
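To make the idea more concrete, here is a minimal sketch of what a trinal threshold check could look like; the estimators, the default threshold values, and the rule for combining the three checks are my own assumptions for illustration, not a recipe from the post:

```python
# A minimal sketch of a trinal threshold check for a performance regression.
# The estimators (medians, pooled standard deviation), the default threshold values,
# and the AND-combination rule are illustrative assumptions.
import numpy as np

def pooled_std(x, y):
    nx, ny = len(x), len(y)
    return np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                    (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))

def exceeds_trinal_threshold(baseline, candidate,
                             min_shift=1.0,          # absolute difference of the centers
                             min_ratio=1.05,         # relative difference of the centers
                             min_effect_size=0.3):   # difference in pooled-std units
    shift = np.median(candidate) - np.median(baseline)
    ratio = np.median(candidate) / np.median(baseline)
    effect_size = shift / pooled_std(baseline, candidate)
    # Report a practically significant regression only if all three thresholds are exceeded.
    return shift > min_shift and ratio > min_ratio and effect_size > min_effect_size

rng = np.random.default_rng(1729)
baseline = rng.normal(100.0, 5.0, 30)   # e.g., durations in milliseconds
candidate = rng.normal(110.0, 5.0, 30)
print(exceeds_trinal_threshold(baseline, candidate))
```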
Read more