Degrees of practical significance


Let’s say we have two data samples, and we want to check if there is a difference between them. If we ask about any kind of difference, the answer is almost certainly yes: it is highly unlikely that two random samples are identical. And even if they happen to be, we may have observed such a coincidence by accident while the underlying distributions still differ. Therefore, a discussion about the existence of any kind of difference is not meaningful.

To obtain more meaningful insights, researchers often talk about statistical significance. However, this approach can also be misleading. If the sample size is large enough, we can almost always detect even a negligible difference and obtain a statistically significant result for any pair of distributions. On the other hand, a huge difference can be declared insignificant if the sample size is small. While the concept is interesting and well-researched, it rarely matches the actual research goal. I strongly believe that we should not test the nil hypothesis (checking whether the true difference is exactly zero).
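To make the large-sample pitfall concrete, here is a minimal Python sketch (the sample sizes, means, and the simple two-sided z-test are illustrative assumptions): a true mean difference of just 0.05 becomes overwhelmingly “statistically significant” once the samples are large enough.

```python
import math
import random
import statistics

def z_test_p_value(xs, ys):
    # Two-sided z-test for the difference in means; the normal approximation
    # is adequate for the large samples used below.
    se = math.sqrt(statistics.variance(xs) / len(xs) +
                   statistics.variance(ys) / len(ys))
    z = (statistics.fmean(xs) - statistics.fmean(ys)) / se
    return math.erfc(abs(z) / math.sqrt(2))  # p = 2 * (1 - Phi(|z|))

rng = random.Random(42)
n = 100_000
xs = [rng.gauss(0.00, 1) for _ in range(n)]  # N(0, 1)
ys = [rng.gauss(0.05, 1) for _ in range(n)]  # N(0.05, 1): a negligible shift
print(z_test_p_value(xs, ys))  # an extremely small p-value
```

With only a handful of observations per group, the same test would easily declare even a huge difference insignificant, which is exactly why statistical significance alone rarely matches the research goal.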

Here, we can switch from statistical significance to practical significance: we define a threshold (e.g., in terms of the minimum effect size) for the difference that is meaningful for the research. This approach is more likely to align with the research goals. However, it is not always satisfactory either. We should keep in mind that hypothesis testing often arises in the context of decision-making problems. In some cases, we conduct exploratory research in which we just want to understand the world better. In most cases, however, we do not perform calculations out of pure curiosity; we want to make a decision based on the results. And this is the crucial point: it should always be the starting point of any research project. First of all, we should clearly describe the possible decisions and their preconditions. Once we start doing that, we may discover that not all practically significant outcomes are equally significant. If different practically significant results may lead to different decisions, we should define a proper classification in advance, during the research design stage. The dichotomy of “practically significant” vs. “not practically significant” may conceal important aspects of the problem and lead to a wrong decision.

In this post, I would like to discuss degrees of practical significance and show an example of how important they can be for some problems.

Read more


Weighted Mann-Whitney U test, Part 3


I continue building a weighted version of the Mann–Whitney $U$ test. While the previously suggested approach feels promising, I don’t like the use of bootstrap to obtain the $p$-value: it is always better to have a deterministic and exact approach where possible. I still don’t know how to solve the problem in the general case, but it seems that I’ve obtained a reasonable solution for some specific cases. The current version of the approach still has issues: it requires correction factors in some cases and needs further improvements. However, it meets my minimal requirements, so it is worth continuing to develop this idea. In this post, I share a description of the weighted approach and provide numerical examples.

Read more


Andreas Löffler's implementation of the exact p-value calculation for the Mann-Whitney U test


The Mann-Whitney U test is one of the most popular non-parametric statistical tests. Unfortunately, most of its implementations in statistical packages are far from perfect. The exact p-value calculation is time-consuming and can be impractical for large samples. Therefore, most implementations automatically switch to the asymptotic approximation, which can be quite inaccurate. Indeed, the classic normal approximation can produce enormous errors. Thanks to the Edgeworth expansion, the accuracy can be improved, but it is still not always satisfactory. I prefer using the exact p-value calculation whenever possible.

The computational complexity of the exact p-value calculation using the classic recurrent equation suggested by Mann and Whitney is $\mathcal{O}(n^2 m^2)$ in terms of both time and memory. This is not a problem for small samples, but for medium-size samples, the calculation becomes slow and acquires a huge memory footprint. This gives us an unpleasant dilemma: either we use the exact p-value calculation (which is extremely time- and memory-consuming), or we use the asymptotic approximation (which gives poor accuracy).
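For reference, here is a minimal Python sketch of the classic counting recurrence (conditioning on whether the largest element of the pooled sample comes from the first sample); the memoization table over $(u, n, m)$ is what produces the $\mathcal{O}(n^2 m^2)$ time and memory cost:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count(u, n, m):
    # Number of arrangements of n x's and m y's with exactly u pairs (x, y)
    # such that x > y. If the largest pooled element is an x, it beats all
    # m y's (contributing m to U); otherwise it contributes nothing.
    if u < 0 or u > n * m:
        return 0
    if n == 0 or m == 0:
        return 1 if u == 0 else 0
    return count(u - m, n - 1, m) + count(u, n, m - 1)

def mann_whitney_cdf(u, n, m):
    # Exact P(U <= u) under the null hypothesis (assuming no ties)
    return sum(count(k, n, m) for k in range(u + 1)) / comb(n + m, n)

print(mann_whitney_cdf(1, 2, 2))  # 2 of 6 arrangements: ~0.3333
```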

Last week, I got acquainted with a brilliant algorithm for the exact p-value calculation suggested by Andreas Löffler in 1982. It’s much faster than the classic approach, and it requires only $\mathcal{O}(n+m)$ memory.

Read more


Eclectic statistics


In the world of mathematical statistics, there is a constant confrontation between adherents of different paradigms. This is a perpetual source of confusion for many researchers who struggle to pick the proper approach to follow. For example, how should one choose between the frequentist and Bayesian approaches? Since these paradigms may produce inconsistent results (e.g., see Lindley’s paradox), some choice has to be made. The easiest way to conduct research is to pick a single paradigm and stick to it. The right way to conduct research is to think carefully.

Read more


Change Point Detection and Recent Changes


Change point detection (CPD) in time series analysis is an essential tool for identifying significant shifts in data patterns. These shifts, or “change points,” can signal critical transitions in various contexts. While most CPD algorithms are adept at discovering historical change points, their sensitivity in detecting recent changes can be limited, often due to a key parameter: the minimum distance between sequential change points. In this post, I share some speculations on how we can improve CPD analysis by combining two change point detectors.

Read more


Merging extended P² quantile estimators, Part 1


The P² quantile estimator is a streaming quantile estimator with an $\mathcal{O}(1)$ memory footprint and an extremely fast update procedure. Several days ago, I learned that it was adopted for the new Paint.NET GPU-based Median Sketch effect (the description is here). While P² meets the basic problem requirement (streaming median approximation without storing all the values), the algorithm’s performance is still not acceptable without additional adjustments. A significant performance improvement can be obtained if we split the input stream, process each part with its own P² instance, and merge the results. Unfortunately, the merging procedure is a tricky thing to implement. I enjoy such challenges, so I decided to attempt to build such a merging approach. In this post, I describe my first attempt.
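For context, here is a compact Python sketch of the classic single-stream P² estimator (following Jain and Chlamtac’s 1985 algorithm; the marker bookkeeping is my own transcription, so treat it as an illustration rather than a reference implementation):

```python
import random

class P2Quantile:
    """Streaming estimator of the p-th quantile with five markers: O(1) memory."""

    def __init__(self, p):
        self.p = p
        self.xs = []   # first five observations
        self.q = []    # marker heights
        self.n = []    # actual marker positions (1-based)
        self.np = []   # desired marker positions
        self.dn = [0.0, p / 2, p, (1 + p) / 2, 1.0]  # desired-position increments

    def add(self, x):
        if len(self.q) < 5:
            self.xs.append(x)
            self.xs.sort()
            if len(self.xs) == 5:  # initialize markers from the first five values
                self.q = list(self.xs)
                self.n = [1, 2, 3, 4, 5]
                p = self.p
                self.np = [1, 1 + 2 * p, 1 + 4 * p, 3 + 2 * p, 5]
            return
        # Find the cell k containing x, adjusting the extreme markers if needed
        if x < self.q[0]:
            self.q[0] = x
            k = 0
        elif x >= self.q[4]:
            self.q[4] = x
            k = 3
        else:
            k = next(i for i in range(4) if self.q[i] <= x < self.q[i + 1])
        for i in range(k + 1, 5):
            self.n[i] += 1
        for i in range(5):
            self.np[i] += self.dn[i]
        # Move the inner markers if they drifted from their desired positions
        for i in range(1, 4):
            d = self.np[i] - self.n[i]
            if (d >= 1 and self.n[i + 1] - self.n[i] > 1) or \
               (d <= -1 and self.n[i - 1] - self.n[i] < -1):
                s = 1 if d >= 1 else -1
                q_new = self._parabolic(i, s)
                if not (self.q[i - 1] < q_new < self.q[i + 1]):
                    q_new = self._linear(i, s)  # fall back to linear interpolation
                self.q[i] = q_new
                self.n[i] += s

    def _parabolic(self, i, s):
        q, n = self.q, self.n
        return q[i] + s / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i]) +
            (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1]))

    def _linear(self, i, s):
        q, n = self.q, self.n
        return q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])

    def quantile(self):
        if len(self.q) == 5:
            return self.q[2]  # the middle marker tracks the p-th quantile
        xs = sorted(self.xs)
        return xs[int(self.p * (len(xs) - 1))] if xs else float("nan")

random.seed(42)
est = P2Quantile(0.5)
for _ in range(100_000):
    est.add(random.random())
print(est.quantile())  # close to the true median 0.5
```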

Read more


Hodges-Lehmann ratio estimator vs. Bhattacharyya's scale ratio estimator


Previously, I discussed an idea of a ratio estimator based on the Hodges-Lehmann estimator. This idea looks so simple and natural that I was sure it must have already been proposed and studied. However, when I started searching for it, the task turned out to be not as easy as I expected. Moreover, some papers attribute this idea to Bhattacharyya, which is not accurate. In this post, we discuss the difference between these two approaches.
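To make the construction tangible, here is a small Python sketch. The shift version is the classic two-sample Hodges-Lehmann estimator; the ratio version shown here (the median of all pairwise ratios) is only one natural way to build such an estimator and is not necessarily the exact definition discussed in the post:

```python
from itertools import product
from statistics import median

def hodges_lehmann_shift(xs, ys):
    # Classic two-sample Hodges-Lehmann estimator: median of pairwise differences
    return median(x - y for x, y in product(xs, ys))

def hodges_lehmann_ratio(xs, ys):
    # Illustrative ratio analog: median of pairwise ratios x_i / y_j
    # (assumes strictly positive ys)
    return median(x / y for x, y in product(xs, ys))

print(hodges_lehmann_shift([1, 2, 3], [2, 4, 6]))  # -2
print(hodges_lehmann_ratio([1, 2, 3], [2, 4, 6]))  # 0.5
```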

Read more


Finite-sample Gaussian efficiency: Shamos vs. Rousseeuw-Croux Qn scale estimators


Previously, we compared the finite-sample Gaussian efficiency of the Rousseeuw-Croux scale estimators and the QAD estimator. In this post, we compare the finite-sample Gaussian efficiency of the Shamos scale estimator and the Rousseeuw-Croux $Q_n$ scale estimator. This is a particularly interesting comparison. In the famous “Alternatives to the Median Absolute Deviation” (1993) paper by Peter J. Rousseeuw and Christophe Croux, the authors presented $Q_n$ as an improved version of the Shamos estimator. Both estimators are based on the set of pairwise absolute differences between the elements of the sample. The Shamos estimator takes the median of this set and, therefore, has the asymptotic breakdown point of $\approx 29\%$ and the asymptotic Gaussian efficiency of $\approx 86\%$. $Q_n$ takes the first quartile of this set and, therefore, has the asymptotic breakdown point of $\approx 50\%$ (like the median) and the asymptotic Gaussian efficiency of $\approx 82\%$. It sounds like a good deal: we trade $4\%$ of the asymptotic Gaussian efficiency for $21\%$ of the asymptotic breakdown point. What could possibly stop us from using $Q_n$ everywhere instead of the Shamos estimator?

Well, here is the catch. The breakdown point of $29\%$ is actually a practically reasonable value. If more than $29\%$ of the sample consists of outliers, we should probably consider them not as outliers but as a separate mode. Such a situation should be handled by a multimodality detector and lead us to a different approach. Using dispersion estimators in the case of multimodal distributions is potentially misleading. When such a multimodality diagnostic scheme is used, there is no practical need for a higher breakdown point.

Thus, the breakdown point of $50\%$ is not such an impressive property of $Q_n$. Meanwhile, the drop in Gaussian efficiency is not so pleasant. $4\%$ may sound like a negligible difference, but it is only the asymptotic value; in real life, we work with finite samples. Let us explore the actual finite-sample Gaussian efficiency values of these estimators.
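As a rough illustration of how such finite-sample values can be obtained, here is a Monte Carlo sketch in Python (the standardized-variance definition of relative efficiency and the comparison against the sample standard deviation are my assumptions here; consistency constants cancel out and are omitted):

```python
import random
import statistics
from itertools import combinations
from math import comb

def shamos(xs):
    # Shamos estimator: median of all pairwise absolute differences
    return statistics.median(abs(a - b) for a, b in combinations(xs, 2))

def qn(xs):
    # Rousseeuw-Croux Qn core: the k-th smallest pairwise absolute difference,
    # where k = C(h, 2) and h = floor(n / 2) + 1
    h = len(xs) // 2 + 1
    k = comb(h, 2)
    return sorted(abs(a - b) for a, b in combinations(xs, 2))[k - 1]

def standardized_variance(est, n, reps=10_000, seed=1729):
    # Var(T) / E[T]^2 over repeated N(0, 1) samples of size n; this ratio
    # is scale-invariant, so the missing consistency constants do not matter
    rng = random.Random(seed)
    vals = [est([rng.gauss(0, 1) for _ in range(n)]) for _ in range(reps)]
    m = statistics.fmean(vals)
    return statistics.pvariance(vals) / (m * m)

def gaussian_efficiency(est, n):
    # Efficiency of est relative to the sample standard deviation;
    # the shared seed pairs both estimators on identical samples
    return standardized_variance(statistics.stdev, n) / standardized_variance(est, n)

for n in (5, 10):
    print(n, round(gaussian_efficiency(shamos, n), 3),
             round(gaussian_efficiency(qn, n), 3))
```

The exact numbers are noisy at this simulation size; increasing `reps` sharpens them.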

Read more


Two-pass change point detection for temporary interval condensation


When we choose a change point detection algorithm, the most important thing is to clearly understand why we want to detect the change points. The knowledge of the final business goals is essential. In this post, I show a simple example of how a business requirement can be translated into algorithm adjustments.

Read more


Inconsistent violin plots


The usefulness and meaningfulness of violin plots are dubious (e.g., see this video and the corresponding discussion). While this type of plot inherits the issues of density plots (e.g., the bandwidth selection problem) and box plots, it also introduces new problems. One such problem is data inconsistency: the default density and box-plot components are often incompatible with each other. In this post, I show an example of this inconsistency.

Read more