Effect Sizes and Asymmetry


Cohen’s d is one of the most popular measures of the effect size. Unfortunately, it was designed for the normal distribution, which may make it a misleading measure in the non-normal case. And the real distributions are never normal. When we discuss deviations from normality, we should treat the illusion of normality not as an atomic mental construction, but rather as a set of independent assumptions, each of which may be violated independently. In this post, I take a look at what kind of issues we may have when the symmetry assumption is heavily violated.

Read more


Pragmatic Statistics Manifesto


Statistics is one of the most confusing, controversial, and depressing disciplines I know. So many different approaches, so many different opinions, so many arguments, so many person-years of wasted time, and so many flawed peer-reviewed papers.

What we want from statistics is an easy-to-use tool that would nudge us toward asking the right questions and then straightforwardly guide us on how to design proper and relevant statistical procedures. What we have is a bunch of vaguely described sets of strange equations, a few arbitrarily chosen magical numbers as thresholds, and no clear understanding of what to do.

In the scientific community, there are a lot of adherents of Frequentist statistics (both Neyman-Pearson and Fisherian), Bayesian statistics, Likelihood statistics, Nonparametric statistics, Robust statistics, and many other statistics. And almost no one discusses Pragmatic statistics. I feel like we really need something which is called Pragmatic statistics. However, it should not be just a set of “blessed” approaches but rather a mindset.

Let me make an attempt to speculate on the principles that should form the foundation of the Pragmatic statistics approach. In future posts, I will show how to apply these principles to solve real-world problems.

Read more


The Effect Existence, Its Magnitude, and the Goals


If you are curious if something impacts something else, the answer is probably “yes.” Does that indicator depend on those factors? Yes, it does. If we change this thing, would it affect …? Yes, it would. If a person takes this pill, could it cause a non-exactly-zero change in the body? Yes, the presence of the pill is already a change that can always be detected with the right amount of effort.

One may argue that in some cases (assuming the list of specific cases is presented), zero effect does exist. For a moment, let us pretend that it is true. Now, let us imagine a parallel universe, which is the same as ours but with the presence of the effect. Unfortunately, the effect is so small that our tools are not sophisticated enough to detect it. Imagine being put into one of these worlds, but you don’t know which one. How do you determine the existence of the effect? Of course, you can improve the resolution of the measurement tools via new scientific discoveries, but with the current state of technology, the absence of the effect cannot be checked. Therefore, it is always safer to assume that the effect exists, keeping in mind that it can be negligible. Let us accept this assumption and continue if it is absolute truth.

Read more


Case Study: A City Social Survey


Imagine a city mayor considering a project offering to build parks in several neighborhoods. It can be a good budget investment since it can potentially increase the happiness level of the citizens. However, it is just a hypothesis: if parks do not impact happiness, it is worth considering other city renovation projects. It makes sense to perform a pilot experiment before spending the budget on all the parks. The mayor is thinking about the following plan: pick a random neighborhood, survey the citizens to measure their happiness, build a park, survey the citizens again, compare the survey results, make a decision about the further parks in other neighborhoods. Someone is needed to design the survey and draw the conclusion.

Let us explore possible approaches to perform such a study. These artificial examples are not guidelines but rather simplified illustrations of possible mindsets presented as lists of thoughts. In this demonstration, we mainly focus on the attitude to the research process rather than on the technical details. All the examples are based on real stories.

Read more


Simplifying adjustments of confidence levels and practical significance thresholds


Translation of the buisness goals to the actual parameters of the statistical procedure is a non-trivial task. The degree of non-triviality increases if we should adjust several parameters at the same time. In this post, we consider a problem of simultaneous choice of the confidence level and the practical significance threshold. We discuss possible pitfalls and how to simplify the adjusting procedure to avoid them.

Read more


Degrees of practical significance


Let’s say we have two data samples, and we want to check if there is a difference between them. If we are talking about any kind of difference, the answer is most probably yes. It’s highly unlikely that two random samples are identical. Even if they are, there are still chances that we observe such a situation by accident, and there is a difference in the underlying distributions. Therefore, the discussion about the existence of any kind of difference is not meaningful.

To make more meaningful insights, researchers often talk about statistical significance. The approach can also be misleading. If the sample size is large enough, we are almost always able to detect even a neglectable difference and obtain a statistically significant result for any pair of distributions. On the other hand, a huge difference can be declared insignificant if the sample size is small. While the concept is interesting and well-researched, it rarely matches the actual research goal. I strongly believe that we should not test for the nil hypothesis (checking if the true difference is exactly zero).

Here, we can switch from statistical significance to practical significance. We are supposed to define a threshold (e.g., in terms of minimum effect size) for the difference that is meaningful for the research. This approach has more chances to be aligned with the research goals. However, it is also not always satisfying enough. We should keep in mind that hypothesis testing often arises in the context of decision-making problems. In some cases, we can do exploration research in which we just want to have a better understanding of the world. However, in most cases, we do not perform calculations just because we are curious; we often want to make a decision based on the results. And this is the most crucial moment. It should always be the starting point in any research project. First of all, we should clearly describe the possible decisions and their preconditions. When we start doing that, we can discover that not all the practically significant outcomes are equally significant. If different practically significant results may lead to different decisions, we should define the proper classification in advance during the research design stage. The dichotomy of “practically significant” vs. “not practically significant” may conceal important problem aspects and lead to a wrong decision.

In this post, I would like to discuss the degrees of practical significance and show an example of how important it is for some problems.

Read more


Weighted Mann-Whitney U test, Part 3


I continue building a weighted version of the Mann–Whitney $U$ test. While previously suggested approach feel promising, I don’t like the usage of Bootstrap to obtain the $p$-value. It is always better to have a deterministic and exact approach where it’s possible. I still don’t know how to solve it in general case, but it seems that I’ve obtained a reasonable solution for some specific cases. The current version of the approach still has issues and requires additional correction factors in some cases and additional improvements. However, it passes my minimal requirements, so it is worth trying to continue developing this idea. In this post, I share the description of the weighted approach and provide numerical examples.

Read more


Andreas Löffler's implementation of the exact p-values calculations for the Mann-Whitney U test


Mann-Whitney is one of the most popular non-parametric statistical tests. Unfortunately, most test implementations in statistical packages are far from perfect. The exact p-value calculation is time-consuming and can be impractical for large samples. Therefore, most implementations automatically switch to the asymptotic approximation, which can be quite inaccurate. Indeed, the classic normal approximation could produce enormous errors. Thanks to the Edgeworth expansion, the accuracy can be improved, but it is still not always satisfactory enough. I prefer using the exact p-value calculation whenever possible.

The computational complexity of the exact p-value calculation using the classic recurrent equation suggested by Mann and Whitney is $\mathcal{O}(n^2 m^2)$ in terms of time and memory. It’s not a problem for small samples, but for medium-size samples, it is slow, and it has an extremely huge memory footprint. This gives us an unpleasant dilemma: either we use the exact p-value calculation (which is extremely time and memory-consuming), or we use the asymptotic approximation (which gives poor accuracy).

Last week, I got acquainted with a brilliant algorithm for the exact p-value calculation suggested by Andreas Löffler in 1982. It’s much faster than the classic approach, and it requires only $\mathcal{O}(n+m)$ memory.

Read more


Eclectic statistics


In the world of mathematical statistics, there is a constant confrontation between adepts of different paradigms. This is a constant source of confusion for many researchers who struggle to pick out the proper approach to follow. For example, how to choose between the frequentist and Bayesian approaches? Since these paradigms may produce inconsistent results (e.g., see Lindley’s paradox), some choice has to be made. The easiest way to conduct research is to pick a single paradigm and stick to it. The right way to conduct research is to carefully think.

Read more


Change Point Detection and Recent Changes


Change point detection (CPD) in time series analysis is an essential tool for identifying significant shifts in data patterns. These shifts, or “change points,” can signal critical transitions in various contexts. While most CPD algorithms are adept at discovering historical change points, their sensitivity in detecting recent changes can be limited, often due to a key parameter: the minimum distance between sequential change points. In this post, I share some speculations on how we can improve cpd analysis by combining two change point detectors.

Read more