Posts on Andrey Akinshinhttps://aakinshin.net/posts/Recent content in Posts on Andrey AkinshinHugo -- gohugo.ioen-usTue, 17 May 2022 00:00:00 +0000Statistical efficiency of the Hodges-Lehmann median estimator, Part 1https://aakinshin.net/posts/hodgeslehmann-efficiency1/Tue, 17 May 2022 00:00:00 +0000https://aakinshin.net/posts/hodgeslehmann-efficiency1/<p>In this post, we evaluate the relative statistical efficiency of the Hodges-Lehmann median estimator
against the sample median under the normal distribution.
We also compare it with the efficiency of the Harrell-Davis quantile estimator.</p>Expected value of the maximum of two standard half-normal distributionshttps://aakinshin.net/posts/expected-max-half-normal/Tue, 10 May 2022 00:00:00 +0000https://aakinshin.net/posts/expected-max-half-normal/<p>Let <span class="math inline">\(X_1, X_2\)</span> be <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">i.i.d.</a>
random variables that follow the standard normal distribution <span class="math inline">\(\mathcal{N}(0,1^2)\)</span>.
In the <a href="https://aakinshin.net/posts/expected-min-half-normal/">previous post</a>,
I found the expected value of <span class="math inline">\(\min(|X_1|, |X_2|)\)</span>.
Now it’s time to find the expected value of <span class="math inline">\(Z = \max(|X_1|, |X_2|)\)</span>.</p>Expected value of the minimum of two standard half-normal distributionshttps://aakinshin.net/posts/expected-min-half-normal/Tue, 03 May 2022 00:00:00 +0000https://aakinshin.net/posts/expected-min-half-normal/<p>Let <span class="math inline">\(X_1, X_2\)</span> be <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">i.i.d.</a>
random variables that follow the standard normal distribution <span class="math inline">\(\mathcal{N}(0,1^2)\)</span>.
One day I wondered, what is the expected value of <span class="math inline">\(Z = \min(|X_1|, |X_2|)\)</span>?
It turned out to be a fun exercise.
Let’s solve it together!</p>Unbiased median absolute deviation for n=2https://aakinshin.net/posts/unbiased-mad-n2/Tue, 26 Apr 2022 00:00:00 +0000https://aakinshin.net/posts/unbiased-mad-n2/<p>I have already covered the topic of the unbiased median absolute deviation based on
<a href="https://aakinshin.net/posts/unbiased-mad/">the traditional sample median</a>,
<a href="https://aakinshin.net/posts/unbiased-mad-hd/">the Harrell-Davis quantile estimator</a>, and
<a href="https://aakinshin.net/posts/unbiased-mad-thd/">the trimmed Harrell-Davis quantile estimator</a>.
In all these posts, the values of the bias-correction factors were evaluated using Monte Carlo simulation.
In this post, we calculate the exact value of the bias-correction factor for two-element samples.</p>Weighted trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/wthdqe/Tue, 19 Apr 2022 00:00:00 +0000https://aakinshin.net/posts/wthdqe/<p>In this post, I combine ideas from two of my previous posts:</p>
<ul>
<li><a href="https://aakinshin.net/posts/pub-thdqe/">Trimmed Harrell-Davis quantile estimator</a>:
quantile estimator that provides an optimal trade-off between statistical efficiency and robustness</li>
<li><a href="https://aakinshin.net/posts/weighted-quantiles/">Weighted quantile estimators</a>:
a general scheme that allows building weighted quantile estimators.
It can be used for <a href="https://aakinshin.net/posts/quantile-exponential-smoothing/">quantile exponential smoothing</a>
and <a href="https://aakinshin.net/posts/dispersion-exponential-smoothing/">dispersion exponential smoothing</a>.</li>
</ul>
<p>Thus, we are going to build a weighted version of the trimmed Harrell-Davis quantile estimator based on the highest
density interval of the given width.</p>Minimum meaningful statistical level for the Mann–Whitney U testhttps://aakinshin.net/posts/mann-whitney-min-stat-level/Tue, 12 Apr 2022 00:00:00 +0000https://aakinshin.net/posts/mann-whitney-min-stat-level/<p>The <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann–Whitney U test</a> is one of the most popular
nonparametric null hypothesis significance tests.
However, like any statistical test, it has limitations.
We should always carefully match them with our business requirements.
In this post, we discuss how to properly choose the statistical level for the Mann–Whitney U test on small samples.</p>
<p>Let’s say we want to compare two samples <span class="math inline">\(x = \{ x_1, x_2, \ldots, x_n \}\)</span> and <span class="math inline">\(y = \{ y_1, y_2, \ldots, y_m \}\)</span>
using the one-sided Mann–Whitney U test.
Sometimes, we don’t have an opportunity to gather enough data and we have to work with small samples.
Imagine that the size of both samples is six: <span class="math inline">\(n=m=6\)</span>.
We want to set the statistical level <span class="math inline">\(\alpha\)</span> to <span class="math inline">\(0.001\)</span> (because we really don’t want to get false-positive results).
Is it a valid requirement?
In fact, the minimum p-value we can observe with <span class="math inline">\(n=m=6\)</span> is <span class="math inline">\(\approx 0.001082\)</span>.
Thus, with <span class="math inline">\(\alpha = 0.001\)</span>, it’s impossible to get a positive result.
Meanwhile, everything is correct from the technical point of view:
since we can’t get any positive results, the false positive rate is exactly zero, which is less than <span class="math inline">\(0.001\)</span>.
However, it’s definitely not something that we want: with this setup the test becomes useless because
it always provides negative results regardless of the input data.</p>
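<p>The minimum p-value mentioned above can be reproduced with a short sketch.
It assumes the exact distribution of the U statistic without ties:
under the null hypothesis, all <span class="math inline">\(\binom{n+m}{n}\)</span> rank assignments are equally likely,
and exactly one of them gives the most extreme value of U.
The function name <code>min_one_sided_p_value</code> is my own and is used only for illustration.</p>

```python
from math import comb


def min_one_sided_p_value(n: int, m: int) -> float:
    """Smallest achievable p-value of the exact one-sided Mann-Whitney U test.

    Assumption of this sketch: no ties and the exact null distribution,
    where only one of the C(n+m, n) equally likely rank assignments
    produces the most extreme U statistic.
    """
    return 1.0 / comb(n + m, n)


# For n = m = 6 there are C(12, 6) = 924 rank assignments,
# so the smallest p-value is 1/924, which is about 0.001082.
print(min_one_sided_p_value(6, 6))
```

<p>Any statistical level below this value makes a positive result impossible for the given sample sizes.</p>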
<p>This raises an important question: what is the minimum meaningful statistical level
that we can require for the one-sided Mann–Whitney U test knowing the sample sizes?</p>Fence-based outlier detectors, Part 2https://aakinshin.net/posts/fenced-outlier-detectors2/Tue, 05 Apr 2022 00:00:00 +0000https://aakinshin.net/posts/fenced-outlier-detectors2/<p>In the <a href="https://aakinshin.net/posts/fenced-outlier-detectors1/">previous post</a>,
I discussed different fence-based outlier detectors.
In this post, I show some examples of these detectors with different parameters.</p>Fence-based outlier detectors, Part 1https://aakinshin.net/posts/fenced-outlier-detectors1/Tue, 29 Mar 2022 00:00:00 +0000https://aakinshin.net/posts/fenced-outlier-detectors1/<p>In previous posts, I discussed properties of <a href="https://aakinshin.net/posts/tukey-outlier-probability/">Tukey’s fences</a>
and the asymmetric decile-based outlier detector
(<a href="https://aakinshin.net/posts/asymmetric-decile-outliers1/">Part 1</a>, <a href="https://aakinshin.net/posts/asymmetric-decile-outliers2/">Part 2</a>).
In this post, I discuss the generalization of fence-based outlier detectors.</p>Publication announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'https://aakinshin.net/posts/pub-thdqe/Tue, 22 Mar 2022 00:00:00 +0000https://aakinshin.net/posts/pub-thdqe/<p>Since the beginning of the previous year, I have been working on building a quantile estimator
that provides an optimal trade-off between statistical efficiency and robustness.
At the end of the year, I <a href="https://aakinshin.net/posts/preprint-thdqe/">published</a> the corresponding preprint
where I presented a description of such an estimator:
<a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.
The paper source code is available on GitHub:
<a href="https://github.com/AndreyAkinshin/paper-thdqe">AndreyAkinshin/paper-thdqe</a>.</p>
<p>Finally, the paper was published in <em>Communications in Statistics - Simulation and Computation</em>.
You can cite it as follows:</p>
<ul>
<li>Andrey Akinshin (2022)
<em>Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width,</em>
Communications in Statistics - Simulation and Computation,
DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a></li>
</ul>Asymmetric decile-based outlier detector, Part 2https://aakinshin.net/posts/asymmetric-decile-outliers2/Tue, 15 Mar 2022 00:00:00 +0000https://aakinshin.net/posts/asymmetric-decile-outliers2/<p>In the <a href="https://aakinshin.net/posts/asymmetric-decile-outliers1/">previous post</a>,
I suggested an asymmetric decile-based outlier detector
as an alternative to <a href="https://aakinshin.net/posts/tukey-outlier-probability/">Tukey’s fences</a>.
In this post, we run some numerical simulations to check out
the suggested outlier detector in action.</p>Asymmetric decile-based outlier detector, Part 1https://aakinshin.net/posts/asymmetric-decile-outliers1/Tue, 08 Mar 2022 00:00:00 +0000https://aakinshin.net/posts/asymmetric-decile-outliers1/<p>In the <a href="https://aakinshin.net/posts/tukey-outlier-probability/">previous post</a>, I covered some problems with the outlier detector
based on Tukey’s fences.
Mainly, I discussed the probability of observing outliers using Tukey’s fences
with different factors under different distributions.
However, it’s not the only problem with this approach.</p>
<p>Since Tukey’s fences are based on quartiles,
under multimodal distributions, we could get a situation
when 50% of all sample elements are marked as outliers.
Also, Tukey’s fences are designed for symmetric distributions,
so we could get strange results with asymmetric distributions.</p>
<p>In this post, I want to suggest an asymmetric outlier detector based on deciles
which mitigates this problem.</p>Probability of observing outliers using Tukey's fenceshttps://aakinshin.net/posts/tukey-outlier-probability/Tue, 01 Mar 2022 00:00:00 +0000https://aakinshin.net/posts/tukey-outlier-probability/<p><a href="https://en.wikipedia.org/wiki/Outlier#Tukey's_fences">Tukey’s fences</a> is one of the most popular
simple outlier detectors for one-dimensional number arrays.
This approach assumes that for a given sample, we calculate the first and third quartiles (<span class="math inline">\(Q_1\)</span> and <span class="math inline">\(Q_3\)</span>),
and mark all the sample elements outside the interval</p>
<p><span class="math display">\[[Q_1 - k (Q_3 - Q_1),\, Q_3 + k (Q_3 - Q_1)]
\]</span></p>
<p>as outliers.
The typical recommendation for <span class="math inline">\(k\)</span> is <span class="math inline">\(1.5\)</span> for “regular” outliers and <span class="math inline">\(3.0\)</span> for “far outliers”.
Here is a box plot example for a sample taken from the standard normal distribution (sample size is <span class="math inline">\(1000\)</span>):</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot1-light.png" target="_blank" class="imgldlink" alt="boxplot1">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot1-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot1-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot1-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>As we can see, 11 elements were marked as outliers (shown as dots).
Is it an expected result or not?
The answer depends on your goals.
There is no single definition of an outlier.
In fact, the chosen outlier detector provides a unique outlier definition.</p>
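<p>The fences defined above can be sketched in a few lines.
This is a minimal illustration, not the code behind the plots:
quartile definitions differ between tools, so the <code>"inclusive"</code> method of <code>statistics.quantiles</code>
is an assumption, and the function names are hypothetical.</p>

```python
from statistics import quantiles


def tukey_fences(sample, k=1.5):
    """Return the interval [Q1 - k*IQR, Q3 + k*IQR] as a (lower, upper) tuple."""
    # Assumption: quartiles are computed with the "inclusive" method;
    # other quartile definitions give slightly different fences.
    q1, _, q3 = quantiles(sample, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(sample, k=1.5):
    """Return the sample elements that fall outside Tukey's fences."""
    lower, upper = tukey_fences(sample, k)
    return [x for x in sample if x < lower or x > upper]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(sample))         # fences for k = 1.5
print(tukey_outliers(sample))       # elements outside the fences
print(tukey_outliers(sample, k=3))  # "far outliers" only
```

<p>A larger <span class="math inline">\(k\)</span> widens the interval, so fewer elements are reported as outliers.</p>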
<p>In my applications, I typically consider outliers as rare events that should be investigated.
When I detect too many outliers, all such reports become useless noise.
For example, in the image above, I wouldn’t treat any of the sample elements as outliers.
However, if we add the value <span class="math inline">\(10.0\)</span> to this sample, this new element is an obvious outlier (and the only one):</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot2-light.png" target="_blank" class="imgldlink" alt="boxplot2">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot2-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot2-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/tukey-outlier-probability/img/boxplot2-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Thus, an important property of an outlier detector is the “false positive rate”:
the percentage of samples with detected outliers which I wouldn’t treat as outliers.
In this post, I perform numerical simulations that show the probability of observing outliers
using Tukey’s fences with different <span class="math inline">\(k\)</span> values.</p>Gamma effect size powered by the middle non-zero quantile absolute deviationhttps://aakinshin.net/posts/gamma-es-mnzqad/Tue, 22 Feb 2022 00:00:00 +0000https://aakinshin.net/posts/gamma-es-mnzqad/<p>In <a href="https://aakinshin.net/tags/research-gamma-es/">previous posts</a>, I covered the concept of the gamma effect size.
It’s a nonparametric effect size which is consistent with Cohen’s d under the normal distribution.
However, the original definition has drawbacks: this statistic becomes zero
if half of the sample elements are equal to each other.
Last time, I <a href="https://aakinshin.net/posts/zero-mad-gamma-es/">suggested</a> a workaround for this problem:
we can replace the median absolute deviation by the <a href="https://aakinshin.net/posts/qad/">quantile absolute deviation</a>.
Unfortunately, this trick requires parameter tuning:
we should choose a proper quantile position to make this approach work.
Today I want to suggest a strategy that provides a way to make a generic choice:
we can use the <a href="https://aakinshin.net/posts/mnzqad/">middle non-zero quantile absolute deviation</a>.</p>Middle non-zero quantile absolute deviationhttps://aakinshin.net/posts/mnzqad/Tue, 15 Feb 2022 00:00:00 +0000https://aakinshin.net/posts/mnzqad/<p>Median absolute deviation (<span class="math inline">\(\operatorname{MAD}\)</span>) around the median is a popular robust measure of statistical dispersion.
Unfortunately, if we <a href="https://aakinshin.net/posts/discrete-performance-distributions/">work</a> with discrete distributions,
we could get zero <span class="math inline">\(\operatorname{MAD}\)</span> values.
It could bring some problems if we <a href="https://aakinshin.net/posts/zero-mad-gamma-es/">use</a> <span class="math inline">\(\operatorname{MAD}\)</span> as a denominator.
Such a problem is also relevant to some other quantile-based measures of dispersion
like interquartile range (<span class="math inline">\(\operatorname{IQR}\)</span>).</p>
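<p>The zero-<span class="math inline">\(\operatorname{MAD}\)</span> problem is easy to reproduce.
The sketch below uses the classic sample median without any scale factor;
the sample values are made up for illustration, and <code>raw_mad</code> is a hypothetical helper name.</p>

```python
from statistics import median


def raw_mad(sample):
    """Median absolute deviation around the median, without a scale factor."""
    center = median(sample)
    return median([abs(x - center) for x in sample])


# A discrete-looking sample: more than half of the values coincide.
tied = [10, 10, 10, 10, 11, 15]
print(raw_mad(tied))  # the majority of deviations are zero, so MAD is zero

# A sample without ties gives a positive MAD.
distinct = [1, 2, 3, 4, 5]
print(raw_mad(distinct))
```

<p>Dividing by <code>raw_mad(tied)</code> would fail, which is exactly the denominator problem described above.</p>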
<p>This problem could be solved using the <a href="https://aakinshin.net/posts/qad/">quantile absolute deviation</a> around the median.
However, it’s not always clear how to choose the right quantile to estimate.
In this post, I’m going to suggest a way of choosing it that is consistent with the classic <span class="math inline">\(\operatorname{MAD}\)</span>
under continuous distributions (and samples without tied values).</p>Unbiased median absolute deviation based on the trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/unbiased-mad-thd/Tue, 08 Feb 2022 00:00:00 +0000https://aakinshin.net/posts/unbiased-mad-thd/<p>The <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (<span class="math inline">\(\operatorname{MAD}\)</span>)
is a robust measure of scale.
For a sample <span class="math inline">\(x = \{ x_1, x_2, \ldots, x_n \}\)</span>, it’s defined as follows:</p>
<p><span class="math display">\[\operatorname{MAD}_n = C_n \cdot \operatorname{median}(|x - \operatorname{median}(x)|)
\]</span></p>
<p>where <span class="math inline">\(\operatorname{median}\)</span> is a median estimator and <span class="math inline">\(C_n\)</span> is a scale factor.
Using the right scale factor, we can use <span class="math inline">\(\operatorname{MAD}\)</span> as a consistent estimator
of the standard deviation under the normal distribution.
For huge samples, we can use the asymptotic value of <span class="math inline">\(C_n\)</span> which is</p>
<p><span class="math display">\[C_\infty = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056.
\]</span></p>
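<p>As a quick illustration of the definition above, here is a minimal sketch that uses the classic sample median
and the asymptotic constant <span class="math inline">\(C_\infty\)</span>.
The function name <code>mad</code> and the sample values are assumptions made for this example.</p>

```python
from statistics import NormalDist, median

# Asymptotic scale factor: 1 / Phi^{-1}(3/4), approximately 1.4826.
C_INF = 1.0 / NormalDist().inv_cdf(0.75)


def mad(sample, scale=C_INF):
    """Scaled median absolute deviation around the classic sample median.

    Assumption of this sketch: the asymptotic constant is used by default,
    which is appropriate only for large samples.
    """
    center = median(sample)
    return scale * median([abs(x - center) for x in sample])


print(C_INF)
# The extreme value 100 barely affects the result.
print(mad([1, 2, 3, 4, 100]))
```

<p>A different value of <code>scale</code> can be passed when a finite-sample factor is known.</p>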
<p>For small samples, we should use adjusted values <span class="math inline">\(C_n\)</span> which depend on the sample size.
However, <span class="math inline">\(C_n\)</span> depends not only on the sample size but also on the median estimator.
I have already covered how to obtain these values for
<a href="https://aakinshin.net/posts/unbiased-mad/">the traditional median estimator</a> and
<a href="https://aakinshin.net/posts/unbiased-mad-hd/">the Harrell-Davis median estimator</a>.
It’s time to get the <span class="math inline">\(C_n\)</span> values for
<a href="https://aakinshin.net/posts/preprint-thdqe/">the trimmed Harrell-Davis median estimator</a>.</p>Median absolute deviation vs. Shamos estimatorhttps://aakinshin.net/posts/mad-vs-shamos/Tue, 01 Feb 2022 00:00:00 +0000https://aakinshin.net/posts/mad-vs-shamos/<p>There are multiple ways to estimate statistical dispersion.
The standard deviation is the most popular one, but it’s not robust:
a single outlier could heavily corrupt the results.
Fortunately, we have robust measures of dispersion like the <em>median absolute deviation</em> and the <em>Shamos estimator</em>.
In this post, we perform numerical simulations and
compare these two estimators on different distributions and sample sizes.</p>Moving extended P² quantile estimatorhttps://aakinshin.net/posts/moving-ex-p2-quantile-estimator/Tue, 25 Jan 2022 00:00:00 +0000https://aakinshin.net/posts/moving-ex-p2-quantile-estimator/<p>In the previous posts, I discussed
<a href="https://aakinshin.net/posts/p2-quantile-estimator/">the P² quantile estimator</a>
(a sequential estimator which takes <span class="math inline">\(O(1)\)</span> memory and estimates a single predefined quantile),
<a href="https://aakinshin.net/posts/mp2-quantile-estimator/">the moving P² quantile estimator</a>
(a moving modification of P² which estimates quantiles within the moving window),
and <a href="https://aakinshin.net/posts/ex-p2-quantile-estimator/">the extended P² quantile estimator</a>
(a sequential estimator which takes <span class="math inline">\(O(m)\)</span> memory and estimates <span class="math inline">\(m\)</span> predefined quantiles).</p>
<p>Now it’s time to build <em>the moving modification of the extended P² quantile estimator</em>
which estimates <span class="math inline">\(m\)</span> predefined quantiles using <span class="math inline">\(O(m)\)</span> memory within the moving window.</p>Extended P² quantile estimatorhttps://aakinshin.net/posts/ex-p2-quantile-estimator/Tue, 18 Jan 2022 00:00:00 +0000https://aakinshin.net/posts/ex-p2-quantile-estimator/<p>I already covered <em>the P² quantile estimator</em> and its possible implementation improvements
in <a href="https://aakinshin.net/tags/research-p2qe/">several blog posts</a>.
This sequential estimator uses <span class="math inline">\(O(1)\)</span> memory and allows estimating a single predefined quantile.
Now it’s time to discuss <em>the extended P² quantile estimator</em> that allows estimating multiple predefined quantiles.
This extended version was suggested in the paper
<a href="https://doi.org/10.1177/003754978704900405">“Simultaneous estimation of several percentiles”</a>.
In this post, we briefly discuss the approach from this paper and how we can improve its implementation.</p>P² quantile estimator marker adjusting orderhttps://aakinshin.net/posts/p2-quantile-estimator-adjusting-order/Tue, 11 Jan 2022 00:00:00 +0000https://aakinshin.net/posts/p2-quantile-estimator-adjusting-order/<p>I have already written a few blog posts about the P² quantile estimator
(which is a sequential estimator that uses <span class="math inline">\(O(1)\)</span> memory):</p>
<ul>
<li><a href="https://aakinshin.net/posts/p2-quantile-estimator/">P² quantile estimator: estimating the median without storing values</a></li>
<li><a href="https://aakinshin.net/posts/p2-quantile-estimator-rounding-issue/">P² quantile estimator rounding issue</a></li>
<li><a href="https://aakinshin.net/posts/p2-quantile-estimator-initialization/">P² quantile estimator initialization strategy</a></li>
</ul>
<p>In this post, we continue improving the P² implementation
so that it gives better estimations for streams with a small number of elements.</p>P² quantile estimator initialization strategyhttps://aakinshin.net/posts/p2-quantile-estimator-initialization/Tue, 04 Jan 2022 00:00:00 +0000https://aakinshin.net/posts/p2-quantile-estimator-initialization/<p><strong>Update: the estimator accuracy could be improved using a bunch of <a href="https://aakinshin.net/tags/research-p2qe/">patches</a>.</strong></p>
<p>The P² quantile estimator is a sequential estimator that uses <span class="math inline">\(O(1)\)</span> memory.
Thus, for the given sequence of numbers, it allows estimating quantiles without storing values.
I have already written a few blog posts about it:</p>
<ul>
<li><a href="https://aakinshin.net/posts/p2-quantile-estimator/">P² quantile estimator: estimating the median without storing values</a></li>
<li><a href="https://aakinshin.net/posts/p2-quantile-estimator-rounding-issue/">P² quantile estimator rounding issue</a></li>
</ul>
<p>I tried this estimator in various contexts, and it shows pretty decent results.
However, recently I stumbled on a corner case:
if we want to estimate an extreme quantile (<span class="math inline">\(p < 0.1\)</span> or <span class="math inline">\(p > 0.9\)</span>),
this estimator provides inaccurate results on streams with a small number of elements (<span class="math inline">\(n < 10\)</span>).
While it looks like a minor issue, it would be nice to fix it.
In this post, we briefly discuss choosing a better initialization strategy to work around this problem.</p>Misleading geometric meanhttps://aakinshin.net/posts/misleading-geometric-mean/Tue, 28 Dec 2021 00:00:00 +0000https://aakinshin.net/posts/misleading-geometric-mean/<p>There are multiple ways to compute the “average” value of an array of numbers.
One such way is the <em>geometric mean</em>.
For a sample <span class="math inline">\(x = \{ x_1, x_2, \ldots, x_n \}\)</span>, the geometric mean is defined as follows:</p>
<p><span class="math display">\[\operatorname{GM}(x) = \sqrt[n]{x_1 x_2 \ldots x_n}
\]</span></p>
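<p>A direct product can overflow or underflow for long arrays,
so a common way to evaluate this formula is to average the logarithms.
The sketch below assumes strictly positive values; the function name is my own.
The standard library also provides <code>statistics.geometric_mean</code>.</p>

```python
import math


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(x))).

    Assumption: all values are strictly positive.
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))


# The cube root of 1 * 2 * 4 = 8 is 2 (up to floating-point rounding).
print(geometric_mean([1, 2, 4]))
```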
<p>This approach is widely recommended for some specific tasks.
Let’s say we want to compare the performance of two machines <span class="math inline">\(M_x\)</span> and <span class="math inline">\(M_y\)</span>.
In order to do this, we design a set of benchmarks <span class="math inline">\(b = \{b_1, b_2, \ldots, b_n \}\)</span>
and obtain two sets of measurements
<span class="math inline">\(x = \{ x_1, x_2, \ldots, x_n \}\)</span> and <span class="math inline">\(y = \{ y_1, y_2, \ldots, y_n \}\)</span>.
Once we have these two samples, we may have a desire to express the difference
between two machines as a single number and get a conclusion like
“Machine <span class="math inline">\(M_y\)</span> works two times faster than <span class="math inline">\(M_x\)</span>.”
I think that this approach is flawed because such a difference couldn’t be expressed as a single number:
the result heavily depends on the workloads that we analyze.
For example, imagine that <span class="math inline">\(M_x\)</span> is a machine with an HDD and a fast CPU, and <span class="math inline">\(M_y\)</span> is a machine with an SSD and a slow CPU.
In this case, <span class="math inline">\(M_x\)</span> could be faster on CPU-bound workloads while <span class="math inline">\(M_y\)</span> could be faster on disk-bound workloads.
I really like this summary from
<a href="https://www.eecs.umich.edu/techreports/cse/95/CSE-TR-231-95.pdf">“Notes on Calculating Computer Performance”</a>
by Bruce Jacob and Trevor Mudge (in the same paper, the authors criticize the approach with the geometric mean):</p>
<blockquote>
<p>Performance is therefore not a single number, but really a collection of implications.
It is nothing more or less than the measure of
how much time <em>we</em> save running <em>our</em> tests on the machines in question.
If someone else has similar needs to ours, our performance numbers will be useful to them.
However, two people with different sets of criteria will likely walk away
with two completely different performance numbers for the same machine.</p>
</blockquote>
<p>However, some other authors (e.g., <a href="https://doi.org/10.1145/5666.5673">“How not to lie with statistics: the correct way to summarize benchmark results”</a>)
actually recommend using the geometric mean to get such a number
that describes the performance ratio of <span class="math inline">\(M_x\)</span> and <span class="math inline">\(M_y\)</span>.
I have to admit that the geometric mean <em>could</em> provide a reasonable result in <em>some simple cases</em>.
Indeed, on normalized numbers, it works much better than the arithmetic mean
(that provides a meaningless result) because of its nice <a href="https://en.wikipedia.org/wiki/Geometric_mean#Application_to_normalized_values">property</a>:
<span class="math inline">\(\operatorname{GM}(x_i/y_i) = \operatorname{GM}(x_i) / \operatorname{GM}(y_i)\)</span>.
However, it doesn’t work properly in the general case.
Firstly, the desire to express the difference between two machines as a single number is flawed:
the result heavily depends on the chosen workloads.
Secondly, the performance of a single benchmark <span class="math inline">\(b_i\)</span> couldn’t be described as a single number <span class="math inline">\(x_i\)</span>:
we should consider the whole performance distributions.
In order to describe the difference between two distributions,
we could consider the <a href="https://aakinshin.net/posts/shift-and-ratio-functions/">shift and ratio functions</a>
(that work much better than the <a href="https://aakinshin.net/posts/shift-function-vs-distribution/">shift</a> and
<a href="https://aakinshin.net/posts/ratio-function-vs-distribution/">ratio</a> distributions).</p>
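<p>The normalization property mentioned above is easy to check numerically.
In this sketch, the measurement values are made up for illustration;
it compares the geometric mean with the arithmetic mean on the same ratios.</p>

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical measurements of three benchmarks on two machines.
x = [2.0, 8.0, 1.0]
y = [1.0, 2.0, 4.0]
ratios = [xi / yi for xi, yi in zip(x, y)]

# Geometric mean: the mean of ratios equals the ratio of means.
gm_of_ratios = geometric_mean(ratios)
ratio_of_gms = geometric_mean(x) / geometric_mean(y)
print(gm_of_ratios, ratio_of_gms, math.isclose(gm_of_ratios, ratio_of_gms))

# Arithmetic mean: the two quantities disagree,
# so the result depends on which machine is used for normalization.
am_of_ratios = fmean(ratios)
ratio_of_ams = fmean(x) / fmean(y)
print(am_of_ratios, ratio_of_ams, math.isclose(am_of_ratios, ratio_of_ams))
```

<p>This consistency is the reason the geometric mean is often recommended for normalized numbers;
it does not address the other problems listed above.</p>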
<p>Even if you consider a pretty homogeneous set of benchmarks and all the distributions are pretty narrow,
the geometric mean has severe drawbacks that you should keep in mind.
In this post, I briefly cover some of these drawbacks and highlight problems that you may have if you use this metric.</p>Matching quantile sets using likelihood based on the binomial coefficientshttps://aakinshin.net/posts/binomial-quantile-likelihood/Tue, 21 Dec 2021 00:00:00 +0000https://aakinshin.net/posts/binomial-quantile-likelihood/<p>Let’s say we have a distribution <span class="math inline">\(X\)</span> that is given by its <span class="math inline">\(s\)</span>-quantile values:</p>
<p><span class="math display">\[q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1})
\]</span></p>
<p>where <span class="math inline">\(Q_X\)</span> is the quantile function of <span class="math inline">\(X\)</span>, <span class="math inline">\(p_j = j / s\)</span>.</p>
<p>We also have a sample <span class="math inline">\(y = \{y_1, y_2, \ldots, y_n \}\)</span> that is given by its <span class="math inline">\(s\)</span>-quantile estimations:</p>
<p><span class="math display">\[q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}),
\]</span></p>
<p>where <span class="math inline">\(Q_y\)</span> is the quantile estimation function for sample <span class="math inline">\(y\)</span>.
We also assume that <span class="math inline">\(q_{y_0} = \min(y)\)</span>, <span class="math inline">\(q_{y_s} = \max(y)\)</span>.</p>
<p>We want to know the likelihood of “<span class="math inline">\(y\)</span> is drawn from <span class="math inline">\(X\)</span>”.
In this post, I want to suggest a nice way to do this using the binomial coefficients.</p>Ratio function vs. ratio distributionhttps://aakinshin.net/posts/ratio-function-vs-distribution/Tue, 14 Dec 2021 00:00:00 +0000https://aakinshin.net/posts/ratio-function-vs-distribution/<p>Let’s say we have two distributions <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>.
In the <a href="https://aakinshin.net/posts/shift-function-vs-distribution/">previous post</a>,
we discussed how to express the “absolute difference” between them
using the shift function and the shift distribution.
Now let’s discuss how to express the “relative difference” between them.
This abstract term could also be expressed in various ways.
My favorite approach is to build the <a href="https://aakinshin.net/posts/shift-and-ratio-functions/">ratio function</a>.
In order to do this, for each quantile <span class="math inline">\(p\)</span>, we should calculate <span class="math inline">\(Q_Y(p)/Q_X(p)\)</span> where <span class="math inline">\(Q\)</span> is the quantile function.
However, some people prefer using the <a href="https://en.wikipedia.org/wiki/Ratio_distribution">ratio distribution</a> <span class="math inline">\(Y/X\)</span>.
While both approaches may provide similar results for narrow positive non-overlapping distributions,
they are not equivalent in the general case.
In this post, we briefly consider examples of both approaches.</p>Shift function vs. shift distributionhttps://aakinshin.net/posts/shift-function-vs-distribution/Tue, 07 Dec 2021 00:00:00 +0000https://aakinshin.net/posts/shift-function-vs-distribution/<p>Let’s say we have two distributions <span class="math inline">\(X\)</span> and <span class="math inline">\(Y\)</span>,
and we want to express the “absolute difference” between them.
This abstract term could be expressed in various ways.
My favorite approach is to build <a href="https://aakinshin.net/posts/shift-and-ratio-functions/">Doksum’s shift function</a>.
In order to do this, for each quantile <span class="math inline">\(p\)</span>, we should calculate <span class="math inline">\(Q_Y(p)-Q_X(p)\)</span> where <span class="math inline">\(Q\)</span> is the quantile function.
However, some people prefer using the shift distribution <span class="math inline">\(Y-X\)</span>.
While both approaches may provide similar results for narrow non-overlapping distributions,
they are not equivalent in the general case.
In this post, we briefly consider examples of both approaches.</p>Preprint announcement: 'Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width'https://aakinshin.net/posts/preprint-thdqe/Tue, 30 Nov 2021 00:00:00 +0000https://aakinshin.net/posts/preprint-thdqe/<p><strong>Update: the <a href="https://aakinshin.net/posts/pub-thdqe/">final paper</a> was published in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).</strong></p>
<p>Since the beginning of this year, I have been working on building a quantile estimator
that provides an optimal trade-off between statistical efficiency and robustness.
Finally, I have built such an estimator.
A paper preprint is available on arXiv:
<a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.
The paper source code is available on GitHub:
<a href="https://github.com/AndreyAkinshin/paper-thdqe">AndreyAkinshin/paper-thdqe</a>.
You can cite it as follows:</p>
<ul>
<li>Andrey Akinshin (2021)
Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width,
<a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776</a></li>
</ul>Non-normal median sampling distributionhttps://aakinshin.net/posts/non-normal-median-distribution/Tue, 23 Nov 2021 00:00:00 +0000https://aakinshin.net/posts/non-normal-median-distribution/<p>Let’s consider the classic sample median.
If a sample is sorted and the number of sample elements is odd, the median is the middle element.
In the case of an even number of sample elements, the median is an arithmetic average of the two middle elements.</p>
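<p>The definition above can be sketched in a few lines of Python.
This is only an illustration: the function name <code>sample_median</code> is chosen here and does not come from the original post.</p>

```python
def sample_median(values):
    """Return the classic sample median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("sample_median requires at least one element")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of elements: the median is the middle element.
        return ordered[middle]
    # Even number of elements: average the two middle elements.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sample_median([5, 1, 3]))
print(sample_median([4, 1, 3, 2]))
```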
<p>Now let’s say we randomly take many samples from the same distribution and calculate the median for each of them.
Next, we build a sampling distribution based on these median values.
There is a well-known fact that this distribution is asymptotically normal with mean <span class="math inline">\(M\)</span> and variance <span class="math inline">\(1/(4nf^2(M))\)</span>,
where <span class="math inline">\(n\)</span> is the number of elements in samples,
<span class="math inline">\(f\)</span> is the probability density function of the original distribution,
and <span class="math inline">\(M\)</span> is the true median of the original distribution.</p>
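<p>A minimal simulation sketch of this asymptotic result, assuming samples from the standard normal distribution
(so that <span class="math inline">\(M = 0\)</span> and <span class="math inline">\(f(M) = 1/\sqrt{2\pi}\)</span>).
The function names, the sample size, the number of repetitions, and the seed are illustrative choices, not values from the original post.</p>

```python
import math
import random
import statistics


def asymptotic_median_variance(n, density_at_median):
    """Variance of the asymptotic normal model: 1 / (4 * n * f(M)^2)."""
    return 1.0 / (4.0 * n * density_at_median ** 2)


def simulated_median_variance(n, repetitions, seed=42):
    """Empirical variance of sample medians of standard normal samples."""
    rng = random.Random(seed)
    medians = []
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        medians.append(statistics.median(sample))
    return statistics.variance(medians)


n = 101
density_at_zero = 1.0 / math.sqrt(2.0 * math.pi)  # standard normal PDF at M = 0
print("asymptotic:", asymptotic_median_variance(n, density_at_zero))
print("simulated: ", simulated_median_variance(n, repetitions=2000))
```

<p>For a smooth unimodal distribution like this one, the two numbers should be close; the agreement is not guaranteed for other distributions.</p>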
<p>Unfortunately, if we try to build such sampling distributions in practice,
we may see that they are not always normal.
There are some corner cases that prevent us from using the normal model in general.
If you implement general routines that analyze the median behavior,
you should keep such cases in mind.
In this post, we briefly talk about some of these cases.</p>Misleading kurtosishttps://aakinshin.net/posts/misleading-kurtosis/Tue, 16 Nov 2021 00:00:00 +0000https://aakinshin.net/posts/misleading-kurtosis/<p>I have already discussed the misleadingness of metrics such as
<a href="https://aakinshin.net/posts/misleading-stddev/">standard deviation</a> and <a href="https://aakinshin.net/posts/misleading-skewness/">skewness</a>.
It’s time to discuss the misleadingness of a measure of tailedness: kurtosis
(which is sometimes incorrectly interpreted as a measure of peakedness).
Typically, the concept of kurtosis is explained with the help of images like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/misleading-kurtosis/img/kurt_intro-light.png" target="_blank" class="imgldlink" alt="kurt_intro">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/misleading-kurtosis/img/kurt_intro-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/misleading-kurtosis/img/kurt_intro-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/misleading-kurtosis/img/kurt_intro-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Unfortunately, the raw kurtosis value may provide wrong insights about distribution properties.
In this post, we briefly discuss the sources of its misleadingness:</p>
<ul>
<li>There are multiple definitions of kurtosis.
The most significant confusion arises between “kurtosis” and “excess kurtosis,”
but there are other definitions of this measure.</li>
<li>Kurtosis may work fine for unimodal distributions, but its interpretation is much less clear for multimodal distributions.</li>
<li>The classic definition of kurtosis is not robust: it could be easily spoiled by extreme outliers.</li>
</ul>Misleading skewnesshttps://aakinshin.net/posts/misleading-skewness/Tue, 09 Nov 2021 00:00:00 +0000https://aakinshin.net/posts/misleading-skewness/<p>Skewness is a commonly used measure of the asymmetry of probability distributions.
A typical skewness interpretation comes down to an image like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/misleading-skewness/img/skew_intro-light.png" target="_blank" class="imgldlink" alt="skew_intro">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/misleading-skewness/img/skew_intro-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/misleading-skewness/img/skew_intro-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/misleading-skewness/img/skew_intro-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>It looks extremely simple: using the skewness sign,
we get an idea of the distribution form and the arrangement of the mean and the median.
Unfortunately, it doesn’t always work as expected.
Skewness estimation could be a highly misleading metric
(even more misleading than the <a href="https://aakinshin.net/posts/misleading-stddev/">standard deviation</a>).
In this post, I discuss four sources of its misleadingness:</p>
<ul>
<li>“Skewness” is a generic term; it has multiple definitions.
When a skewness value is presented, you can’t always guess the underlying equation without additional details.</li>
<li>Skewness is “designed” for unimodal distributions; it’s meaningless in the case of multimodality.</li>
<li>Most default skewness definitions are not robust: a single outlier could completely distort the skewness value.</li>
<li>We can’t make conclusions about the locations of the mean and the median based on the skewness sign.</li>
</ul>Greenwald-Khanna quantile estimatorhttps://aakinshin.net/posts/greenwald-khanna-quantile-estimator/Tue, 02 Nov 2021 00:00:00 +0000https://aakinshin.net/posts/greenwald-khanna-quantile-estimator/<p>The Greenwald-Khanna quantile estimator is a classic sequential quantile estimator
which has the following features:</p>
<ul>
<li>It allows estimating quantiles with respect to the given precision <span class="math inline">\(\epsilon\)</span>.</li>
<li>It requires <span class="math inline">\(O(\frac{1}{\epsilon} \log(\epsilon N))\)</span> memory in the worst case.</li>
<li>It doesn’t require knowledge of the total number of elements in the sequence
and the positions of the requested quantiles.</li>
</ul>
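<p>To get a feeling for how slowly the worst-case memory bound grows with the sequence length,
here is a small sketch that evaluates <span class="math inline">\(\frac{1}{\epsilon} \log(\epsilon N)\)</span> directly.
It ignores the constant factor hidden by the big-O notation and uses the natural logarithm (the base only changes that constant),
so the values are growth indicators rather than actual memory sizes; the function name is hypothetical.</p>

```python
import math


def gk_memory_growth(epsilon, n):
    """Evaluate (1 / epsilon) * log(epsilon * n), the growth term of the bound.

    The hidden big-O constant is ignored, so this is not an element count.
    """
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must be in (0, 1)")
    if epsilon * n <= 1.0:
        raise ValueError("the bound is only meaningful when epsilon * n > 1")
    return math.log(epsilon * n) / epsilon


for n in (10 ** 4, 10 ** 6, 10 ** 8):
    print(n, round(gk_memory_growth(0.01, n), 1))
```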
<p>In this post,
I briefly explain the basic idea of the underlying data structure,
and share a copy-pastable C# implementation.
At the end of the post, I discuss some important implementation decisions
that are unclear from the original paper,
but heavily affect the estimator accuracy.</p>P² quantile estimator rounding issuehttps://aakinshin.net/posts/p2-quantile-estimator-rounding-issue/Tue, 26 Oct 2021 00:00:00 +0000https://aakinshin.net/posts/p2-quantile-estimator-rounding-issue/<p><strong>Update: the estimator accuracy could be improved using a bunch of <a href="https://aakinshin.net/tags/research-p2qe/">patches</a>.</strong></p>
<p>The P² quantile estimator is a sequential estimator that uses <span class="math inline">\(O(1)\)</span> memory.
Thus, for the given sequence of numbers, it allows estimating quantiles without storing values.
I already wrote <a href="https://aakinshin.net/posts/p2-quantile-estimator/">a blog post</a> about this approach and
<a href="https://github.com/AndreyAkinshin/perfolizer/commit/9e9ff80a4d097fe4c0814ca51c7fbe942763e308">added</a>
its implementation in <a href="https://github.com/AndreyAkinshin/perfolizer">perfolizer</a>.
Recently, I got a <a href="https://github.com/AndreyAkinshin/perfolizer/issues/8">bug report</a>
that revealed a flaw of the <a href="https://doi.org/10.1145/4372.4378">original paper</a>.
In this post, I’m going to briefly discuss this issue and the corresponding fix.</p>Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given widthhttps://aakinshin.net/posts/thdqe-hdi/Tue, 19 Oct 2021 00:00:00 +0000https://aakinshin.net/posts/thdqe-hdi/<p><strong>This post aggregates research from <a href="https://aakinshin.net/tags/research-thdqe/">several blog posts</a> that I published during this year.
It presents an overview of
the Trimmed Harrell-Davis quantile estimator based on the highest density interval of the given width.
The <a href="https://aakinshin.net/posts/pub-thdqe/">corresponding paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>Traditional quantile estimators that are based on one or two order statistics are a common way to estimate
distribution quantiles based on the given samples.
These estimators are robust, but their statistical efficiency is not always good enough.
A more efficient alternative is the Harrell-Davis quantile estimator which uses
a weighted sum of all order statistics.
While this approach provides more accurate estimations for light-tailed distributions, it’s not robust.
To be able to customize the trade-off between statistical efficiency and robustness,
we could consider <em>a trimmed modification of the Harrell-Davis quantile estimator</em>.
In this approach, we discard order statistics with low weights according to
the highest density interval of the beta distribution.</p>Optimal window of the trimmed Harrell-Davis quantile estimator, Part 2: Trying Planck-taper windowhttps://aakinshin.net/posts/thdqe-ow2/Tue, 12 Oct 2021 00:00:00 +0000https://aakinshin.net/posts/thdqe-ow2/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the <a href="https://aakinshin.net/posts/thdqe-ow1/">previous post</a>,
I discussed the problem of non-smooth quantile-respectful density estimation (QRDE)
which is generated by the trimmed Harrell-Davis quantile estimator
based on the highest density interval of the given width.
I assumed that non-smoothness was caused by a non-smooth rectangular window
which was used to build the truncated beta distribution.
In this post, we are going to try another option: the Planck-taper window.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/thdqe-ow2/img/qrde-light.png" target="_blank" class="imgldlink" alt="qrde">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/thdqe-ow2/img/qrde-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/thdqe-ow2/img/qrde-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/thdqe-ow2/img/qrde-light.png">
</picture>
</a>
</div>
</div>
<br />Optimal window of the trimmed Harrell-Davis quantile estimator, Part 1: Problems with the rectangular windowhttps://aakinshin.net/posts/thdqe-ow1/Tue, 05 Oct 2021 00:00:00 +0000https://aakinshin.net/posts/thdqe-ow1/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the previous post, we obtained a version of the trimmed Harrell-Davis quantile estimator
that provides a good trade-off between robustness and statistical efficiency
of quantile estimations.
Unfortunately, it has a severe drawback.
If we build a <a href="https://aakinshin.net/posts/qrde-hd/">quantile-respectful density estimation</a> based on the suggested estimator,
we won’t get a smooth density function as in the case of the classic Harrell-Davis quantile estimator:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/thdqe-ow1/img/qrde-light.png" target="_blank" class="imgldlink" alt="qrde">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/thdqe-ow1/img/qrde-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/thdqe-ow1/img/qrde-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/thdqe-ow1/img/qrde-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>In this blog post series, we are going to find a way to improve the trimmed Harrell-Davis quantile estimator
so that it gives a smooth density function and keeps its advantages in terms of robustness and statistical efficiency.</p>Beta distribution highest density interval of the given widthhttps://aakinshin.net/posts/beta-hdi/Tue, 28 Sep 2021 00:00:00 +0000https://aakinshin.net/posts/beta-hdi/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In one of <a href="https://aakinshin.net/posts/kosqe5/">the previous posts</a>, I discussed the idea of the trimmed Harrell-Davis quantile estimator
based on the highest density interval of the given width.
Since the Harrell-Davis quantile estimator uses the beta distribution,
we should be able to find the beta distribution highest density interval of the given width.
In this post, I will show how to do this.</p>Quantile estimators based on k order statistics, Part 8: Winsorized Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/kosqe8/Tue, 21 Sep 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe8/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the <a href="https://aakinshin.net/posts/kosqe7/">previous post</a>, we have discussed
the trimmed modification of the Harrell-Davis quantile estimator
based on the highest density interval of size <span class="math inline">\(\sqrt{n}/n\)</span>.
This quantile estimator showed a decent level of statistical efficiency.
However, the research wouldn’t be complete without comparison with the winsorized modification.
Let’s fix it!</p>Quantile estimators based on k order statistics, Part 7: Optimal threshold for the trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/kosqe7/Tue, 14 Sep 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe7/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the <a href="https://aakinshin.net/posts/kosqe6/">previous post</a>, we have obtained a nice quantile estimator.
To be specific, we considered a trimmed modification of the Harrell-Davis quantile estimator
based on the highest density interval of the given size.
The interval size is a parameter that controls the trade-off between statistical efficiency and robustness.
While it’s nice to have the ability to control this trade-off, there is also a need for the default value,
which could be used as a starting point
when we have neither estimator breakdown point requirements nor prior knowledge about distribution properties.</p>
<p>After a series of unsuccessful attempts, it seems that I have found an acceptable solution.
We should build the new estimator based on <span class="math inline">\(\sqrt{n}/n\)</span> order statistics.
In this post, I’m going to briefly explain the idea behind the suggested estimator and
share some numerical simulations that compare the proposed estimator
and the classic Harrell-Davis quantile estimator.</p>Quantile estimators based on k order statistics, Part 6: Continuous trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/kosqe6/Tue, 07 Sep 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe6/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In my <a href="https://aakinshin.net/posts/kosqe5/">previous post</a>,
I tried the idea of using the trimmed modification of the Harrell-Davis quantile estimator
based on the highest density interval of the given width.
The width was defined so that it covers exactly k order statistics (the width equals <span class="math inline">\((k-1)/n\)</span>).
I was pretty satisfied with the result and decided to continue evolving this approach.
While “k order statistics” is a good mental model that describes the trimmed interval,
it doesn’t actually require an integer k.
In fact, we can use any real number as the trimming percentage.</p>
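<p>The relation between the interval width and k mentioned above can be written down directly.
This sketch only encodes the formula <span class="math inline">\((k-1)/n\)</span> and its inverse;
the function names and the range check are illustrative assumptions, not part of the original post.</p>

```python
def hdi_width(k, n):
    """Interval width that covers k order statistics: (k - 1) / n.

    k does not have to be an integer; any real value in [1, n] is accepted here.
    """
    if n < 1 or not 1 <= k <= n:
        raise ValueError("expected n >= 1 and 1 <= k <= n")
    return (k - 1) / n


def covered_order_statistics(width, n):
    """Inverse relation: the (possibly fractional) k for the given width."""
    return width * n + 1


print(hdi_width(5, 20))
print(hdi_width(3.5, 10))
print(covered_order_statistics(0.2, 20))
```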
<p>In this post, we are going to perform numerical simulations that check the statistical efficiency
of the trimmed Harrell-Davis quantile estimator with different trimming percentages.</p>Quantile estimators based on k order statistics, Part 5: Improving trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/kosqe5/Tue, 31 Aug 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe5/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>During the last several months,
I have been experimenting with different variations of the trimmed Harrell-Davis quantile estimator.
<a href="https://aakinshin.net/posts/trimmed-hdqe/">My original idea</a>
of using the highest density interval based on the fixed area percentage (e.g., HDI 95% or HDI 99%)
led to a set of problems with <a href="https://aakinshin.net/posts/thdqe-overtrimming/">overtrimming</a>.
I tried to solve them with <a href="https://aakinshin.net/posts/customized-wthdqe/">manually customized</a> trimming strategy,
but this approach turned out to be too inconvenient;
it was too hard to come up with <a href="https://aakinshin.net/posts/thdqe-threshold/">optimal thresholds</a>.
One of the main problems was about the suboptimal number of elements
that we actually aggregate to obtain the quantile estimation.
So, I decided to try an <a href="https://aakinshin.net/posts/kosqe1/">approach that involves exactly k order statistics</a>.
The idea looked promising,
but numerical simulations <a href="https://aakinshin.net/posts/kosqe4/">didn’t show</a> the appropriate efficiency level.</p>
<p>This bothered me the whole week.
It sounded so reasonable to trim the Harrell-Davis quantile estimator using exactly k order statistics.
Why didn’t this work as expected?
Finally, I have found a fatal flaw in <a href="https://aakinshin.net/posts/kosqe4/">my previous approach</a>:
while it was a good idea to fix the size of the trimming window,
I mistakenly chose its location following the equation from the Hyndman-Fan Type 7 quantile estimator!</p>
<p>In this post, we fix this problem and try another modification of the trimmed Harrell-Davis quantile estimator based on
k order statistics <strong>and</strong> highest density intervals at the same time.</p>Quantile estimators based on k order statistics, Part 4: Adopting trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/kosqe4/Tue, 24 Aug 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe4/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the previous posts, I discussed various aspects of
<a href="https://aakinshin.net/posts/kosqe1/">quantile estimators based on k order statistics</a>.
I already tried a few weight functions that aggregate the sample values to the quantile estimators
(see posts about <a href="https://aakinshin.net/posts/kosqe2/">an extension of the Hyndman-Fan Type 7 equation</a> and
about <a href="https://aakinshin.net/posts/kosqe3/">adjusted regularized incomplete beta function</a>).
In this post, I continue my experiments and try to adopt the
<a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed modifications of the Harrell-Davis quantile estimator</a> to this approach.</p>Quantile estimators based on k order statistics, Part 3: Playing with the Beta functionhttps://aakinshin.net/posts/kosqe3/Tue, 17 Aug 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe3/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the previous two posts, I discussed the idea of quantile estimators based on k order statistics.
I already covered the <a href="https://aakinshin.net/posts/kosqe1/">motivation behind this idea</a>
and the statistical efficiency of such estimators using the <a href="https://aakinshin.net/posts/kosqe2/">extended Hyndman-Fan equations</a>
as a weight function.
Now it’s time to experiment with the Beta function as a primary way to aggregate k order statistics
into a single quantile estimation!</p>Quantile estimators based on k order statistics, Part 2: Extending Hyndman-Fan equationshttps://aakinshin.net/posts/kosqe2/Tue, 10 Aug 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe2/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In the <a href="https://aakinshin.net/posts/kosqe1/">previous post</a>,
I described the idea of using quantile estimators based on k order statistics.
Potentially, such estimators could be more robust than estimators based on all sample elements (like
Harrell-Davis,
<a href="https://aakinshin.net/posts/sfakianakis-verginis-quantile-estimator/">Sfakianakis-Verginis</a>, or
<a href="https://aakinshin.net/posts/navruz-ozdemir-quantile-estimator/">Navruz-Özdemir</a>)
and more statistically efficient than traditional quantile estimators (based on 1 or 2 order statistics).
Moreover, we should be able to control this trade-off based on the business requirements
(e.g., setting the desired breakdown point).</p>
<p>The only challenging thing here is choosing the weight function
that aggregates k order statistics into a single quantile estimation.
We are going to try several options, perform Monte-Carlo simulations for each of them, and compare the results.
A reasonable starting point is an extension of the traditional quantile estimators.
In this post, we are going to extend the Hyndman-Fan Type 7 quantile estimator
(nowadays, it’s one of the most popular estimators).
It estimates quantiles as a linear interpolation of two subsequent order statistics.
We are going to make some modifications, so a new version is going to be based on k order statistics.</p>
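<p>For reference, the baseline Hyndman-Fan Type 7 estimator described above can be sketched as follows.
This is the standard linear-interpolation formula with <span class="math inline">\(h = (n-1)p\)</span> on zero-based indexes;
the function name <code>hf7_quantile</code> is hypothetical, and the sketch shows only the baseline, not the extension to k order statistics.</p>

```python
import math


def hf7_quantile(values, p):
    """Hyndman-Fan Type 7 quantile estimate for probability p in [0, 1]."""
    if not values:
        raise ValueError("hf7_quantile requires at least one element")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    ordered = sorted(values)
    h = (len(ordered) - 1) * p                # fractional zero-based position
    lower = math.floor(h)                     # index of the lower order statistic
    upper = min(lower + 1, len(ordered) - 1)  # index of the next one (clamped at the end)
    fraction = h - lower
    # Linear interpolation between two subsequent order statistics.
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


print(hf7_quantile([1, 2, 3, 4], 0.5))
print(hf7_quantile([10, 20, 30, 40, 50], 0.25))
```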
<p><strong>Spoiler: this approach doesn’t seem like an optimal one.</strong>
I’m pretty disappointed with its statistical efficiency on samples from light-tailed distributions.
So, what’s the point of writing a blog post about an inefficient approach?
Because of the following reasons:</p>
<ol>
<li>I believe it’s crucial to share negative results.
Sometimes, knowledge about approaches that don’t work
could be more important than knowledge about more effective techniques.
Negative results give you a broader view of the problem
and protect you from wasting your time on potentially promising (but not so useful) ideas.</li>
<li>Negative results improve research completeness.
When we present an approach, it’s essential to not only show why it solves problems well,
but also why it solves problems better than other similar approaches.</li>
<li>While I wouldn’t recommend my extension of the Hyndman-Fan Type 7 quantile estimator to the k order statistics case
as the default quantile estimator, there are some specific cases where it could be useful.
For example, if we estimate the median based on small samples from a symmetric light-tailed distribution,
it could outperform not only the original version but also the Harrell-Davis quantile estimator.
The “negativity” of the negative results always exists in a specific context.
So, there may be cases when negative results for the general case transform to positive results
for a particular niche problem.</li>
<li>Finally, it’s my personal blog, so I have the freedom to write on any topic I like.
My blog posts are not publications in scientific journals (which typically don’t welcome negative results),
but rather research notes about conducted experiments.
It’s important for me to keep records of all the experiments I perform regardless of the usefulness of the results.</li>
</ol>
<p>So, let’s briefly look at the results of this not-so-useful approach.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/kosqe2/img/LightAndHeavy__N15_Efficiency-light.png" target="_blank" class="imgldlink" alt="LightAndHeavy__N15_Efficiency">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/kosqe2/img/LightAndHeavy__N15_Efficiency-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/kosqe2/img/LightAndHeavy__N15_Efficiency-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/kosqe2/img/LightAndHeavy__N15_Efficiency-light.png">
</picture>
</a>
</div>
</div>
<br />Quantile estimators based on k order statistics, Part 1: Motivationhttps://aakinshin.net/posts/kosqe1/Tue, 03 Aug 2021 00:00:00 +0000https://aakinshin.net/posts/kosqe1/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>It’s not easy to choose a good quantile estimator.
In my previous posts, I considered several groups of quantile estimators:</p>
<ul>
<li>Quantile estimators based on 1 or 2 order statistics (Hyndman-Fan Types 1-9)</li>
<li>Quantile estimators based on all order statistics
(the Harrell-Davis quantile estimator,
the <a href="https://aakinshin.net/posts/sfakianakis-verginis-quantile-estimator/">Sfakianakis-Verginis quantile estimator</a>, and
the <a href="https://aakinshin.net/posts/navruz-ozdemir-quantile-estimator/">Navruz-Özdemir quantile estimator</a>)</li>
<li>Quantile estimators based on a variable number of order statistics
(the <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed</a> and <a href="https://aakinshin.net/posts/winsorized-hdqe/">winsorized</a> modifications
of the Harrell-Davis quantile estimator)</li>
</ul>
<p>Unfortunately, all of these estimators have significant drawbacks
(e.g., poor statistical efficiency or poor robustness).
In this post, I want to discuss all of the advantages and disadvantages of each approach
and suggest another family of quantile estimators that are based on k order statistics.</p>Avoiding over-trimming with the trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/thdqe-overtrimming/Tue, 27 Jul 2021 00:00:00 +0000https://aakinshin.net/posts/thdqe-overtrimming/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>I have already discussed the
<a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed modification of the Harrell-Davis quantile estimator</a> several times.
I performed several numerical simulations that compare the statistical efficiency of this estimator
with the efficiency of the <a href="https://aakinshin.net/posts/wthdqe-efficiency/">classic Harrell-Davis quantile estimator</a> (HDQE)
and its <a href="https://aakinshin.net/posts/winsorized-hdqe/">winsorized modification</a>;
I showed how we can improve the efficiency using <a href="https://aakinshin.net/posts/customized-wthdqe/">custom trimming strategies</a>
and how to choose a <a href="https://aakinshin.net/posts/thdqe-threshold/">good trimming threshold value</a>.</p>
<p>In the heavy-tailed cases, the trimmed HDQE provides better estimations than the classic HDQE
because of its higher breakdown point.
However, in the light-tailed cases, we could get efficiency that is worse than
the baseline Hyndman-Fan Type 7 (HF7) quantile estimator.
In many cases, this happens because of over-trimming.
If the trimming percentage is too high or if the evaluated quantile is too far from the median,
the trimming strategy based on the highest-density interval may lead to an estimation
that is based on a single order statistic.
In this case, we get an efficiency level similar to the Hyndman-Fan Type 1-3 quantile estimators
(which are also based on a single order statistic).
In the light-tailed case, such a result is less preferable than Hyndman-Fan Type 4-9 quantile estimators
(which are based on two consecutive order statistics).</p>
<p>In order to improve the situation, we could introduce a lower bound for the number of order statistics
that contribute to the final quantile estimation.
In this post, I look at some numerical simulations
that compare trimmed HDQEs with different lower bounds.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/thdqe-overtrimming/img/LightAndHeavy__N05_Efficiency-light.png" target="_blank" class="imgldlink" alt="LightAndHeavy__N05_Efficiency">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/thdqe-overtrimming/img/LightAndHeavy__N05_Efficiency-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/thdqe-overtrimming/img/LightAndHeavy__N05_Efficiency-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/thdqe-overtrimming/img/LightAndHeavy__N05_Efficiency-light.png">
</picture>
</a>
</div>
</div>
<br />Optimal threshold of the trimmed Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/thdqe-threshold/Tue, 20 Jul 2021 00:00:00 +0000https://aakinshin.net/posts/thdqe-threshold/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>The traditional quantile estimators (which are based on 1 or 2 order statistics) have great robustness.
However, the statistical efficiency of these estimators is not so great.
The Harrell-Davis quantile estimator has much better efficiency (at least in the light-tailed case),
but it’s not robust (because it calculates a weighted sum of all sample values).
I already wrote a <a href="https://aakinshin.net/posts/trimmed-hdqe/">post about the trimmed Harrell-Davis quantile estimator</a>:
this approach suggests dropping some of the low-weight sample values to improve robustness
(keeping good statistical efficiency).
I also performed numerical simulations that <a href="https://aakinshin.net/posts/wthdqe-efficiency/">compare the efficiency</a>
of the original Harrell-Davis quantile estimator against its trimmed and winsorized modifications.
It’s time to discuss how to choose the optimal trimming threshold
and how it affects the estimator efficiency.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/thdqe-threshold/img/LightAndHeavy__N40_Efficiency-light.png" target="_blank" class="imgldlink" alt="LightAndHeavy__N40_Efficiency">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/thdqe-threshold/img/LightAndHeavy__N40_Efficiency-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/thdqe-threshold/img/LightAndHeavy__N40_Efficiency-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/thdqe-threshold/img/LightAndHeavy__N40_Efficiency-light.png">
</picture>
</a>
</div>
</div>
<br />Estimating quantile confidence intervals: Maritz-Jarrett vs. jackknifehttps://aakinshin.net/posts/maritz-jarrett-vs-jackknife/Tue, 13 Jul 2021 00:00:00 +0000https://aakinshin.net/posts/maritz-jarrett-vs-jackknife/<p>When it comes to estimating quantiles of the given sample,
my estimator of choice is the Harrell-Davis quantile estimator
(to be more specific, its <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed version</a>).
If I need to get a confidence interval for the obtained quantiles,
I use the <a href="https://aakinshin.net/posts/weighted-quantiles-ci/#the-maritz-jarrett-method">Maritz-Jarrett method</a>
because it provides a <a href="https://aakinshin.net/posts/quantile-ci-coverage/">decent coverage percentage</a>.
Both approaches work pretty nicely together.</p>
<p>However, in the original paper by <a href="https://doi.org/10.2307/2335999">Harrell and Davis (1982)</a>,
the authors suggest using the jackknife variance estimator in order to get the confidence intervals.
The obvious question here is which approach is better: the Maritz-Jarrett method or the jackknife estimator?
In this post, I perform a numerical simulation that compares both techniques using different distributions.</p>Using Kish's effective sample size with weighted quantileshttps://aakinshin.net/posts/kish-ess-weighted-quantiles/Tue, 06 Jul 2021 00:00:00 +0000https://aakinshin.net/posts/kish-ess-weighted-quantiles/<p>In my previous posts, I described how to calculate
<a href="https://aakinshin.net/posts/weighted-quantiles/">weighted quantiles</a> and
their <a href="https://aakinshin.net/posts/weighted-quantiles-ci/">confidence intervals</a>
using the Harrell-Davis quantile estimator.
This powerful technique allows applying
<a href="https://aakinshin.net/posts/quantile-exponential-smoothing/">quantile exponential smoothing</a> and
<a href="https://aakinshin.net/posts/dispersion-exponential-smoothing/">dispersion exponential smoothing</a> for
time series in order to get its moving properties.</p>
<p>When we work with weighted samples, we need a way to calculate the
<a href="https://en.wikipedia.org/wiki/Effective_sample_size">effective sample size</a>.
Previously, I used the sum of all weights normalized by the maximum weight.
In most cases, it worked OK.</p>
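<p>The normalization described above can be sketched next to Kish's formula,
a widely used definition of the effective sample size: the squared sum of the weights divided by the sum of the squared weights.
The function names and the example weights are illustrative assumptions, not code from the original post.</p>

```python
def ess_max_normalized(weights):
    """Effective sample size as the sum of weights divided by the maximum weight."""
    return sum(weights) / max(weights)


def ess_kish(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical exponentially decreasing weights: 1, 0.5, 0.25, 0.125, 0.0625
weights = [0.5 ** i for i in range(5)]

print(ess_max_normalized(weights))  # 1.9375
print(ess_kish(weights))            # approximately 2.818 (exactly 31/11)
```

<p>For equal weights, both definitions return the plain sample size; they diverge as the weights become more uneven.</p>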
<p>Recently, <a href="https://www.soz.unibe.ch/about_us/people/prof_dr_jann_ben/index_eng.html">Ben Jann</a> pointed out
that it would be better to use Kish’s formula to calculate the effective sample size.
In this post, you will find the formula and a few numerical simulations that illustrate the actual impact of
the underlying sample size formula.</p>Partial binning compression of performance serieshttps://aakinshin.net/posts/partial-binning-compression/Tue, 29 Jun 2021 00:00:00 +0000https://aakinshin.net/posts/partial-binning-compression/<p>Let’s start with a problem from real life.
Imagine we have thousands of application components that should be initialized.
We care about the total initialization time of the whole application,
so we want to automatically track the slowest components using a continuous integration (CI) system.
The easiest way to do it is to measure the initialization time of each component in each CI build
and save all the measurements to a database.
Unfortunately, if the total number of components is huge, the overall artifact size may be quite extensive.
Thus, this approach may introduce an unwanted negative impact on the database size and data processing time.</p>
<p>However, we don’t actually need all the measurements.
We want to track only the slowest components.
Typically, it’s possible to introduce a reasonable threshold that defines such components.
For example, we can say that all components that are initialized in less than 1ms are “fast enough,”
so there is no need to know the exact initialization time for them.
Since these time values are insignificant, we can just omit all the measurements below the given threshold.
This allows us to significantly reduce the data traffic without losing any important information.</p>
<p>The suggested trick can be named <em>partial binning compression</em>.
Indeed, we introduce a single bin (perform <em>binning</em>) and
omit all the values inside this bin (perform <em>compression</em>).
On the other hand, we don’t build an honest histogram since we keep all the raw values outside the given bin
(the binning is <em>partial</em>).</p>
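<p>A minimal sketch of the idea, assuming the measurements arrive as a mapping from component name to initialization time in milliseconds.
The function name, the component names, and the choice to keep only a count of the omitted values are hypothetical.</p>

```python
def compress(measurements, threshold_ms):
    """Keep raw values at or above the threshold; replace the rest with a count.

    The single bin [0, threshold_ms) is the "binning" part; dropping the values
    that fall into it is the "compression" part.
    """
    kept = {name: t for name, t in measurements.items() if t >= threshold_ms}
    omitted_count = len(measurements) - len(kept)
    return kept, omitted_count


# Hypothetical initialization times in milliseconds
measurements = {"Logger": 0.2, "Cache": 0.7, "Indexer": 15.0, "Ui": 42.5}
kept, omitted_count = compress(measurements, threshold_ms=1.0)

print(kept)           # {'Indexer': 15.0, 'Ui': 42.5}
print(omitted_count)  # 2
```

<p>Storing the number of omitted values is one possible design choice: it keeps the total component count recoverable without storing the individual fast measurements.</p>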
<p>Let’s discuss a few aspects of using partial binning compression.</p>Calculating gamma effect size for samples with zero median absolute deviationhttps://aakinshin.net/posts/zero-mad-gamma-es/Tue, 22 Jun 2021 00:00:00 +0000https://aakinshin.net/posts/zero-mad-gamma-es/<p>In previous posts, I discussed the <a href="https://aakinshin.net/posts/nonparametric-effect-size/">gamma effect size</a>
which is a Cohen’s d-consistent nonparametric and robust measure of the effect size.
Also, I discussed <a href="https://aakinshin.net/posts/nonparametric-effect-size2/">various ways to customize this metric</a>
and adjust it to different kinds of business requirements.
In this post, I want to briefly cover one more corner case that requires special adjustments.
We are going to discuss the situation when the median absolute deviation is zero.</p>Discrete performance distributionshttps://aakinshin.net/posts/discrete-performance-distributions/Tue, 15 Jun 2021 00:00:00 +0000https://aakinshin.net/posts/discrete-performance-distributions/<p>When we collect software performance measurements, we get a bunch of time intervals.
Typically, we tend to interpret time values as continuous values.
However, the obtained values are actually discrete due to the limited resolution of our measurement tool.
In simple cases, we can treat these discrete values as continuous and get meaningful results.
Unfortunately, discretization may produce strange phenomena like pseudo-multimodality or zero dispersion.
If we want to set up a reliable system that automatically analyzes such distributions,
we should be aware of such problems so that we can handle them correctly.</p>
<p>In this post, I want to share a few discretization problems in real-life performance data sets
(based on the <a href="https://www.jetbrains.com/rider/">Rider</a> performance tests).</p>Customization of the nonparametric Cohen's d-consistent effect sizehttps://aakinshin.net/posts/nonparametric-effect-size2/Tue, 08 Jun 2021 00:00:00 +0000https://aakinshin.net/posts/nonparametric-effect-size2/<p>One year ago, I published a post called <a href="https://aakinshin.net/posts/nonparametric-effect-size/">Nonparametric Cohen's d-consistent effect size</a>.
During this year, I got a lot of internal and external feedback from
my own statistical experiments and
<a href="https://twitter.com/ViljamiSairanen/status/1400457118340108293">people</a>
<a href="https://sherbold.github.io/autorank/autorank/">who</a>
<a href="https://github.com/Ramon-Diaz/Thesis-Project/blob/85df6b11050c7e05c4394d873585f701a7e3f32e/_util.py#L100">tried</a>
to use the suggested approach.
It seems that the nonparametric version of Cohen’s d works much better with real-life not-so-normal data.
While the classic Cohen’s d based on
the non-robust arithmetic mean and
the <a href="https://aakinshin.net/posts/misleading-stddev/">non-robust standard deviation</a>
can be easily <a href="https://aakinshin.net/posts/cohend-and-outliers/">corrupted by a single outlier</a>,
my approach is much more resistant to unexpected extreme values.
Also, it allows exploring
<a href="https://aakinshin.net/posts/comparing-distributions-using-gamma-es/">the difference between specific quantiles of considered samples</a>,
which can be useful in the non-parametric case.</p>
<p>However, I wasn’t satisfied with the results of all of my experiments.
While I still like the basic idea
(replace the mean with the median; replace the standard deviation with the median absolute deviation),
it turned out that the final results heavily depend on the quantile estimator used.
To be more specific, the original Harrell-Davis quantile estimator is not always optimal;
in most cases, it’s better to replace it with its <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed</a> modification.
However, the particular choice of the quantile estimator depends on the situation.
Also, the consistency constant for the median absolute deviation
should be adjusted according to the current sample size and the quantile estimator used.
Of course, it also can be replaced by other dispersion estimators
that can be used as consistent estimators of the standard deviation.</p>
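<p>The basic idea (median instead of mean, median absolute deviation instead of standard deviation) can be sketched as follows.
This is a simplified illustration under several assumptions: it uses the plain sample median instead of the Harrell-Davis estimator,
the asymptotic consistency constant 1.4826 for the normal distribution without any sample size adjustment,
and the same pooling rule as the classic Cohen's d. All function names are hypothetical.</p>

```python
import math
import statistics

MAD_CONSISTENCY = 1.4826  # asymptotic constant for the normal distribution (assumption: no finite-sample correction)


def mad(sample):
    """Median absolute deviation scaled to be consistent with the standard deviation."""
    center = statistics.median(sample)
    return MAD_CONSISTENCY * statistics.median([abs(v - center) for v in sample])


def pooled(sx, sy, nx, ny):
    """Pooled dispersion, same rule as in the classic Cohen's d."""
    return math.sqrt(((nx - 1) * sx ** 2 + (ny - 1) * sy ** 2) / (nx + ny - 2))


def cohen_d(x, y):
    s = pooled(statistics.stdev(x), statistics.stdev(y), len(x), len(y))
    return (statistics.mean(y) - statistics.mean(x)) / s


def robust_d(x, y):
    s = pooled(mad(x), mad(y), len(x), len(y))
    return (statistics.median(y) - statistics.median(x)) / s


x = [1, 2, 3, 4, 5, 6, 7]
y = [3, 4, 5, 6, 7, 8, 9]
y_outlier = [3, 4, 5, 6, 7, 8, 900]  # the last value is corrupted

print(cohen_d(x, y), cohen_d(x, y_outlier))    # the classic value changes noticeably
print(robust_d(x, y), robust_d(x, y_outlier))  # the robust value stays the same
```

<p>In this toy example, the single corrupted value changes neither the median nor the median absolute deviation of the second sample, so the robust variant is unaffected.</p>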
<p>In this post, I want to give a brief overview of possible customizations of the suggested metrics.</p>Robust alternative to statistical efficiencyhttps://aakinshin.net/posts/robust-statistical-efficiency/Tue, 01 Jun 2021 00:00:00 +0000https://aakinshin.net/posts/robust-statistical-efficiency/<p>Statistical efficiency is a common measure of the quality of an estimator.
Typically, it’s expressed via the mean square error (<span class="math inline">\(\operatorname{MSE}\)</span>).
For the given estimator <span class="math inline">\(T\)</span> and the true parameter value <span class="math inline">\(\theta\)</span>,
the <span class="math inline">\(\operatorname{MSE}\)</span> can be expressed as follows:</p>
<p><span class="math display">\[\operatorname{MSE}(T) = \operatorname{E}[(T-\theta)^2]
\]</span></p>
<p>In numerical simulations, the <span class="math inline">\(\operatorname{MSE}\)</span> can’t be used as a robust metric
because its breakdown point is zero
(a corruption of a single measurement leads to a corrupted result).
Typically, it’s not a problem for light-tailed distributions.
Unfortunately, in the heavy-tailed case,
the <span class="math inline">\(\operatorname{MSE}\)</span> becomes an unreliable and unreproducible metric
because it can be easily spoiled by a single outlier.</p>
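<p>The zero breakdown point can be illustrated with a toy simulation result.
The sketch below computes the empirical MSE of a set of estimates and, for contrast,
a quantile (here, the median) of the absolute errors, which tolerates the same corruption.
The estimate values and the function names are made up for illustration.</p>

```python
import statistics


def mse(estimates, theta):
    """Empirical mean square error of the estimates around the true value theta."""
    return statistics.mean([(t - theta) ** 2 for t in estimates])


def median_abs_error(estimates, theta):
    """Median of the absolute error distribution."""
    return statistics.median([abs(t - theta) for t in estimates])


theta = 0.0
estimates = [-0.2, -0.1, 0.0, 0.1, 0.2]
corrupted = estimates[:-1] + [1000.0]  # a single corrupted estimate

print(mse(estimates, theta), mse(corrupted, theta))
print(median_abs_error(estimates, theta), median_abs_error(corrupted, theta))  # 0.1 0.1
```

<p>One extreme estimate moves the MSE from about 0.02 to about 200000, while the median absolute error stays at 0.1.</p>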
<p>I suggest an alternative way to compare statistical estimators.
Instead of using non-robust <span class="math inline">\(\operatorname{MSE}\)</span>,
we can use robust quantile estimations of the absolute error distribution.
In this post, I want to share numerical simulations
that show a problem of irreproducible <span class="math inline">\(\operatorname{MSE}\)</span> values
and how they can be replaced by reproducible quantile values.</p>Improving the efficiency of the Harrell-Davis quantile estimator for special cases using custom winsorizing and trimming strategieshttps://aakinshin.net/posts/customized-wthdqe/Tue, 25 May 2021 00:00:00 +0000https://aakinshin.net/posts/customized-wthdqe/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>Let’s say we want to
<strong>estimate the median</strong>
based on a <strong>small sample</strong> (<span class="math inline">\(3 \leq n \leq 7\)</span>)
from a <strong>right-skewed heavy-tailed distribution</strong>
with <strong>high statistical efficiency</strong>.</p>
<p>The traditional median estimator is the most robust estimator, but it’s not the most efficient one.
Typically, the Harrell-Davis quantile estimator provides better efficiency,
but it’s not robust (its breakdown point is zero),
so it may have worse efficiency in the given case.
The <a href="https://aakinshin.net/posts/winsorized-hdqe/">winsorized</a> and <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed</a>
modifications of the Harrell-Davis quantile estimator provide a good trade-off
between efficiency and robustness, but they require a proper winsorizing/trimming rule.
A reasonable choice of such a rule for medium-size samples is based on the highest density interval of the Beta distribution
(as described <a href="https://aakinshin.net/posts/winsorized-hdqe/">here</a>).
Unfortunately, this approach may be suboptimal for small samples.
E.g., if we use the 99% highest density interval to estimate the median,
it starts to trim sample values only for <span class="math inline">\(n \geq 8\)</span>.</p>
<p>In this post, we are going to discuss custom winsorizing/trimming strategies for special cases of the quantile estimation problem.</p>Comparing the efficiency of the Harrell-Davis, Sfakianakis-Verginis, and Navruz-Özdemir quantile estimatorshttps://aakinshin.net/posts/hd-sv-no-efficiency/Tue, 18 May 2021 00:00:00 +0000https://aakinshin.net/posts/hd-sv-no-efficiency/<p>In the previous posts, I discussed the statistical efficiency of different quantile estimators
(<a href="https://aakinshin.net/posts/hdqe-efficiency/">Efficiency of the Harrell-Davis quantile estimator</a> and
<a href="https://aakinshin.net/posts/wthdqe-efficiency/">Efficiency of the winsorized and trimmed Harrell-Davis quantile estimators</a>).</p>
<p>In this post, I continue this research and compare the efficiency of
the Harrell-Davis quantile estimator,
the <a href="https://aakinshin.net/posts/sfakianakis-verginis-quantile-estimator/">Sfakianakis-Verginis quantile estimators</a>, and
the <a href="https://aakinshin.net/posts/navruz-ozdemir-quantile-estimator/">Navruz-Özdemir quantile estimator</a>.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/hd-sv-no-efficiency/img/LightAndHeavy_N10_Efficiency-light.png" target="_blank" class="imgldlink" alt="LightAndHeavy_N10_Efficiency">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/hd-sv-no-efficiency/img/LightAndHeavy_N10_Efficiency-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/hd-sv-no-efficiency/img/LightAndHeavy_N10_Efficiency-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/hd-sv-no-efficiency/img/LightAndHeavy_N10_Efficiency-light.png">
</picture>
</a>
</div>
</div>
<br />Dispersion exponential smoothinghttps://aakinshin.net/posts/dispersion-exponential-smoothing/Tue, 11 May 2021 00:00:00 +0000https://aakinshin.net/posts/dispersion-exponential-smoothing/<p>In the <a href="https://aakinshin.net/posts/quantile-exponential-smoothing/">previous post</a>,
I showed how to apply exponential smoothing to quantiles
using the <a href="https://aakinshin.net/posts/weighted-quantiles/">weighted Harrell-Davis quantile estimator</a>.
This technique allows getting smooth and stable moving median estimations.
In this post, I’m going to discuss how to use the same approach
to estimate moving dispersion.</p>Quantile exponential smoothinghttps://aakinshin.net/posts/quantile-exponential-smoothing/Tue, 04 May 2021 00:00:00 +0000https://aakinshin.net/posts/quantile-exponential-smoothing/<p>One of the popular problems in time series analysis is estimating the moving “average” value.
Let’s define the “average” as a central tendency metric like the mean or the median.
When we talk about the moving value, we assume that we are interested in
the average value “at the end” of the time series
instead of the average of all available observations.</p>
<p>One of the most straightforward approaches to estimate the moving average is the <em>simple moving mean</em>.
Unfortunately, this approach is not robust: outliers can instantly spoil the evaluated mean value.
As an alternative, we can consider the <em>simple moving median</em>.
I already discussed a few such methods:
<a href="https://aakinshin.net/posts/mp2-quantile-estimator/">the MP² quantile estimator</a> and
<a href="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator2/">a moving quantile estimator based on partitioning heaps</a>
(a modification of the Hardle-Steiger method).
When we talk about <em>simple moving averages</em>, we typically assume
that we estimate the average value over the last <span class="math inline">\(k\)</span> observations (<span class="math inline">\(k\)</span> is the <em>window size</em>).
This approach is also known as <em>unweighted moving averages</em> because
all target observations have the same weight.</p>
<p>As an alternative to the simple moving average, we can also consider the <em>weighted moving average</em>.
In this case, we assign a weight for each observation and aggregate the whole time series according to these weights.
A famous example of such a weight function is <em>exponential smoothing</em>.
And the simplest form of exponential smoothing is the <em>exponentially weighted moving mean</em>.
This approach estimates the weighted moving mean using exponentially decreasing weights.
Switching from the simple moving mean to the exponentially weighted moving mean provides some benefits
in terms of smoothness and estimation efficiency.</p>
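<p>To make the difference concrete, here is a minimal sketch of both estimators.
The smoothing factor <code>alpha</code>, the function names, and the toy series are assumptions made for this illustration.</p>

```python
def exponential_weights(n, alpha):
    """Weights for n observations (newest last): alpha * (1 - alpha)^age."""
    return [alpha * (1 - alpha) ** (n - 1 - i) for i in range(n)]


def weighted_mean(values, weights):
    """Weighted mean; dividing by the sum of weights normalizes them to 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def simple_moving_mean(values, k):
    """Unweighted mean over the last k observations (k is the window size)."""
    window = values[-k:]
    return sum(window) / len(window)


series = [10, 10, 10, 10, 20, 20]  # the level shifts from 10 to 20 near the end

print(simple_moving_mean(series, k=3))
print(weighted_mean(series, exponential_weights(len(series), alpha=0.5)))
```

<p>With these inputs, the simple moving mean over the last three observations is 50/3 (about 16.67),
while the exponentially weighted mean is 1110/63 (about 17.62): the newest observations dominate, but older ones still contribute.</p>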
<p>Although exponential smoothing has advantages over the simple moving mean,
it still estimates the mean value which is not robust.
We can improve the robustness of this approach if we reuse the same idea for weighted moving quantiles.
It’s possible because quantiles can also be estimated for weighted samples.
In one of my previous posts, I <a href="https://aakinshin.net/posts/weighted-quantiles/">showed</a> how to adapt
the Hyndman-Fan Type 7 and Harrell-Davis quantile estimators to weighted samples.
In this post, I’m going to show how we can use this technique to estimate
the weighted moving quantiles using exponentially decreasing weights.</p>Improving quantile-respectful density estimation for discrete distributions using jitteringhttps://aakinshin.net/posts/qrde-discrete/Tue, 27 Apr 2021 00:00:00 +0000https://aakinshin.net/posts/qrde-discrete/<p>In my previous posts, I already discussed the <a href="https://aakinshin.net/posts/kde-discrete/">problem</a> that arises
when we try to build the kernel density estimation (KDE) for samples with ties.
We may get such samples in real life from discrete or mixed discrete/continuous distributions.
Even if the original distribution is continuous,
we may observe artificial sample discretization due to a limited resolution of the measuring tool.
Such discretization may lead to inaccurate density plots due to undersmoothing.
The problem can be resolved using a nice technique called <em>jittering</em>.
I also discussed <a href="https://aakinshin.net/posts/discrete-sample-jittering/">how to apply</a> jittering to get a smoother version of KDE.</p>
<p>However, I’m not a huge fan of KDE for two reasons.
The first one is the <a href="https://aakinshin.net/posts/kde-bw/">problem of choosing a proper bandwidth value</a>.
With poorly chosen bandwidth, we can easily get oversmoothing or undersmoothing even without the discretization problem.
The second one is an inconsistency between the KDE-based probability density function and evaluated sample quantiles.
It could lead to inconsistent visualizations (e.g., KDE-based violin plots with non-KDE-based quantile values)
or it could introduce problems for algorithms that require density function and quantile values at the same time.
The inconsistency could be resolved using <a href="https://aakinshin.net/posts/qrde-hd/">quantile-respectful density estimation</a> (QRDE).
This kind of estimation builds the density function which matches the evaluated sample quantiles.
To get a smooth QRDE, we also need a smooth quantile estimator like the Harrell-Davis quantile estimator.
The robustness and computational efficiency of this approach can be improved using
the <a href="https://aakinshin.net/posts/winsorized-hdqe/">winsorized</a> and <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed</a>
modifications of the Harrell-Davis quantile estimator
(which also have a <a href="https://aakinshin.net/posts/wthdqe-efficiency/">decent statistical efficiency level</a>).</p>
<p>Unfortunately, the straightforward QRDE calculation is not always applicable for samples with ties
because it’s impossible to build an “honest” density function for discrete distributions
without using the Dirac delta function.
This is a severe problem for QRDE-based algorithms like the
<a href="https://aakinshin.net/posts/lowland-multimodality-detection/">lowland multimodality detection algorithm</a>.
In this post, I will show how jittering could help to solve this problem and get a smooth QRDE on samples with ties.</p>How to build a smooth density estimation for a discrete sample using jitteringhttps://aakinshin.net/posts/discrete-sample-jittering/Tue, 20 Apr 2021 00:00:00 +0000https://aakinshin.net/posts/discrete-sample-jittering/<p>Let’s say you have a sample with tied values.
If you draw a kernel density estimation (KDE) for such a sample,
you may get a serrated pattern like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/discrete-sample-jittering/img/intro-light.png" target="_blank" class="imgldlink" alt="intro">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/discrete-sample-jittering/img/intro-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/discrete-sample-jittering/img/intro-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/discrete-sample-jittering/img/intro-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>KDE requires samples from continuous distributions
while tied values arise in discrete or mixture distributions.
Even if the original distribution is continuous,
you may observe artificial sample discretization due to a limited resolution of the measuring tool.
This effect may lead to distorted density plots like in the above picture.</p>
<p>The problem could be solved using a nice technique called <em>jittering</em>.
In the simplest case, jittering just adds random noise to each measurement.
Such a trick removes all ties from the sample and allows building a smooth density estimation.</p>
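<p>The simplest case can be sketched as follows: uniform random noise within half of the measurement resolution is added to each value.
The <code>resolution</code> parameter, the function name, and the uniform noise distribution are assumptions made for this illustration.</p>

```python
import random


def jitter(sample, resolution, seed=None):
    """Add uniform noise from [-resolution/2, resolution/2] to each measurement."""
    rng = random.Random(seed)
    half = resolution / 2
    return [v + rng.uniform(-half, half) for v in sample]


sample = [5.60, 7.00, 7.00, 7.00]  # three tied values
jittered = jitter(sample, resolution=0.01, seed=42)

print(len(set(sample)))    # 2 distinct values before jittering
print(len(set(jittered)))  # 4 distinct values after jittering
```

<p>Because the noise never exceeds half of the resolution, each jittered value stays within the interval that its original value represents.</p>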
<p>However, there are many different ways to apply jittering.
The trickiest question here is how to choose proper noise values.
In this post, I want to share one of my favorite jittering approaches.
It generates a non-randomized noise pattern with a low risk of noticeable sample corruption.</p>Kernel density estimation and discrete valueshttps://aakinshin.net/posts/kde-discrete/Tue, 13 Apr 2021 00:00:00 +0000https://aakinshin.net/posts/kde-discrete/<p>Kernel density estimation (KDE) is a popular technique of data visualization.
Based on the given sample, it allows estimating the probability density function (PDF) of the underlying distribution.
Here is an example of KDE for <code>x = {3.82, 4.61, 4.89, 4.91, 5.31, 5.6, 5.66, 7.00, 7.00, 7.00}</code>
(normal kernel, Sheather & Jones bandwidth selector):</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/kde-discrete/img/intro-light.png" target="_blank" class="imgldlink" alt="intro">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/kde-discrete/img/intro-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/kde-discrete/img/intro-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/kde-discrete/img/intro-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>KDE is a simple and straightforward way to build a PDF, but it’s not always the best one.
In addition to my <a href="https://aakinshin.net/posts/kde-bw/">concerns about bandwidth selection</a>,
continuous use of KDE creates an illusion that all distributions are smooth and continuous.
In practice, it’s not always true.</p>
<p>In the above picture, the distribution looks pretty continuous.
However, the picture hides the fact that we have three <code>7.00</code> elements in the original sample.
With continuous distributions, the probability of getting tied observations (that have the same value) is almost zero.
If a sample contains ties, we are most likely working with
either a discrete distribution or a mixture of discrete and continuous distributions.
A KDE for such a sample may significantly differ from the actual PDF.
Thus, this technique may mislead us instead of providing insights about the true underlying distribution.</p>
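<p>Before trusting a density plot, it may be worth checking the sample for ties.
A minimal sketch of such a check for the sample above (the function name is hypothetical):</p>

```python
from collections import Counter


def tied_values(sample):
    """Return the values that occur more than once, with their counts."""
    counts = Counter(sample)
    return {value: count for value, count in counts.items() if count > 1}


x = [3.82, 4.61, 4.89, 4.91, 5.31, 5.6, 5.66, 7.00, 7.00, 7.00]
print(tied_values(x))  # {7.0: 3}
```

<p>A non-empty result hints that the sample may come from a discrete or mixed distribution, or that the measurement tool has a limited resolution.</p>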
<p>In this post, we discuss the usage of PDF and PMF with continuous and discrete distributions.
Also, we look at examples of corrupted density estimation plots for distributions with discrete features.</p>Efficiency of the winsorized and trimmed Harrell-Davis quantile estimatorshttps://aakinshin.net/posts/wthdqe-efficiency/Tue, 06 Apr 2021 00:00:00 +0000https://aakinshin.net/posts/wthdqe-efficiency/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In previous posts, I suggested two modifications of the Harrell-Davis quantile estimator:
<a href="https://aakinshin.net/posts/winsorized-hdqe/">winsorized</a> and <a href="https://aakinshin.net/posts/trimmed-hdqe/">trimmed</a>.
Both modifications have a higher level of robustness in comparison to the original estimator.
Also, I <a href="https://aakinshin.net/posts/hdqe-efficiency/">discussed</a> the <a href="https://en.wikipedia.org/wiki/Efficiency_(statistics)">efficiency</a>
of the Harrell-Davis quantile estimator.
In this post, I’m going to continue numerical simulation and estimate the efficiency of
the winsorized and trimmed modifications.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/wthdqe-efficiency/img/LightAndHeavy_N10_Efficiency-light.png" target="_blank" class="imgldlink" alt="LightAndHeavy_N10_Efficiency">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/wthdqe-efficiency/img/LightAndHeavy_N10_Efficiency-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/wthdqe-efficiency/img/LightAndHeavy_N10_Efficiency-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/wthdqe-efficiency/img/LightAndHeavy_N10_Efficiency-light.png">
</picture>
</a>
</div>
</div>
<br />Trimmed modification of the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/trimmed-hdqe/Tue, 30 Mar 2021 00:00:00 +0000https://aakinshin.net/posts/trimmed-hdqe/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>In one of <a href="https://aakinshin.net/posts/winsorized-hdqe/">the previous posts</a>, I discussed the winsorized Harrell-Davis quantile estimator.
This estimator is more robust than the classic Harrell-Davis quantile estimator.
In this post, I want to suggest another modification that may be better for some corner cases:
the <em>trimmed</em> Harrell-Davis quantile estimator.</p>Efficiency of the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/hdqe-efficiency/Tue, 23 Mar 2021 00:00:00 +0000https://aakinshin.net/posts/hdqe-efficiency/<p>One of the most essential properties of a quantile estimator is
its <a href="https://en.wikipedia.org/wiki/Efficiency_(statistics)">efficiency</a>.
In simple terms, efficiency describes the accuracy of an estimator.
The Harrell-Davis quantile estimator is a good option to achieve higher efficiency.
However, this estimator may provide lower efficiency in some special cases.
In this post, we will conduct a set of simulations that show the actual efficiency numbers.
We compare different distributions (symmetric and right-skewed, heavy-tailed and light-tailed),
quantiles, and sample sizes.</p>Navruz-Özdemir quantile estimatorhttps://aakinshin.net/posts/navruz-ozdemir-quantile-estimator/Tue, 16 Mar 2021 00:00:00 +0000https://aakinshin.net/posts/navruz-ozdemir-quantile-estimator/<p>The Navruz-Özdemir quantile estimator
suggests the following equation to estimate the <span class="math inline">\(p^\textrm{th}\)</span> quantile of sample <span class="math inline">\(X\)</span>:</p>
<p><span class="math display">\[\begin{split}
\operatorname{NO}_p =
& \Big( (3p-1)X_{(1)} + (2-3p)X_{(2)} - (1-p)X_{(3)} \Big) B_0 +\\
& +\sum_{i=1}^n \Big((1-p)B_{i-1}+pB_i\Big)X_{(i)} +\\
& +\Big( -pX_{(n-2)} + (3p-1)X_{(n-1)} + (2-3p)X_{(n)} \Big) B_n
\end{split}
\]</span></p>
<p>where <span class="math inline">\(B_i = B(i; n, p)\)</span> is the probability mass function of the binomial distribution <span class="math inline">\(B(n, p)\)</span>,
and <span class="math inline">\(X_{(i)}\)</span> are the order statistics of sample <span class="math inline">\(X\)</span>.</p>
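<p>The equation above can be transcribed almost term by term.
The following Python sketch is only an illustration, not the reference implementation from the post:
the names <code>binom_pmf</code> and <code>navruz_ozdemir</code> are hypothetical,
and it assumes a sample with at least three elements.</p>

```python
from math import comb


def binom_pmf(i, n, p):
    """B(i; n, p): probability of i successes in n Bernoulli(p) trials."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def navruz_ozdemir(sample, p):
    """Direct transcription of the NO_p equation (hypothetical helper)."""
    x = sorted(sample)  # order statistics; X_(i) is x[i - 1]
    n = len(x)
    if not n >= 3:
        raise ValueError("the equation needs at least three observations")
    b = [binom_pmf(i, n, p) for i in range(n + 1)]

    # Boundary term that multiplies B_0.
    left = ((3 * p - 1) * x[0] + (2 - 3 * p) * x[1] - (1 - p) * x[2]) * b[0]
    # Main sum over all order statistics.
    middle = sum(((1 - p) * b[i - 1] + p * b[i]) * x[i - 1] for i in range(1, n + 1))
    # Boundary term that multiplies B_n.
    right = (-p * x[n - 3] + (3 * p - 1) * x[n - 2] + (2 - 3 * p) * x[n - 1]) * b[n]
    return left + middle + right


print(navruz_ozdemir([5, 1, 3, 2, 4], 0.5))
```

<p>The weights of all three terms sum to one, so a constant sample is mapped to the same constant,
which is a handy sanity check for any transcription of the equation.</p>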
<p>In this post, I derive these equations following the paper
<a href="https://doi.org/10.1111/bmsp.12198">“A new quantile estimator with weights based on a subsampling approach”</a> (2020)
by Gözde Navruz and A. Fırat Özdemir.
Also, I add some additional explanations,
simplify the final equation,
and provide reference implementations in C# and R.</p>Sfakianakis-Verginis quantile estimatorhttps://aakinshin.net/posts/sfakianakis-verginis-quantile-estimator/Tue, 09 Mar 2021 00:00:00 +0000https://aakinshin.net/posts/sfakianakis-verginis-quantile-estimator/<p>There are dozens of different ways to estimate quantiles.
One of these ways is to use the Sfakianakis-Verginis quantile estimator.
To be more specific, it’s a family of three estimators.
If we want to estimate the <span class="math inline">\(p^\textrm{th}\)</span> quantile of sample <span class="math inline">\(X\)</span>,
we can use one of the following equations:</p>
<p><span class="math display">\[\begin{split}
\operatorname{SV1}_p =&
\frac{B_0}{2} \big( X_{(1)}+X_{(2)}-X_{(3)} \big) +
\sum_{i=1}^{n} \frac{B_i+B_{i-1}}{2} X_{(i)} +
\frac{B_n}{2} \big(- X_{(n-2)}+X_{(n-1)}+X_{(n)} \big),\\
\operatorname{SV2}_p =& \sum_{i=1}^{n} B_{i-1} X_{(i)} + B_n \cdot \big(2X_{(n)} - X_{(n-1)}\big),\\
\operatorname{SV3}_p =& \sum_{i=1}^n B_i X_{(i)} + B_0 \cdot \big(2X_{(1)}-X_{(2)}\big).
\end{split}
\]</span></p>
<p>where <span class="math inline">\(B_i = B(i; n, p)\)</span> is the probability mass function of the binomial distribution <span class="math inline">\(B(n, p)\)</span>,
and <span class="math inline">\(X_{(i)}\)</span> are the order statistics of sample <span class="math inline">\(X\)</span>.</p>
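<p>As a rough illustration, the three equations can be sketched in Python as follows.
The function names are hypothetical, and this is not the reference implementation from the post.
One assumption is made explicitly: the last bracket of <span class="math inline">\(\operatorname{SV1}_p\)</span>
is taken as <span class="math inline">\(-X_{(n-2)}+X_{(n-1)}+X_{(n)}\)</span>,
the mirror image of the first bracket, so that the weights sum to one.</p>

```python
from math import comb


def binom_pmf(i, n, p):
    """B(i; n, p): probability of i successes in n Bernoulli(p) trials."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def _prepare(sample, p):
    x = sorted(sample)  # order statistics; X_(i) is x[i - 1]
    n = len(x)
    if not n >= 3:
        raise ValueError("the equations need at least three observations")
    return x, n, [binom_pmf(i, n, p) for i in range(n + 1)]


def sv1(sample, p):
    x, n, b = _prepare(sample, p)
    left = b[0] / 2 * (x[0] + x[1] - x[2])
    middle = sum((b[i] + b[i - 1]) / 2 * x[i - 1] for i in range(1, n + 1))
    # Assumption: "+ X_(n)" in the last bracket (mirror of the first bracket).
    right = b[n] / 2 * (-x[n - 3] + x[n - 2] + x[n - 1])
    return left + middle + right


def sv2(sample, p):
    x, n, b = _prepare(sample, p)
    return sum(b[i - 1] * x[i - 1] for i in range(1, n + 1)) + b[n] * (2 * x[n - 1] - x[n - 2])


def sv3(sample, p):
    x, n, b = _prepare(sample, p)
    return sum(b[i] * x[i - 1] for i in range(1, n + 1)) + b[0] * (2 * x[0] - x[1])


data = [1, 2, 3, 4, 5]
print(sv1(data, 0.5), sv2(data, 0.5), sv3(data, 0.5))
```

<p>On a constant sample, all three functions return that constant, which confirms that each set of weights sums to one.</p>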
<p>In this post, I derive these equations following the paper
<a href="https://doi.org/10.1080/03610910701790491">“A new family of nonparametric quantile estimators”</a> (2008)
by Michael E. Sfakianakis and Dimitris G. Verginis.
Also, I add some additional explanations,
reconstruct missing steps,
simplify the final equations,
and provide reference implementations in C# and R.</p>Winsorized modification of the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/winsorized-hdqe/Tue, 02 Mar 2021 00:00:00 +0000https://aakinshin.net/posts/winsorized-hdqe/<p><strong>Update: this blog post is a part of research that aimed to build a statistically efficient and robust quantile estimator.
A <a href="https://aakinshin.net/posts/pub-thdqe/">paper with final results</a> is available in <em>Communications in Statistics - Simulation and Computation</em> (DOI: <a href="https://www.tandfonline.com/doi/abs/10.1080/03610918.2022.2050396">10.1080/03610918.2022.2050396</a>).
A preprint is available on arXiv: <a href="https://arxiv.org/abs/2111.11776">arXiv:2111.11776 [stat.ME]</a>.</strong></p>
<p>The Harrell-Davis quantile estimator is one of my favorite quantile estimators
because of its <a href="https://en.wikipedia.org/wiki/Efficiency_(statistics)">efficiency</a>.
It has a small mean squared error, which makes its estimations accurate.
However, it has a severe drawback: it’s not robust.
Indeed, since the estimator includes all sample elements with positive weights,
its <a href="https://en.wikipedia.org/wiki/Robust_statistics#Breakdown_point">breakdown point</a> is zero.</p>
<p>In this post, I want to suggest a modification of the Harrell-Davis quantile estimator
which increases its <em>robustness</em> keeping almost the same level of <em>efficiency</em>.</p>Misleading standard deviationhttps://aakinshin.net/posts/misleading-stddev/Tue, 23 Feb 2021 00:00:00 +0000https://aakinshin.net/posts/misleading-stddev/<p>The <a href="https://en.wikipedia.org/wiki/Standard_deviation">standard deviation</a> may be an extremely misleading metric.
Even minor deviations from normality could make it completely unreliable and deceiving.
Let me demonstrate this problem using an example.</p>
<p>Below you can see three density plots of some distributions.
Could you guess their standard deviations?</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/misleading-stddev/img/density1-light.png" target="_blank" class="imgldlink" alt="density1">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/misleading-stddev/img/density1-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/misleading-stddev/img/density1-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/misleading-stddev/img/density1-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>The correct answers are <span class="math inline">\(1.0, 3.0, 11.0\)</span>.
And here is a more challenging problem: could you match these values with the corresponding distributions?</p>Unbiased median absolute deviation based on the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/unbiased-mad-hd/Tue, 16 Feb 2021 00:00:00 +0000https://aakinshin.net/posts/unbiased-mad-hd/<p>The <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (<span class="math inline">\(\textrm{MAD}\)</span>)
is a robust measure of scale.
In the previous post, I <a href="https://aakinshin.net/posts/unbiased-mad/">showed</a>
how to use the <a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">unbiased</a>
version of the <span class="math inline">\(\textrm{MAD}\)</span> estimator
as a robust alternative to the standard deviation.
“Unbiasedness” means that such an estimator’s expected value equals the true value of the standard deviation.
Unfortunately, there is such a thing as the <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias–variance tradeoff</a>:
when we remove the bias of the <span class="math inline">\(\textrm{MAD}\)</span> estimator,
we increase its variance and mean squared error (<span class="math inline">\(\textrm{MSE}\)</span>).</p>
<p>In this post, I want to suggest a more <a href="https://en.wikipedia.org/wiki/Efficiency_(statistics)">efficient</a>
unbiased <span class="math inline">\(\textrm{MAD}\)</span> estimator.
It’s also a consistent estimator for the standard deviation, but it has a smaller <span class="math inline">\(\textrm{MSE}\)</span>.
To build this estimator,
we should replace the classic “straightforward” median estimator with the Harrell-Davis quantile estimator
and adjust bias-correction factors.
Let’s discuss this approach in detail.</p>Unbiased median absolute deviationhttps://aakinshin.net/posts/unbiased-mad/Tue, 09 Feb 2021 00:00:00 +0000https://aakinshin.net/posts/unbiased-mad/<p>The <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (<span class="math inline">\(\textrm{MAD}\)</span>)
is a robust measure of scale.
For distribution <span class="math inline">\(X\)</span>, it can be calculated as follows:</p>
<p><span class="math display">\[\textrm{MAD} = C \cdot \textrm{median}(|X - \textrm{median}(X)|)
\]</span></p>
<p>where <span class="math inline">\(C\)</span> is a constant scale factor.
This metric can be used as a robust alternative to the standard deviation.
If we want to use the <span class="math inline">\(\textrm{MAD}\)</span> as a <a href="https://en.wikipedia.org/wiki/Consistent_estimator">consistent estimator</a>
for the standard deviation under the normal distribution,
we should set</p>
<p><span class="math display">\[C = C_{\infty} = \dfrac{1}{\Phi^{-1}(3/4)} \approx 1.4826022185056.
\]</span></p>
<p>where <span class="math inline">\(\Phi^{-1}\)</span> is the quantile function of the standard normal distribution
(or the inverse of the cumulative distribution function).
If <span class="math inline">\(X\)</span> is the normal distribution, we get <span class="math inline">\(\textrm{MAD} = \sigma\)</span> where <span class="math inline">\(\sigma\)</span> is the standard deviation.</p>
<p>Now let’s consider a sample <span class="math inline">\(x = \{ x_1, x_2, \ldots x_n \}\)</span>.
Let’s denote the median absolute deviation for a sample of size <span class="math inline">\(n\)</span> as <span class="math inline">\(\textrm{MAD}_n\)</span>.
The corresponding equation looks similar to the definition of <span class="math inline">\(\textrm{MAD}\)</span> for a distribution:</p>
<p><span class="math display">\[\textrm{MAD}_n = C_n \cdot \textrm{median}(|x - \textrm{median}(x)|).
\]</span></p>
<p>Let’s assume that <span class="math inline">\(\textrm{median}\)</span> is the straightforward definition of the median
(if <span class="math inline">\(n\)</span> is odd, the median is the middle element of the sorted sample,
if <span class="math inline">\(n\)</span> is even, the median is the arithmetic average of the two middle elements of the sorted sample).
We still can use <span class="math inline">\(C_n = C_{\infty}\)</span> for extremely large sample sizes.
However, for small <span class="math inline">\(n\)</span>, <span class="math inline">\(\textrm{MAD}_n\)</span> becomes a <a href="https://en.wikipedia.org/wiki/Bias_of_an_estimator">biased estimator</a>.
If we want to get an unbiased version, we should adjust the value of <span class="math inline">\(C_n\)</span>.</p>
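<p>For reference, here is a minimal Python sketch of <span class="math inline">\(\textrm{MAD}_n\)</span>
with the asymptotic factor <span class="math inline">\(C_{\infty}\)</span> and the straightforward median.
The names <code>C_INF</code> and <code>mad</code> are hypothetical, and no small-sample bias correction is applied here.</p>

```python
from statistics import NormalDist, median

# C_inf = 1 / Phi^{-1}(3/4), the consistency factor under normality.
C_INF = 1 / NormalDist().inv_cdf(0.75)


def mad(sample, scale=C_INF):
    """Scaled median absolute deviation around the sample median."""
    center = median(sample)
    return scale * median(abs(v - center) for v in sample)


print(C_INF)
print(mad([1, 2, 3, 4, 100]))  # the outlier 100 barely affects the result
```

<p>Passing a different <code>scale</code> value is the place where a bias-corrected <span class="math inline">\(C_n\)</span> would be plugged in.</p>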
<p>In this post, we look at the possible approaches and learn the way to get the exact value of <span class="math inline">\(C_n\)</span>
that makes <span class="math inline">\(\textrm{MAD}_n\)</span> an unbiased estimator of the median absolute deviation for any <span class="math inline">\(n\)</span>.</p>Comparing distribution quantiles using gamma effect sizehttps://aakinshin.net/posts/comparing-distributions-using-gamma-es/Tue, 02 Feb 2021 00:00:00 +0000https://aakinshin.net/posts/comparing-distributions-using-gamma-es/<p>There are several ways to describe the difference between two distributions.
Here are a few examples:</p>
<ul>
<li>Effect sizes based on differences between means (e.g., Cohen’s d, Glass’ Δ, Hedges’ g)</li>
<li><a href="https://aakinshin.net/posts/shift-and-ratio-functions/">The shift and ratio functions</a> that
estimate differences between matched quantiles.</li>
</ul>
<p>In one of the previous posts, I <a href="https://aakinshin.net/posts/nonparametric-effect-size/">described</a>
the gamma effect size which is defined not for the mean but for quantiles.
In this post, I want to share a few case studies that demonstrate
how the suggested metric combines the advantages of the above approaches.</p>A single outlier could completely distort your Cohen's d valuehttps://aakinshin.net/posts/cohend-and-outliers/Tue, 26 Jan 2021 00:00:00 +0000https://aakinshin.net/posts/cohend-and-outliers/<p><a href="https://en.wikipedia.org/wiki/Effect_size#Cohen's_d">Cohen’s d</a> is a popular way to estimate
the <a href="https://en.wikipedia.org/wiki/Effect_size">effect size</a> between two samples.
It works well for perfectly normal distributions.
Usually, people think that slight deviations from normality
shouldn’t produce a noticeable impact on the result.
Unfortunately, it’s not always true.
In fact, a single outlier value can completely distort the result even in large samples.</p>
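<p>A small Python sketch makes the effect visible.
It assumes the pooled-standard-deviation form of Cohen’s d;
the function name <code>cohen_d</code> and the toy samples are hypothetical and chosen only for illustration.</p>

```python
from math import sqrt
from statistics import mean, variance


def cohen_d(x, y):
    """Cohen's d with the pooled standard deviation (assumed definition)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(y) - mean(x)) / sqrt(pooled_var)


x = list(range(1, 11))             # 1, 2, ..., 10
y = [v + 3 for v in x]             # the same sample shifted by 3
y_outlier = y[:-1] + [1000]        # one value replaced by an extreme outlier

print(cohen_d(x, y))          # about 0.99
print(cohen_d(x, y_outlier))  # noticeably smaller, although the mean moved up
```

<p>The outlier inflates the pooled standard deviation much faster than it shifts the mean, so the effect size shrinks.</p>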
<p>In this post, I will present some illustrations for this problem and will show how to fix it.</p>Better moving quantile estimations using the partitioning heapshttps://aakinshin.net/posts/partitioning-heaps-quantile-estimator2/Tue, 19 Jan 2021 00:00:00 +0000https://aakinshin.net/posts/partitioning-heaps-quantile-estimator2/<p>In one of the previous posts, I <a href="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/">have discussed</a> the Hardle-Steiger method.
This algorithm allows estimating <a href="https://en.wikipedia.org/wiki/Moving_average#Moving_median">the moving median</a>
using <span class="math inline">\(O(L)\)</span> memory and <span class="math inline">\(O(log(L))\)</span> element processing complexity (where <span class="math inline">\(L\)</span> is the window size).
Also, I have shown how to adapt this approach to estimate <em>any</em> moving quantile.</p>
<p>In this post, I’m going to present further improvements.
The Hardle-Steiger method always returns an <a href="https://en.wikipedia.org/wiki/Order_statistic">order statistic</a>,
which is the <span class="math inline">\(k\textrm{th}\)</span> smallest element of the sample.
It means that the estimated quantile value always equals one of the last <span class="math inline">\(L\)</span> observed numbers.
However, many of the classic quantile estimators use two elements.
For example, if we want to estimate the median for <span class="math inline">\(x = \{4, 5, 6, 7\}\)</span>,
some estimators return <span class="math inline">\(5.5\)</span> (which is the arithmetical mean of <span class="math inline">\(5\)</span> and <span class="math inline">\(6\)</span>)
instead of <span class="math inline">\(5\)</span> or <span class="math inline">\(6\)</span> (which are order statistics).</p>
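<p>The difference is easy to reproduce with the Python standard library,
where <code>median</code> averages the two middle elements
while <code>median_low</code> and <code>median_high</code> return order statistics:</p>

```python
from statistics import median, median_high, median_low

x = [4, 5, 6, 7]

print(median(x))       # 5.5, the mean of the two middle elements
print(median_low(x))   # 5, an order statistic
print(median_high(x))  # 6, an order statistic
```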
<p>Let’s learn how to implement a moving version of such estimators using
the partitioning heaps from the Hardle-Steiger method.</p>MP² quantile estimator: estimating the moving median without storing valueshttps://aakinshin.net/posts/mp2-quantile-estimator/Tue, 12 Jan 2021 00:00:00 +0000https://aakinshin.net/posts/mp2-quantile-estimator/<p>In one of the previous posts, I <a href="https://aakinshin.net/posts/p2-quantile-estimator/">described</a> the P² quantile estimator.
It allows estimating quantiles on a stream of numbers without storing them.
Such sequential (streaming/online) quantile estimators are useful in software telemetry because
they help to evaluate the median and other distribution quantiles without a noticeable memory footprint.</p>
<p>After the publication, I got a lot of questions about <em>moving</em> sequential quantile estimators.
Such estimators return quantile values not for the whole stream of numbers,
but only for the recent values.
So, I <a href="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/">wrote</a> another post about
a quantile estimator based on partitioning heaps (inspired by the Hardle-Steiger method).
This algorithm gives you the exact value of any order statistic for the last <span class="math inline">\(L\)</span> numbers
(<span class="math inline">\(L\)</span> is known as the window size).
However, it requires <span class="math inline">\(O(L)\)</span> memory, and it takes <span class="math inline">\(O(log(L))\)</span> time to process each element.
This may be acceptable in some cases.
Unfortunately, it doesn’t allow implementing low-overhead telemetry in the case of large <span class="math inline">\(L\)</span>.</p>
<p>In this post, I’m going to present a moving modification of the P² quantile estimator.
Let’s call it MP² (moving P²).
It requires <span class="math inline">\(O(1)\)</span> memory, it takes <span class="math inline">\(O(1)\)</span> to process each element,
and it supports windows of any size.
Of course, we have a trade-off with the estimation accuracy:
it returns a quantile approximation instead of the exact order statistics.
However, in most cases, the MP² estimations are pretty accurate from the practical point of view.</p>
<p>Let’s discuss MP² in detail!</p>Case study: Accuracy of the MAD estimation using the Harrell-Davis quantile estimator (Gumbel distribution)https://aakinshin.net/posts/cs-mad-hd-gumbel/Tue, 05 Jan 2021 00:00:00 +0000https://aakinshin.net/posts/cs-mad-hd-gumbel/<p>In some of my previous posts, I used
the <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (MAD)
to describe the distribution dispersion:</p>
<ul>
<li><a href="https://aakinshin.net/posts/harrell-davis-double-mad-outlier-detector/">DoubleMAD outlier detector based on the Harrell-Davis quantile estimator</a></li>
<li><a href="https://aakinshin.net/posts/nonparametric-effect-size/">Nonparametric Cohen’s d-consistent effect size</a></li>
<li><a href="https://aakinshin.net/posts/qad/">Quantile absolute deviation: estimating statistical dispersion around quantiles</a></li>
</ul>
<p>The MAD estimation depends on the chosen median estimator:
we may get different MAD values with different median estimators.
To get better accuracy,
I always encourage readers to use the Harrell-Davis quantile estimator
instead of the classic Type 7 quantile estimator.</p>
<p>In this case study, I decided to compare these two quantile estimators using
the <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a>
(it’s a good model for slightly right-skewed distributions).
According to the performed Monte Carlo simulation,
the Harrell-Davis quantile estimator always has better accuracy:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/cs-mad-hd-gumbel/img/summary-light.png" target="_blank" class="imgldlink" alt="summary">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/cs-mad-hd-gumbel/img/summary-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/cs-mad-hd-gumbel/img/summary-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/cs-mad-hd-gumbel/img/summary-light.png">
</picture>
</a>
</div>
</div>
<br />Fast implementation of the moving quantile based on the partitioning heapshttps://aakinshin.net/posts/partitioning-heaps-quantile-estimator/Tue, 29 Dec 2020 00:00:00 +0000https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/<p>Imagine you have a time series.
Let’s say, after each new observation, you want to know an “average” value across the last <span class="math inline">\(L\)</span> observations.
Such a metric is known as <a href="https://en.wikipedia.org/wiki/Moving_average">a moving average</a>
(or rolling/running average).</p>
<p>The most popular moving average example is <a href="https://en.wikipedia.org/wiki/Moving_average#Simple_moving_average">the moving mean</a>.
It’s easy to efficiently implement this metric.
However, it has a major drawback: it’s not robust.
Outliers can easily spoil the moving mean and transform it into a meaningless and untrustable metric.</p>
<p>Fortunately, we have a good alternative: <a href="https://en.wikipedia.org/wiki/Moving_average#Moving_median">the moving median</a>.
Typically, it generates a stable and smooth series of values.
In the below figure, you can see the difference between the moving mean and the moving median on noisy data.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/img/example-light.png" target="_blank" class="imgldlink" alt="example">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/img/example-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/img/example-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/partitioning-heaps-quantile-estimator/img/example-light.png">
</picture>
</a>
</div>
</div>
<br />
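<p>For small windows, both metrics can be sketched naively in a few lines of Python.
This is only a baseline for comparison, not the Hardle-Steiger method:
the helper name <code>moving</code> is hypothetical, and the median is recomputed from scratch on every step.</p>

```python
from collections import deque
from statistics import mean, median


def moving(values, window, stat):
    """Apply `stat` to the last `window` observations after each new value."""
    buf = deque(maxlen=window)  # old observations are dropped automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(stat(buf))
    return out


data = [1, 2, 3, 100, 5]  # 100 is an outlier
print(moving(data, 3, mean))    # [1, 1.5, 2, 35, 36]
print(moving(data, 3, median))  # [1, 1.5, 2, 3, 5]
```

<p>The outlier drags the moving mean to 35 and 36, while the moving median stays within the range of the regular values.</p>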
<p>The moving median also has a drawback: it’s not easy to efficiently implement it.
Today we are going to discuss the Hardle-Steiger method to estimate the median
(memory: <span class="math inline">\(O(L)\)</span>, element processing complexity: <span class="math inline">\(O(log(L))\)</span>, median estimating complexity: <span class="math inline">\(O(1)\)</span>).
Also, we will learn how to calculate <em>the moving quantiles</em> based on this method.</p>
<p>In this post, you will find the following:</p>
<ul>
<li>An overview of the Hardle-Steiger method</li>
<li>A simple way to implement the Hardle-Steiger method</li>
<li>Moving quantiles inspired by the Hardle-Steiger method</li>
<li>How to process initial elements</li>
<li>Reference C# implementation</li>
</ul>Coverage of quantile confidence intervalshttps://aakinshin.net/posts/quantile-ci-coverage/Tue, 22 Dec 2020 00:00:00 +0000https://aakinshin.net/posts/quantile-ci-coverage/<p>There is a common <a href="https://en.wikipedia.org/wiki/Confidence_interval#Misunderstandings">misunderstanding</a>
that a 95% confidence interval is an interval that covers the true parameter value with 95% probability.
Meanwhile, the correct definition assumes that
the true parameter value will be covered by 95% of 95% confidence intervals <em>in the long run</em>.
These two statements sound similar, but there is a huge difference between them.
95% in this context is not a property of a single confidence interval.
Once you get a calculated interval, it may cover the true value (100% probability) or
it may not cover it (0% probability).
In fact, 95% is a <em>prediction</em> about the percentage of <em>future</em> confidence intervals
that cover the true value <em>in the long run</em>.</p>
<p>However, even if you know the correct definition, you may still run into trouble.
The first thing people usually forget is the “long run” part.
For example, if we collected 100 samples and calculated a 95% confidence interval of a parameter for each of them,
we shouldn’t expect that 95 of these intervals cover the true parameter value.
In fact, we can observe a situation when none of these intervals covers the true value.
Of course, this is an unlikely event, but if you automatically perform thousands of different experiments,
you will definitely get some extreme situations.</p>
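<p>The “long run” behavior can be illustrated with a small simulation.
The sketch below is a simplified setup chosen for illustration only:
it builds 95% confidence intervals for the mean of the standard normal distribution with a known standard deviation,
and the function name, batch sizes, and seed are all hypothetical.</p>

```python
import random
from statistics import NormalDist


def coverage_counts(batches=200, per_batch=100, n=10, level=0.95, seed=1):
    """For each batch, count how many CIs cover the true mean (which is 0)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z / n**0.5  # sigma = 1 is assumed to be known
    counts = []
    for _ in range(batches):
        covered = 0
        for _ in range(per_batch):
            m = sum(rng.gauss(0, 1) for _ in range(n)) / n
            if half_width >= abs(m):  # the interval [m - hw, m + hw] contains 0
                covered += 1
        counts.append(covered)
    return counts


counts = coverage_counts()
print(min(counts), max(counts))          # per-batch counts vary around 95
print(sum(counts) / (len(counts) * 100))  # overall share is close to 0.95
```

<p>Individual batches of 100 intervals rarely contain exactly 95 covering intervals; only the overall share approaches 95%.</p>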
<p>The second thing that may create trouble is the “prediction” part.
If weather forecasters predict that it will rain tomorrow, this does not mean that it will rain tomorrow.
The same works for statistical predictions.
The actual prediction reliability may depend on many factors.
If you estimate confidence intervals around the mean for the normal distribution, you are most likely safe.
However, if you estimate confidence intervals around quantiles for non-parametric distributions,
you should care about the following things:</p>
<ul>
<li>The used approach to estimate confidence intervals</li>
<li>The underlying distribution</li>
<li>The sample size</li>
<li>The position of the target quantile</li>
</ul>
<p>I <a href="https://aakinshin.net/posts/weighted-quantiles-ci/">have already shown</a> how to estimate
the confidence interval around the given quantile using the Maritz-Jarrett method.
It’s time to verify the reliability of this approach.
In this post, I’m going to show some Monte-Carlo simulations that evaluate the coverage percentage in different situations.</p>Statistical approaches for performance analysishttps://aakinshin.net/posts/statistics-for-performance/Tue, 15 Dec 2020 00:00:00 +0000https://aakinshin.net/posts/statistics-for-performance/<p>Software performance is a complex discipline that requires knowledge in different areas
from benchmarking to the internals of modern runtimes, operating systems, and hardware.
Surprisingly, the most difficult challenges in performance analysis are not about programming,
they are about mathematical statistics!</p>
<p>Many software developers can drill into performance problems and implement excellent optimizations,
but they do not always know how to verify these optimizations correctly.
This may not look like a problem in the case of a single performance investigation.
However, the situation becomes worse when developers try to set up an infrastructure that
should automatically find performance problems or prevent degradations from merging.
In order to make such an infrastructure reliable and useful,
it’s crucial to achieve an extremely low false-positive rate (otherwise, it’s not trustable)
and be able to detect most of the degradations (otherwise, it’s not so useful).
It’s not easy if you don’t know which statistical approaches should be used.
If you try to google it, you may find thousands of papers about statistics,
but only a small portion of them really works in practice.</p>
<p>In this post, I want to share some approaches that I use for performance analysis in everyday life.
I have been analyzing performance distributions for the last seven years,
and I have found a lot of approaches, metrics, and tricks that are nice to have
in your statistical toolbox.
I would not say that all of them are must-knows,
but they can definitely help you to improve the reliability of your statistical checks
in different problems of performance analysis.
Consider the below list as a letter to a younger version of myself with a brief list of topics that are good to learn.</p>Quantile confidence intervals for weighted sampleshttps://aakinshin.net/posts/weighted-quantiles-ci/Tue, 08 Dec 2020 00:00:00 +0000https://aakinshin.net/posts/weighted-quantiles-ci/<p><strong>Update 2021-07-06:
the approach was updated using the <a href="https://aakinshin.net/posts/kish-ess-weighted-quantiles/">Kish’s effective sample size</a>.</strong></p>
<p>When you work with non-parametric distributions,
quantile estimations are essential to get the main distribution properties.
Once you get the estimation values, you may be interested in measuring the accuracy of these estimations.
Without it, it’s hard to understand how trustable the obtained values are.
One of the most popular ways to evaluate accuracy is confidence interval estimation.</p>
<p>Now imagine that you collect some measurements every day.
Each day you get a small sample of values that is not enough to get accurate daily quantile estimations.
However, the full time-series over the last several weeks has a decent size.
You suspect that past measurements should be similar to today’s measurements,
but you are not 100% sure about it.
You feel a temptation to extend the up-to-date sample by the previously collected values,
but it may spoil the estimation (e.g., in the case of recent change points or positive/negative trends).</p>
<p>One of the possible approaches in this situation is to use <em>weighted samples</em>.
This assumes that we add past measurements to the “today sample,”
but these values should have smaller weight.
The older a measurement is, the smaller the weight it gets.
If you have consistent values across the last several days,
this approach works like a charm.
If you have any recent changes, you can detect such situations by huge confidence intervals
due to the sample inconsistency.</p>
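<p>One possible way to assign such weights is exponential decay.
The sketch below is an assumption made for illustration rather than the scheme from the post:
the function names and the <code>half_life</code> parameter are hypothetical.
It also computes Kish’s effective sample size,
<span class="math inline">\(n_{\textrm{eff}} = (\sum w_i)^2 / \sum w_i^2\)</span>,
which shows how many equally weighted observations the weighted sample is roughly worth.</p>

```python
def decay_weights(ages_in_days, half_life=7.0):
    """Exponential decay: the weight halves every `half_life` days (assumed scheme)."""
    return [0.5 ** (age / half_life) for age in ages_in_days]


def kish_effective_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


ages = [0, 0, 0, 1, 1, 2, 7, 14]  # 0 means "measured today"
weights = decay_weights(ages)
print(weights)
print(kish_effective_size(weights))
```

<p>With equal weights, the effective sample size equals the actual sample size; the faster the weights decay, the smaller it becomes.</p>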
<p>So, how do we estimate confidence intervals around quantiles for the weighted samples?
In one of the previous posts, I have already shown how to <a href="https://aakinshin.net/posts/weighted-quantiles/">estimate quantiles on weighted samples</a>.
In this post, I will show how to estimate quantile confidence intervals for weighted samples.</p>Quantile absolute deviation: estimating statistical dispersion around quantileshttps://aakinshin.net/posts/qad/Tue, 01 Dec 2020 00:00:00 +0000https://aakinshin.net/posts/qad/<p>There are many different metrics for <a href="https://en.wikipedia.org/wiki/Statistical_dispersion">statistical dispersion</a>.
The most famous one is the <a href="https://en.wikipedia.org/wiki/Standard_deviation">standard deviation</a>.
The standard deviation is the most popular way to describe the spread around the mean when
you work with normally distributed data.
However, if you work with non-normal distributions, this metric may be misleading.</p>
<p>In the world of non-parametric distributions,
the most common measure of <a href="https://en.wikipedia.org/wiki/Central_tendency">central tendency</a> is the median.
For the median, you can describe dispersion using the
<a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation around the median</a> (MAD).
It works great if the median is the only <a href="https://en.wikipedia.org/wiki/Summary_statistics">summary statistic</a> that you care about.
However, if you work with multimodal distributions
(they can be detected using the <a href="https://aakinshin.net/posts/lowland-multimodality-detection/">lowland multimodality detector</a>),
you may be interested in other quantiles as well.
So, it makes sense to learn how to describe dispersion around the given quantile.
Which metric should we choose?</p>
<p>Recently, I came up with a great solution to this problem.
We can generalize the median absolute deviation into the quantile absolute deviation (QAD) around the given quantile based on the Harrell-Davis quantile estimator.
I will show how to calculate it, how to interpret it, and how to get insights about distribution properties
from images like this one:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/qad/img/modal5-light.png" target="_blank" class="imgldlink" alt="modal5">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/qad/img/modal5-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/qad/img/modal5-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/qad/img/modal5-light.png">
</picture>
</a>
</div>
</div>
<br />P² quantile estimator: estimating the median without storing valueshttps://aakinshin.net/posts/p2-quantile-estimator/Tue, 24 Nov 2020 00:00:00 +0000https://aakinshin.net/posts/p2-quantile-estimator/<p><strong>Update: the estimator accuracy could be improved using a bunch of <a href="https://aakinshin.net/tags/research-p2qe/">patches</a>.</strong></p>
<p>Imagine that you are implementing performance telemetry in your application.
There is an operation that is executed millions of times, and you want to get its “average” duration.
It’s not a good idea to use the arithmetic mean because the obtained value can be easily spoiled by outliers.
It’s much better to use the median, which is one of the most robust ways to describe the average.</p>
<p>The straightforward median estimation approach requires storing all the values.
In our case, it’s a bad idea to keep all the values because it will significantly increase the memory footprint.
Such telemetry is harmful because it may become a new bottleneck instead of monitoring the actual performance.</p>
<p>Another way to get the median value is to use a sequential quantile estimator
(also known as an online quantile estimator or a streaming quantile estimator).
This is an algorithm that allows calculating the median value (or any other quantile value)
using a fixed amount of memory.
Of course, it provides only an approximation of the real median value,
but it’s usually enough for typical telemetry use cases.</p>
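<p>To make the fixed-memory idea concrete, here is a minimal sketch of one such estimator: the P² algorithm by Jain and Chlamtac, which tracks only five markers.
The class name <code>P2QuantileEstimator</code> and its methods are hypothetical, and the code is a plain transcription of the classic algorithm for illustration, not the implementation from this blog.</p>

```python
import random
import statistics


class P2QuantileEstimator:
    """Sequential estimate of the p-th quantile using five markers."""

    def __init__(self, p):
        self.p = p
        self.count = 0
        self.q = []                                  # marker heights
        self.n = [0, 1, 2, 3, 4]                     # actual marker positions
        self.ns = [0, 2 * p, 4 * p, 2 + 2 * p, 4]    # desired marker positions
        self.dns = [0, p / 2, p, (1 + p) / 2, 1]     # desired position increments

    def add(self, x):
        self.count += 1
        if self.count <= 5:                          # warm-up: keep the first five values sorted
            self.q.append(x)
            self.q.sort()
            return
        q, n = self.q, self.n
        if x < q[0]:                                 # find the cell k that contains x
            q[0], k = x, 0
        elif x < q[1]:
            k = 0
        elif x < q[2]:
            k = 1
        elif x < q[3]:
            k = 2
        elif x <= q[4]:
            k = 3
        else:
            q[4], k = x, 3
        for i in range(k + 1, 5):                    # shift markers to the right of x
            n[i] += 1
        for i in range(5):
            self.ns[i] += self.dns[i]
        for i in (1, 2, 3):                          # adjust the three inner markers
            d = self.ns[i] - n[i]
            if (d >= 1 and n[i + 1] - n[i] > 1) or (d <= -1 and n[i - 1] - n[i] < -1):
                d = 1 if d > 0 else -1
                candidate = self._parabolic(i, d)
                if not q[i - 1] < candidate < q[i + 1]:
                    candidate = self._linear(i, d)   # fall back to linear interpolation
                q[i] = candidate
                n[i] += d

    def _parabolic(self, i, d):
        q, n = self.q, self.n
        return q[i] + d / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + d) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - d) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def _linear(self, i, d):
        q, n = self.q, self.n
        return q[i] + d * (q[i + d] - q[i]) / (n[i + d] - n[i])

    def estimate(self):
        if self.count == 0:
            raise ValueError("no observations yet")
        if self.count <= 5:                          # too few values: use the sorted buffer
            return self.q[round((self.count - 1) * self.p)]
        return self.q[2]


# Simulated "durations" (an assumption for the demo): compare with the exact median.
rng = random.Random(42)
durations = [rng.lognormvariate(0.0, 1.0) for _ in range(20000)]
estimator = P2QuantileEstimator(0.5)
for value in durations:
    estimator.add(value)
print(estimator.estimate(), statistics.median(durations))
```

<p>The estimator keeps five heights and five positions no matter how many values arrive, which is the trade-off described above: constant memory in exchange for an approximate result.</p>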
<p>In this post, I will show one of the simplest sequential quantile estimators that is called the P² quantile estimator
(or the Piecewise-Parabolic quantile estimator).</p>Plain-text summary notation for multimodal distributionshttps://aakinshin.net/posts/modality-summary-notation/Tue, 17 Nov 2020 00:00:00 +0000https://aakinshin.net/posts/modality-summary-notation/<p>Let’s say you collected a lot of data and want to explore the underlying distributions of collected samples.
If you have only a few distributions, the best way to do that is to look at the density plots
(expressed via histograms, kernel density estimations, or <a href="https://aakinshin.net/posts/qrde-hd/">quantile-respectful density estimations</a>).
However, it’s not always possible.</p>
<p>Suppose you have to process dozens, hundreds, or even thousands of distributions.
In that case,
it may be extremely time-consuming to manually check visualizations of each distribution.
If you analyze distributions from the command line or send notifications about suspicious samples,
it may be impossible to embed images in the reports.
In these cases, there is a need to present a distribution using plain text.</p>
<p>One way to do that is plain text histograms.
Unfortunately, this kind of visualization may occupy a lot of space.
In complicated cases, you may need 20 or 30 lines for a single distribution.</p>
<p>Another way is to present classic <a href="https://en.wikipedia.org/wiki/Summary_statistics">summary statistics</a>
like mean or median, standard deviation or median absolute deviation, quantiles, skewness, and kurtosis.
There is another problem here:
without experience, it’s hard to reconstruct the true distribution shape based on these values.
Even if you are an experienced researcher, the statistical metrics may become misleading in the case of multimodal distributions.
Multimodality is one of the most severe challenges in distribution analysis because it distorts basic summary statistics.
It’s important to not only find such distributions but also have a way to present brief information about multimodality effects.</p>
<p>So, how can we condense the underlying distribution shape of a given sample to a short text line?
I didn’t manage to find an approach that works fine in my cases, so I came up with my own notation.
Most of the interpretation problems in my experiments arise from multimodality and outliers,
so I decided to focus on these two things and specifically highlight them.
Let’s consider this plot:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/modality-summary-notation/img/thumbnail-light.svg" target="_blank" class="imgldlink" alt="thumbnail">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/modality-summary-notation/img/thumbnail-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/modality-summary-notation/img/thumbnail-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/modality-summary-notation/img/thumbnail-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>I suggest describing it like this:</p>
<div class="highlight"><pre class="chroma"><code class="language-bash" data-lang="bash"><span class="o">{</span>1.00, 2.00<span class="o">}</span> + <span class="o">[</span>7.16<span class="p">;</span> 13.12<span class="o">]</span>_100 + <span class="o">{</span>19.00<span class="o">}</span> + <span class="o">[</span>27.69<span class="p">;</span> 32.34<span class="o">]</span>_100 + <span class="o">{</span>37.00..39.00<span class="o">}</span>_3
</code></pre></div><p>Let me explain the suggested notation in detail.</p>Intermodal outliershttps://aakinshin.net/posts/intermodal-outliers/Tue, 10 Nov 2020 00:00:00 +0000https://aakinshin.net/posts/intermodal-outliers/<p><a href="https://en.wikipedia.org/wiki/Outlier">Outlier</a> analysis is a typical step in distribution exploration.
Usually, we work with the “lower outliers” (extremely low values) and the “upper outliers” (extremely high values).
However, outliers are not always extreme values.
In the general case, an outlier is a value that significantly differs from other values in the same sample.
In the case of multimodal distribution, we can also consider outliers in the middle of the distribution.
Let’s call such outliers that we found between modes the “<em>intermodal outliers</em>.”</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/intermodal-outliers/img/step4-light.svg" target="_blank" class="imgldlink" alt="step4">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/intermodal-outliers/img/step4-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/intermodal-outliers/img/step4-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/intermodal-outliers/img/step4-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>Look at the above density plot.
It’s a bimodal distribution that is formed as a combination of two unimodal distributions.
Each of the unimodal distributions may have its own lower and upper outliers.
When we merge them, the upper outliers of the first distribution and the lower outliers of the second distribution
stop being lower or upper outliers.
However, if these values don’t belong to the modes, they still are a subject of interest.
In this post, I will show you how to detect such intermodal outliers
and how they can be used to form a better distribution description.</p>Lowland multimodality detectionhttps://aakinshin.net/posts/lowland-multimodality-detection/Tue, 03 Nov 2020 00:00:00 +0000https://aakinshin.net/posts/lowland-multimodality-detection/<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/lowland-multimodality-detection/img/data5-light.svg" target="_blank" class="imgldlink" alt="data5">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/lowland-multimodality-detection/img/data5-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/lowland-multimodality-detection/img/data5-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/lowland-multimodality-detection/img/data5-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>Multimodality is an essential feature of a distribution, which may create many troubles during automatic analysis.
One of the best ways to work with such distributions is to detect all the modes in advance based on the given samples.
Unfortunately, this problem is much harder than it looks.</p>
<p>I tried many different approaches for multimodality detection, but none of them was good enough.
During the past several years, my approach of choice was the <a href="http://www.brendangregg.com/FrequencyTrails/modes.html">mvalue-based modal test</a> by Brendan Gregg.
It works nicely in simple cases, but I was constantly stumbling over noisy samples where this algorithm doesn’t produce reliable results.
Also, it has some limitations that make it inapplicable to some corner cases.</p>
<p>So, I needed a better approach.
Here are my main requirements:</p>
<ul>
<li>It should detect the exact mode locations and ranges</li>
<li>It should provide reliable results even on noisy samples</li>
<li>It should be able to detect multimodality even when some modes are extremely close to each other</li>
<li>It should work out of the box without tricky parameter tuning for each specific distribution</li>
</ul>
<p>I failed to find such an algorithm anywhere, so I came up with my own!
The current working title is “the lowland multimodality detector.”
It takes an estimation of the probability density function (PDF) and tries to find “lowlands” (areas that are much lower than neighboring peaks).
Next, it splits the plot by these lowlands and detects modes between them.
For the PDF estimation, it uses the <a href="https://aakinshin.net/posts/qrde-hd/">quantile-respectful density estimation based on the Harrell-Davis quantile estimator</a> (QRDE-HD).
Let me explain how it works in detail.</p>Quantile-respectful density estimation based on the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/qrde-hd/Tue, 27 Oct 2020 00:00:00 +0000https://aakinshin.net/posts/qrde-hd/<p>The idea of this post was born when I was working on a presentation for my recent <a href="https://dotnext.ru/en/">DotNext</a> <a href="https://www.youtube.com/watch?v=gc3yVybPuaY&list=PL21xssNXOJNGUROqzSTOC8uZL4W2QZpvK&index=1">talk</a>.
It had a <a href="https://slides.aakinshin.net/dotnext-piter2020/#193">slide</a> with a density plot like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/qrde-hd/img/riddle-light.png" target="_blank" class="imgldlink" alt="riddle">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/qrde-hd/img/riddle-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/qrde-hd/img/riddle-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/qrde-hd/img/riddle-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Here we can see a density plot based on a sample with highlighted <a href="https://en.wikipedia.org/wiki/Decile">decile</a> locations that split the plot into 10 equal parts.
Before the conference, the talk was reviewed by <a href="https://twitter.com/VladimirSitnikv">@VladimirSitnikv</a>.
He raised a reasonable concern: it doesn’t look like all the density plot segments are equal and contain exactly 10% of the whole plot.
And he was right!</p>
<p>However, I didn’t make any miscalculations.
I generated a real sample with 61 elements.
Next, I built a density plot with the kernel density estimation (KDE) using the Sheather & Jones method and the normal kernel.
Next, I calculated decile values using the Harrell-Davis quantile estimator.
Although both the density plot and the decile values are calculated correctly and are consistent with the sample,
they are not consistent with each other!
Indeed, such a density plot is just an estimation of the underlying distribution.
It has its own decile values, which are not equal to the sample decile values regardless of the used quantile estimator.
This problem is common for different kinds of visualizations that present density and quantiles at the same time (e.g., <a href="https://towardsdatascience.com/violin-plots-explained-fb1d115e023d">violin plots</a>).</p>
<p>It leads us to a question: how should we present the shape of our data together with quantile values without confusing inconsistency in the final image?
Today I will present a good solution: we should use the quantile-respectful density estimation based on the Harrell-Davis quantile estimator!
I know the title is a bit long, but it’s not as complicated as it sounds.
In this post, I will show how to build such plots.
Also, I will compare them to the classic histograms and kernel density estimations.
As a bonus, I will demonstrate how awesome these plots are for multimodality detection.</p>Misleading histogramshttps://aakinshin.net/posts/misleading-histograms/Tue, 20 Oct 2020 00:00:00 +0000https://aakinshin.net/posts/misleading-histograms/<p>Below you see two histograms.
What could you say about them?</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/misleading-histograms/img/hist-riddle-light.svg" target="_blank" class="imgldlink" alt="hist-riddle">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/misleading-histograms/img/hist-riddle-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/misleading-histograms/img/hist-riddle-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/misleading-histograms/img/hist-riddle-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>Most likely, you say that the first histogram is based on a uniform distribution,
and the second one is based on a multimodal distribution with four modes.
Although this is not obvious from the plots,
both histograms are based on the same sample:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="m">20.13</span><span class="p">,</span> <span class="m">19.94</span><span class="p">,</span> <span class="m">20.03</span><span class="p">,</span> <span class="m">20.06</span><span class="p">,</span> <span class="m">20.04</span><span class="p">,</span> <span class="m">19.98</span><span class="p">,</span> <span class="m">20.15</span><span class="p">,</span> <span class="m">19.99</span><span class="p">,</span> <span class="m">20.20</span><span class="p">,</span> <span class="m">19.99</span><span class="p">,</span> <span class="m">20.13</span><span class="p">,</span> <span class="m">20.22</span><span class="p">,</span> <span class="m">19.86</span><span class="p">,</span> <span class="m">19.97</span><span class="p">,</span> <span class="m">19.98</span><span class="p">,</span> <span class="m">20.06</span><span class="p">,</span>
<span class="m">29.97</span><span class="p">,</span> <span class="m">29.73</span><span class="p">,</span> <span class="m">29.75</span><span class="p">,</span> <span class="m">30.13</span><span class="p">,</span> <span class="m">29.96</span><span class="p">,</span> <span class="m">29.82</span><span class="p">,</span> <span class="m">29.98</span><span class="p">,</span> <span class="m">30.12</span><span class="p">,</span> <span class="m">30.18</span><span class="p">,</span> <span class="m">29.95</span><span class="p">,</span> <span class="m">29.97</span><span class="p">,</span> <span class="m">29.82</span><span class="p">,</span> <span class="m">30.04</span><span class="p">,</span> <span class="m">29.93</span><span class="p">,</span> <span class="m">30.04</span><span class="p">,</span> <span class="m">30.07</span><span class="p">,</span>
<span class="m">40.10</span><span class="p">,</span> <span class="m">39.93</span><span class="p">,</span> <span class="m">40.05</span><span class="p">,</span> <span class="m">39.82</span><span class="p">,</span> <span class="m">39.92</span><span class="p">,</span> <span class="m">39.91</span><span class="p">,</span> <span class="m">39.75</span><span class="p">,</span> <span class="m">40.00</span><span class="p">,</span> <span class="m">40.02</span><span class="p">,</span> <span class="m">39.96</span><span class="p">,</span> <span class="m">40.07</span><span class="p">,</span> <span class="m">39.92</span><span class="p">,</span> <span class="m">39.86</span><span class="p">,</span> <span class="m">40.04</span><span class="p">,</span> <span class="m">39.91</span><span class="p">,</span> <span class="m">40.14</span><span class="p">,</span>
<span class="m">49.95</span><span class="p">,</span> <span class="m">50.06</span><span class="p">,</span> <span class="m">50.03</span><span class="p">,</span> <span class="m">49.92</span><span class="p">,</span> <span class="m">50.15</span><span class="p">,</span> <span class="m">50.06</span><span class="p">,</span> <span class="m">50.00</span><span class="p">,</span> <span class="m">50.02</span><span class="p">,</span> <span class="m">50.06</span><span class="p">,</span> <span class="m">50.00</span><span class="p">,</span> <span class="m">49.70</span><span class="p">,</span> <span class="m">50.02</span><span class="p">,</span> <span class="m">49.96</span><span class="p">,</span> <span class="m">50.01</span><span class="p">,</span> <span class="m">50.05</span><span class="p">,</span> <span class="m">50.13</span>
</code></pre></div><p>Thus, the only difference between histograms is the offset!</p>
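<p>The offset effect can be reproduced with a few lines of code.
This is a minimal sketch: the helper <code>histogram_counts</code> is hypothetical, and the bin width of 5 and the two offsets are assumptions chosen for illustration, not necessarily the settings behind the plots above.</p>

```python
import math
from collections import Counter

sample = [
    20.13, 19.94, 20.03, 20.06, 20.04, 19.98, 20.15, 19.99, 20.20, 19.99, 20.13, 20.22, 19.86, 19.97, 19.98, 20.06,
    29.97, 29.73, 29.75, 30.13, 29.96, 29.82, 29.98, 30.12, 30.18, 29.95, 29.97, 29.82, 30.04, 29.93, 30.04, 30.07,
    40.10, 39.93, 40.05, 39.82, 39.92, 39.91, 39.75, 40.00, 40.02, 39.96, 40.07, 39.92, 39.86, 40.04, 39.91, 40.14,
    49.95, 50.06, 50.03, 49.92, 50.15, 50.06, 50.00, 50.02, 50.06, 50.00, 49.70, 50.02, 49.96, 50.01, 50.05, 50.13,
]


def histogram_counts(values, width, offset):
    """Count values per bin [offset + k * width, offset + (k + 1) * width)."""
    counts = Counter(math.floor((v - offset) / width) for v in values)
    return [counts.get(k, 0) for k in range(max(counts) + 1)]


# Same sample, same bin width, two different offsets.
print(histogram_counts(sample, width=5, offset=15.0))  # bin edges fall on 20, 30, 40, 50
print(histogram_counts(sample, width=5, offset=17.5))  # each cluster sits inside one bin
```

<p>With the first offset, every cluster is split between two neighboring bins, so no bin is empty and the clusters blur together.
With the second offset, each cluster falls into a single bin separated from the next one by an empty bin, which looks like four distinct modes.</p>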
<p>Visualization is a simple way to understand the shape of your data.
Unfortunately, this way may easily become a slippery slope.
In the <a href="https://aakinshin.net/posts/kde-bw/">previous post</a>, I have shown how density plots may deceive you when the bandwidth is poorly chosen.
Today, we talk about histograms and why you can’t trust them in the general case.</p>The importance of kernel density estimation bandwidthhttps://aakinshin.net/posts/kde-bw/Tue, 13 Oct 2020 00:00:00 +0000https://aakinshin.net/posts/kde-bw/<p>Below you see two kernel density estimations.
What could you say about them?</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/kde-bw/img/kde-riddle-light.svg" target="_blank" class="imgldlink" alt="kde-riddle">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/kde-bw/img/kde-riddle-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/kde-bw/img/kde-riddle-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/kde-bw/img/kde-riddle-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>Most likely, you say that the first plot is based on a uniform distribution,
and the second one is based on a multimodal distribution with four modes.
Although this is not obvious from the plots,
both density plots are based on the same sample:</p>
<div class="highlight"><pre class="chroma"><code class="language-txt" data-lang="txt">21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03
</code></pre></div><p>The only difference between plots is in <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation#Bandwidth_selection">bandwidth selection</a>!</p>
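<p>Here is a small sketch of how the bandwidth alone changes the apparent number of modes for this sample.
It uses a hand-written Gaussian kernel density estimation; the helper names and the two bandwidth values (2 and 10) are assumptions for illustration, not the values used for the plots above.</p>

```python
import math

sample = [
    21.370, 19.435, 20.363, 20.632, 20.404, 19.893, 21.511, 19.905, 22.018, 19.93,
    31.304, 32.286, 28.611, 29.721, 29.866, 30.635, 29.715, 27.343, 27.559, 31.32,
    39.693, 38.218, 39.828, 41.214, 41.895, 39.569, 39.742, 38.236, 40.460, 39.36,
    50.455, 50.704, 51.035, 49.391, 50.504, 48.282, 49.215, 49.149, 47.585, 50.03,
]


def gaussian_kde(values, bandwidth, x):
    """Kernel density estimate at point x with a normal kernel."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm


def count_modes(values, bandwidth, lo=10.0, hi=60.0, steps=500):
    """Count local maxima of the estimated density on a regular grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    density = [gaussian_kde(values, bandwidth, x) for x in grid]
    return sum(
        1 for i in range(1, steps)
        if density[i - 1] < density[i] >= density[i + 1]
    )


for bandwidth in (2.0, 10.0):
    print(bandwidth, count_modes(sample, bandwidth))
```

<p>A small bandwidth keeps the four clusters apart, while a large one smooths them into a single wide bump, even though the data never changes.</p>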
<p>Bandwidth selection is crucial when you are trying to visualize your distributions.
Unfortunately, most people just call a regular function to build a density plot and don’t think about how the bandwidth will be chosen.
As a result, the plot may present data in the wrong way, which may lead to incorrect conclusions.
Let’s discuss bandwidth selection in detail and figure out how to improve the correctness of your density plots.
In this post, we will cover the following topics:</p>
<ul>
<li>Kernel density estimation</li>
<li>How bandwidth selection affects plot smoothness</li>
<li>Which bandwidth selectors can we use</li>
<li>Which bandwidth selectors should we use</li>
<li>Insidious default bandwidth selectors in statistical packages</li>
</ul>The median absolute deviation value of the Gumbel distributionhttps://aakinshin.net/posts/gumbel-mad/Tue, 06 Oct 2020 00:00:00 +0000https://aakinshin.net/posts/gumbel-mad/<p>The <a href="https://en.wikipedia.org/wiki/Gumbel_distribution">Gumbel distribution</a> is not only a useful model in the <a href="https://en.wikipedia.org/wiki/Extreme_value_theory">extreme value theory</a>,
but it’s also a nice example of a slightly right-skewed distribution (skewness <span class="math inline">\(\approx 1.14\)</span>).
Here is its density plot:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/gumbel-mad/img/gumbel-light.svg" target="_blank" class="imgldlink" alt="gumbel">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/gumbel-mad/img/gumbel-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/gumbel-mad/img/gumbel-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='600'
src="https://aakinshin.net/posts/gumbel-mad/img/gumbel-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>In some of my statistical experiments, I like to use the Gumbel distribution as a sample generator for hypothesis checking or unit tests.
I also prefer the <a href="https://en.wikipedia.org/wiki/Median_absolute_deviation">median absolute deviation</a> (MAD) over the standard deviation as a measure of dispersion because it’s more robust in the case of non-parametric distributions.
Numerical hypothesis verification often requires the exact value of the median absolute deviation of the original distribution.
I didn’t find this value in the reference tables, so I decided to do another exercise and derive it myself.
In this post, you will find a short derivation and the result (spoiler: the exact value is <code>0.767049251325708 * β</code>).
The general approach of the MAD derivation is common for most distributions, so it can be easily reused.</p>Weighted quantile estimatorshttps://aakinshin.net/posts/weighted-quantiles/Tue, 29 Sep 2020 00:00:00 +0000https://aakinshin.net/posts/weighted-quantiles/<p><strong>Update 2021-07-06:
the approach was updated using the <a href="https://aakinshin.net/posts/kish-ess-weighted-quantiles/">Kish’s effective sample size</a>.</strong></p>
<p>In this post, I will show how to calculate weighted quantile estimates and how to use them in practice.</p>
<p>Let’s start with a problem from real life.
Imagine that you measure the total duration of a unit test executed daily on a CI server.
Every day you get a single number that corresponds to the test duration from the latest revision for this day:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/weighted-quantiles/img/moving1-light.svg" target="_blank" class="imgldlink" alt="moving1">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/weighted-quantiles/img/moving1-dark.svg"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/weighted-quantiles/img/moving1-light.svg"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='600'
src="https://aakinshin.net/posts/weighted-quantiles/img/moving1-light.svg">
</picture>
</a>
</div>
</div>
<br />
<p>You collect a history of such measurements for 100 days.
Now you want to describe the “actual” distribution of the performance measurements.</p>
<p>However, for the latest “actual” revision, you have only a single measurement, which is not enough to build a distribution.
Also, you can’t build a distribution based on the last N measurements because they can contain change points that will spoil your results.
So, what you really want to do is to use all the measurements, but older values should have a lower impact on the final distribution form.</p>
<p>Such a problem can be solved using the weighted quantiles!
This powerful approach can be applied to any time series regardless of the domain area.
In this post, we learn how to calculate and apply weighted quantiles.</p>Nonparametric Cohen's d-consistent effect sizehttps://aakinshin.net/posts/nonparametric-effect-size/Thu, 25 Jun 2020 00:00:00 +0000https://aakinshin.net/posts/nonparametric-effect-size/<p><strong>Update: the second part of this post is available <a href="https://aakinshin.net/posts/nonparametric-effect-size2/">here</a>.</strong></p>
<p>The effect size is a common way to describe a difference between two distributions.
When these distributions are normal, one of the most popular approaches to express the effect size is <a href="https://en.wikipedia.org/wiki/Effect_size#Cohen's_d">Cohen’s d</a>.
Unfortunately, it doesn’t work great for non-normal distributions.</p>
<p>In this post, I will show a robust Cohen’s d-consistent effect size formula for nonparametric distributions.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/nonparametric-effect-size/img/blackboard.png" target="_blank" alt="blackboard">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/nonparametric-effect-size/img/blackboard.png" />
</a>
</div>
</div>
<br />DoubleMAD outlier detector based on the Harrell-Davis quantile estimatorhttps://aakinshin.net/posts/harrell-davis-double-mad-outlier-detector/Mon, 22 Jun 2020 00:00:00 +0000https://aakinshin.net/posts/harrell-davis-double-mad-outlier-detector/<p>Outlier detection is an important step in data processing.
Unfortunately, if the distribution is not normal (e.g., right-skewed and heavy-tailed), it’s hard to choose
a robust outlier detection algorithm that will not be affected by tricky distribution properties.
During the last several years, I tried many different approaches, but I was not satisfied with their results.
Finally, I found an algorithm about which I have (almost) no complaints.
It’s based on the <em>double median absolute deviation</em> and the <em>Harrell-Davis quantile estimator</em>.
In this post, I will show how it works and why it’s better than some other approaches.</p>How ListSeparator Depends on Runtime and Operating Systemhttps://aakinshin.net/posts/how-listseparator-depends-on-runtime-and-operating-system/Wed, 20 May 2020 00:00:00 +0000https://aakinshin.net/posts/how-listseparator-depends-on-runtime-and-operating-system/<p><em>This blog post was <a href="https://blog.jetbrains.com/dotnet/2020/05/20/listseparator-depends-runtime-operating-system/">originally posted</a> on <a href="https://blog.jetbrains.com/dotnet/">JetBrains .NET blog</a>.</em></p>
<p>In the two previous blog posts from this series, we discussed how socket error codes and sorting order depend on the runtime and operating systems. For some, it may be obvious that some things are indeed specific to the operating system or the runtime, but often these issues come as a surprise and are only discovered when running our code on different systems.
An interesting example that may bite us at runtime is using <code>ListSeparator</code> in our code. It should give us a common separator for list elements in a string. But is it really common?
Let’s start our investigation by printing <code>ListSeparator</code> for the Russian language:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="k">new</span> <span class="n">CultureInfo</span><span class="p">(</span><span class="s">"ru-ru"</span><span class="p">).</span><span class="n">TextInfo</span><span class="p">.</span><span class="n">ListSeparator</span><span class="p">);</span>
</code></pre></div><p>On Windows, you will get the same result for .NET Framework, .NET Core, and Mono: the <code>ListSeparator</code> is <code>;</code> (a semicolon). You will also get a semicolon on Mono+Unix. However, on .NET Core+Unix, you will get a <a href="https://en.wikipedia.org/wiki/Non-breaking_space">non-breaking space</a>.</p>How Sorting Order Depends on Runtime and Operating Systemhttps://aakinshin.net/posts/how-sorting-order-depends-on-runtime-and-operating-system/Wed, 13 May 2020 00:00:00 +0000https://aakinshin.net/posts/how-sorting-order-depends-on-runtime-and-operating-system/<p><em>This blog post was <a href="https://blog.jetbrains.com/dotnet/2020/05/13/sorting-order-depends-runtime-operating-system/">originally posted</a> on <a href="https://blog.jetbrains.com/dotnet/">JetBrains .NET blog</a>.</em></p>
<p>In <a href="https://www.jetbrains.com/rider/">Rider</a>, we have unit tests that enumerate files in your project and dump a sorted list of these files. In one of our test projects, we had the following files: <code>jquery-1.4.1.js</code>, <code>jquery-1.4.1.min.js</code>, <code>jquery-1.4.1-vsdoc.js</code>. On Windows, .NET Framework, .NET Core, and Mono produce the same sorted list:</p>
<div class="highlight"><pre class="chroma"><code class="language-bash" data-lang="bash">jquery-1.4.1.js
jquery-1.4.1.min.js
jquery-1.4.1-vsdoc.js
</code></pre></div>How Socket Error Codes Depend on Runtime and Operating Systemhttps://aakinshin.net/posts/how-socket-error-codes-depend-on-runtime-and-operating-system/Mon, 27 Apr 2020 00:00:00 +0000https://aakinshin.net/posts/how-socket-error-codes-depend-on-runtime-and-operating-system/<p><em>This blog post was <a href="https://blog.jetbrains.com/dotnet/2020/04/27/socket-error-codes-depend-runtime-operating-system/">originally posted</a> on <a href="https://blog.jetbrains.com/dotnet/">JetBrains .NET blog</a>.</em></p>
<p><a href="https://www.jetbrains.com/rider/">Rider</a> consists of several processes that send messages to each other via sockets. To ensure the reliability of the whole application, it’s important to properly handle all the socket errors. In our codebase, we had the following code which was adopted from <a href="https://github.com/mono/debugger-libs/blob/master/Mono.Debugging.Soft/SoftDebuggerSession.cs#L273">Mono Debugger Libs</a> and helps us communicate with debugger processes:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">protected</span> <span class="k">virtual</span> <span class="kt">bool</span> <span class="n">ShouldRetryConnection</span> <span class="p">(</span><span class="n">Exception</span> <span class="n">ex</span><span class="p">,</span> <span class="kt">int</span> <span class="n">attemptNumber</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">sx</span> <span class="p">=</span> <span class="n">ex</span> <span class="k">as</span> <span class="n">SocketException</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sx</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sx</span><span class="p">.</span><span class="n">ErrorCode</span> <span class="p">==</span> <span class="m">10061</span><span class="p">)</span> <span class="c1">//connection refused
</span><span class="c1"></span> <span class="k">return</span> <span class="k">true</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="k">false</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div><p>In the case of a failed connection because of a “ConnectionRefused” error, we are retrying the connection attempt. It works fine with .NET Framework and Mono. However, once we migrated to .NET Core, this method no longer correctly detects the “connection refused” situation on Linux and macOS. If we open the <code>SocketException</code> <a href="https://docs.microsoft.com/en-us/dotnet/api/system.net.sockets.socketexception?view=netframework-4.8">documentation</a>, we will learn that this class has three different properties with error codes:</p>
<ul>
<li><code>SocketError SocketErrorCode</code>: Gets the error code that is associated with this exception.</li>
<li><code>int ErrorCode</code>: Gets the error code that is associated with this exception.</li>
<li><code>int NativeErrorCode</code>: Gets the Win32 error code associated with this exception.</li>
</ul>
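<p>To compare these properties on a particular runtime and operating system, a minimal sketch along the following lines can print all three for the same exception. The <code>SocketErrorDemo</code> class and its <code>Describe</code> helper are hypothetical names introduced only for illustration; they are not part of the original code. No output is shown here because the values may depend on the runtime and the platform.</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs">using System;
using System.Net.Sockets;

public static class SocketErrorDemo
{
    // Hypothetical helper (an assumption, not from the original code):
    // formats the three error-code properties of a SocketException.
    public static string Describe(SocketException ex) =&gt;
        $"SocketErrorCode={ex.SocketErrorCode}, " +
        $"ErrorCode={ex.ErrorCode}, " +
        $"NativeErrorCode={ex.NativeErrorCode}";

    public static void Main()
    {
        // SocketError.ConnectionRefused has the numeric value 10061,
        // the same constant that ShouldRetryConnection compares against.
        var ex = new SocketException((int)SocketError.ConnectionRefused);
        Console.WriteLine(Describe(ex));
    }
}
</code></pre></div>
<p>Running the same snippet on .NET Framework, Mono, and .NET Core on each target operating system is a quick way to see whether the three properties agree.</p>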
<p>What's the difference between these properties? Should we expect different values on different runtimes or different operating systems? Which one should we use in production? Why do we have problems with <code>ShouldRetryConnection</code> on .NET Core? Let's figure it all out!</p>.NET Core performance revolution in Rider 2020.1https://aakinshin.net/posts/netcore-performance-revolution-in-rider-2020-1/Tue, 14 Apr 2020 00:00:00 +0000https://aakinshin.net/posts/netcore-performance-revolution-in-rider-2020-1/<p><em>This blog post was <a href="https://blog.jetbrains.com/dotnet/2020/04/14/net-core-performance-revolution-rider-2020-1/">originally posted</a> on <a href="https://blog.jetbrains.com/dotnet/">JetBrains .NET blog</a>.</em></p>
<p>Many <a href="https://www.jetbrains.com/rider/">Rider</a> users may know that <a href="https://www.codemag.com/Article/1811091/Building-a-.NET-IDE-with-JetBrains-Rider">the IDE has two main processes</a>: frontend (Java-application based on the IntelliJ platform) and backend (.NET-application based on ReSharper). Since the first release of Rider, we’ve used Mono as the backend runtime on Linux and macOS. A few years ago, we decided to migrate to .NET Core. After resolving hundreds of technical challenges, <strong>we are finally ready to present the .NET Core edition of Rider!</strong></p>
<p>In this blog post, we want to share the results of some benchmarks that compare the Mono-powered and the .NET Core-powered editions of Rider. You may find this interesting if you are also thinking about migrating to .NET Core, or if you just want a high-level overview of the improvements to Rider in terms of performance and footprint, following the migration. (Spoiler: they’re huge!)</p>Introducing perfolizerhttps://aakinshin.net/posts/introducing-perfolizer/Wed, 04 Mar 2020 00:00:00 +0000https://aakinshin.net/posts/introducing-perfolizer/<p>Over the last 7 years, I’ve been maintaining <a href="https://github.com/dotnet/BenchmarkDotNet">BenchmarkDotNet</a>;
it’s a library that helps you to transform methods into benchmarks, track their performance, and share reproducible measurement experiments.
Today, BenchmarkDotNet is the most popular .NET library for benchmarking; it has been adopted by <a href="https://github.com/dotnet/BenchmarkDotNet#who-use-benchmarkdotnet">3500+</a> projects, including .NET Core.</p>
<p>While it has tons of features for benchmarking that allow getting reliable and accurate measurements,
it has a limited set of features for performance analysis.
And it’s a problem for many developers.
Lately, I have started to get a lot of emails in which people ask me
“OK, I benchmarked my application and got tons of numbers. What should I do next?”
It’s an excellent question that requires special tools.
So, I decided to start another project that focuses specifically on performance analysis.</p>
<p>Meet <a href="https://github.com/AndreyAkinshin/perfolizer">perfolizer</a> — a toolkit for performance analysis!
The source code is available on <a href="https://github.com/AndreyAkinshin/perfolizer">GitHub</a> under the MIT license.</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/introducing-perfolizer/img/perfolizer.svg" target="_blank" alt="perfolizer">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/introducing-perfolizer/img/perfolizer.svg" />
</a>
</div>
</div>
<br />Distribution comparison via the shift and ratio functionshttps://aakinshin.net/posts/shift-and-ratio-functions/Fri, 11 Oct 2019 00:00:00 +0000https://aakinshin.net/posts/shift-and-ratio-functions/<p>When we compare two distributions, it’s not always enough to detect a statistically significant difference between them.
In many cases, we also want to evaluate the magnitude of this difference.
Let’s look at the following image:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare1-light.png" target="_blank" class="imgldlink" alt="compare1">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare1-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare1-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare1-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>On the left side, we can see a timeline plot with 2000 points
(at the middle of this plot, the distribution was significantly changed).
On the right side, you can see density plots for the left and the right side of
the timeline plot (before and after the change).
It’s a pretty simple case: the difference between distributions can be expressed via the
difference between mean values.</p>
<p>Now let’s look at a more tricky case:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare2-light.png" target="_blank" class="imgldlink" alt="compare2">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare2-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare2-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare2-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Here we have a bimodal distribution; after the change, the left mode “moved right.”
Now it’s much harder to evaluate the difference between distributions
because the mean and the median values have barely changed:
the right mode has a bigger impact on these metrics than the left mode.</p>
<p>And here is an even trickier case:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare3-light.png" target="_blank" class="imgldlink" alt="compare3">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare3-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare3-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/shift-and-ratio-functions/img/compare3-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Here we also have a bimodal distribution; after the change, both modes moved:
the left mode “moved right” and the right mode “moved left.”
How should we describe the difference between these distributions now?</p>Normality is a mythhttps://aakinshin.net/posts/normality/Wed, 09 Oct 2019 00:00:00 +0000https://aakinshin.net/posts/normality/<p>In many statistical papers, you can find the following phrase: “assuming that we have a normal distribution.”
You have probably seen plots of the normal distribution density function in statistics textbooks;
it looks like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/normality/img/normal-light.png" target="_blank" class="imgldlink" alt="normal">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/normality/img/normal-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/normality/img/normal-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/normality/img/normal-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>The normal distribution is a pretty user-friendly mental model when we are trying to interpret the statistical metrics
like mean and standard deviation.
However, it may also be an insidious and misleading model when your distribution is not normal.
There is a great sentence in the <a href="https://doi.org/10.1093/biomet/34.3-4.209">“Testing for normality”</a> paper by R.C. Geary, 1947 (the quote was found <a href="https://garstats.wordpress.com/2019/06/17/myth/">here</a>):</p>
<blockquote>
<p>Normality is a myth; there never was, and never will be, a normal distribution.</p>
</blockquote>
<p>I 100% agree with this statement.
At least, if you are working with performance distributions
(that are based on the multiple iterations of your benchmarks that measure the performance metrics of your applications),
you should forget about normality.
That’s what a typical performance distribution looks like
(I built the below picture based on a real benchmark that measures the load time of assemblies
when we open the <a href="https://github.com/OrchardCMS/Orchard">Orchard</a> solution in <a href="https://www.jetbrains.com/rider/">Rider</a> on Linux):</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/normality/img/performance-light.png" target="_blank" class="imgldlink" alt="performance">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/normality/img/performance-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/normality/img/performance-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/normality/img/performance-light.png">
</picture>
</a>
</div>
</div>
<br />Implementation of efficient algorithm for changepoint detection: ED-PELThttps://aakinshin.net/posts/edpelt/Mon, 07 Oct 2019 00:00:00 +0000https://aakinshin.net/posts/edpelt/<p><a href="https://en.wikipedia.org/wiki/Change_detection">Changepoint detection</a> is an important task that has a lot of applications.
For example, I use it to detect changes in the <a href="https://www.jetbrains.com/rider/">Rider</a> performance test suite.
It’s very important to detect not only performance degradations, but any kind of performance change
(e.g., the variance may increase, or a unimodal distribution may be split into several modes).
You can see examples of such changes in the following picture (we change the color when a changepoint is detected):</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/edpelt/img/edpelt-light.png" target="_blank" class="imgldlink" alt="edpelt">
<picture>
<source
theme='dark'
srcset="https://aakinshin.net/posts/edpelt/img/edpelt-dark.png"
media="(prefers-color-scheme: dark)">
<source
theme='light'
srcset="https://aakinshin.net/posts/edpelt/img/edpelt-light.png"
media="(prefers-color-scheme: light), (prefers-color-scheme: no-preference)">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/edpelt/img/edpelt-light.png">
</picture>
</a>
</div>
</div>
<br />
<p>Unfortunately, it’s pretty hard to write a reliable and fast algorithm for changepoint detection.
Recently, I found a cool paper (<a href="https://link.springer.com/article/10.1007/s11222-016-9687-5">Haynes, K., Fearnhead, P. & Eckley, I.A. “A computationally efficient nonparametric approach for changepoint detection,” Stat Comput (2017) 27: 1293</a>) that describes the ED-PELT algorithm.
It has <code>O(N*log(N))</code> complexity and pretty good detection accuracy.
The reference implementation can be used via the <a href="https://cran.r-project.org/web/packages/changepoint.np/index.html">changepoint.np</a> R package.
However, I can’t use <a href="https://www.r-project.org/">R</a> on our build server, so I decided to write my own C# implementation.</p>A story about slow NuGet package browsinghttps://aakinshin.net/posts/nuget-package-browsing/Tue, 08 May 2018 00:00:00 +0000https://aakinshin.net/posts/nuget-package-browsing/<p>In <a href="https://www.jetbrains.com/rider/">Rider</a>, we have integration tests which interact with <a href="https://api.nuget.org/">api.nuget.org</a>.
Also, we have an internal service which monitors the performance of these tests.
Two days ago, I noticed that some of these tests sometimes run for too long.
For example, <code>nuget_NuGetTest_shouldUpgradeVersionForDotNetCore</code> usually takes around <code>10 sec</code>.
However, in some cases, it takes around <code>110 sec</code>, <code>210 sec</code>, or <code>310 sec</code>:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/nuget-package-browsing/img/perf-chart.png" target="_blank" alt="perf-chart">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/nuget-package-browsing/img/perf-chart.png" />
</a>
</div>
</div>
<br />
<p>It looks very suspicious and increases the whole test suite duration.
Also, our dashboard with performance degradations contains only such tests
and some real degradations (which are introduced by the changes in our codebase) can go unnoticed.
So, my colleagues and I decided to investigate it.</p>Cross-runtime .NET disassembly with BenchmarkDotNethttps://aakinshin.net/posts/dotnet-crossruntime-disasm/Tue, 10 Apr 2018 00:00:00 +0000https://aakinshin.net/posts/dotnet-crossruntime-disasm/<p><a href="https://github.com/dotnet/BenchmarkDotNet">BenchmarkDotNet</a> is a cool tool for benchmarking.
It has a lot of useful features that help you with performance investigations.
However, you can use these features even if you are not actually going to benchmark something.
One of these features is <code>DisassemblyDiagnoser</code>.
It shows you a disassembly listing of your code for all required runtimes.
In this post, I will show you how to get disassembly listing for .NET Framework, .NET Core, and Mono with one click!
You can do it with a very small code snippet like this:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="na">[DryCoreJob, DryMonoJob, DryClrJob(Platform.X86)]</span>
<span class="na">[DisassemblyDiagnoser]</span>
<span class="k">public</span> <span class="k">class</span> <span class="nc">IntroDisasm</span>
<span class="p">{</span>
<span class="na"> [Benchmark]</span>
<span class="k">public</span> <span class="kt">double</span> <span class="n">Sum</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">double</span> <span class="n">res</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="m">64</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">res</span> <span class="p">+=</span> <span class="n">i</span><span class="p">;</span>
<span class="k">return</span> <span class="n">res</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>BenchmarkDotNet v0.10.14https://aakinshin.net/posts/bdn-v0_10_14/Mon, 09 Apr 2018 00:00:00 +0000https://aakinshin.net/posts/bdn-v0_10_14/<p>BenchmarkDotNet v0.10.14 has been released! This release includes:</p>
<ul>
<li><strong>Per-method parameterization</strong> (<a href="http://benchmarkdotnet.org/Advanced/Arguments.htm">Read more</a>)</li>
<li><strong>Console histograms and multimodal distribution detection</strong> (<a href="https://aakinshin.net/posts/dotnet-crossruntime-disasm/">Read more</a>)</li>
<li><strong>Many improvements for Mono disassembly support on Windows</strong> (A blog post is coming soon)</li>
<li><strong>Many bugfixes</strong></li>
</ul>
<p>In the <a href="https://github.com/dotnet/BenchmarkDotNet/issues?q=milestone:v0.10.14">v0.10.14</a> scope,
8 issues were resolved and 11 pull requests were merged.
This release includes 47 commits by 8 contributors.</p>BenchmarkDotNet v0.10.13https://aakinshin.net/posts/bdn-v0_10_13/Fri, 02 Mar 2018 00:00:00 +0000https://aakinshin.net/posts/bdn-v0_10_13/<p>BenchmarkDotNet v0.10.13 has been released! This release includes:</p>
<ul>
<li><strong>Mono Support for DisassemblyDiagnoser:</strong>
Now you can easily get an assembly listing not only on .NET Framework/.NET Core, but also on Mono.
It works on Linux, macOS, and Windows (Windows requires installed cygwin with <code>obj</code> and <code>as</code>).
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/541">#541</a>)</li>
<li><strong>Support ANY CoreFX and CoreCLR builds:</strong>
BenchmarkDotNet allows the users to run their benchmarks against ANY CoreCLR and CoreFX builds.
You can compare your local build vs MyGet feed or Debug vs Release or one version vs another.
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/651">#651</a>)</li>
<li><strong>C# 7.2 support</strong>
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/643">#643</a>)</li>
<li><strong>.NET 4.7.1 support</strong>
(See <a href="https://github.com/dotnet/BenchmarkDotNet/commit/28aa946a9a277b6c2b1166af0397134b02bedf2d">28aa94</a>)</li>
<li><strong>Support Visual Basic project files (.vbproj) targeting .NET Core</strong>
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/626">#626</a>)</li>
<li><strong>DisassemblyDiagnoser now supports generic types</strong>
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/640">#640</a>)</li>
<li><strong>Now it’s possible to benchmark both Mono and .NET Core from the same app</strong>
(See <a href="https://github.com/dotnet/BenchmarkDotNet/issues/653">#653</a>)</li>
<li><strong>Many bug fixes</strong>
(See details below)</li>
</ul>Analyzing distribution of Mono GC collectionshttps://aakinshin.net/posts/mono-gc-collects/Tue, 20 Feb 2018 00:00:00 +0000https://aakinshin.net/posts/mono-gc-collects/<p>Sometimes I want to understand the GC performance impact on an application quickly.
I know that there are many powerful diagnostic tools and approaches,
but I’m a fan of the “right tool for the job” idea.
In simple cases, I prefer simple noninvasive approaches which provide a quick way
to get an overview of the current situation
(if everything is terrible, I always can switch to an advanced approach).
Today I want to share with you my favorite way to quickly get statistics
of GC pauses in Mono and generate nice plots like this:</p>
<div class="row">
<div class="mx-auto">
<a href="https://aakinshin.net/posts/mono-gc-collects/img/plot-64.png" target="_blank" alt="plot-64">
<img
class="mx-auto d-block img-fluid"
width='800'
src="https://aakinshin.net/posts/mono-gc-collects/img/plot-64.png" />
</a>
</div>
</div>
<br />BenchmarkDotNet v0.10.12https://aakinshin.net/posts/bdn-v0_10_12/Mon, 15 Jan 2018 00:00:00 +0000https://aakinshin.net/posts/bdn-v0_10_12/<p>BenchmarkDotNet v0.10.12 has been released! This release includes:</p>
<ul>
<li><strong>Improved DisassemblyDiagnoser:</strong>
BenchmarkDotNet contains an embedded disassembler so that it can print assembly code for all benchmarks;
it’s not easy, but the disassembler evolves in every release.</li>
<li><strong>Improved MemoryDiagnoser:</strong>
it has a better precision level, and it takes less time to evaluate memory allocations in a benchmark.</li>
<li><strong>New TailCallDiagnoser:</strong>
now you get notifications when JIT applies the tail call optimizations to your methods.</li>
<li><strong>Better environment info:</strong>
when you share performance results, it’s very important to share information about your environment.
The library generates the environment summary for you by default.
Now it contains information about the number of physical CPUs, physical cores, and logical cores.
If you run a benchmark on a virtual machine, you will get the name of the hypervisor
(e.g., Hyper-V, VMware, or VirtualBox).</li>
<li><strong>Better summary table:</strong>
one of the greatest features of BenchmarkDotNet is the summary table.
It shows all important information about results in a compact and understandable form.
Now it has better customization options: you can display relative performance of different environments
(e.g., compare .NET Framework and .NET Core) and group benchmarks by categories.</li>
<li><strong>New GC settings:</strong> now we support <code>NoAffinitize</code>, <code>HeapAffinitizeMask</code>, <code>HeapCount</code>.</li>
<li>Other minor improvements and bug fixes</li>
</ul>BenchmarkDotNet v0.10.10https://aakinshin.net/posts/bdn-v0_10_10/Fri, 03 Nov 2017 00:00:00 +0000https://aakinshin.net/posts/bdn-v0_10_10/<p>BenchmarkDotNet v0.10.10 has been released!
This release includes many new features like Disassembly Diagnoser, ParamsSources, .NET Core x86 support, Environment variables, and more!</p>Reflecting on performance testinghttps://aakinshin.net/posts/reflecting-on-performance-testing/Tue, 01 Aug 2017 00:00:00 +0000https://aakinshin.net/posts/reflecting-on-performance-testing/<p>Performance is an important feature for many projects.
Unfortunately, it’s an all too common situation when a developer accidentally spoils the performance by adding some new code.
After a series of such incidents, people often start to think about performance regression testing.</p>
<p>As developers, we write unit tests all the time.
These tests check that our business logic works as designed and that new features don’t break existing code.
It looks like a good idea to write some perf tests as well, which will verify that we don’t have any performance regressions.</p>
<p>Turns out this is harder than it sounds.
A lot of developers don’t write perf tests at all.
Some teams write perf tests, but almost all of them use their own infrastructure for analysis
(which is not a bad thing in general because it’s usually designed for specific projects and requirements).
There are a lot of books about test-driven development (TDD),
but there are no books about performance-driven development (PDD).
There are well-known libraries for unit-testing (like xUnit/NUnit/MSTest for .NET),
but there are almost no libraries for performance regression testing.
Yeah, of course, there are <em>some</em> libraries which you can use.
But there is a lack of <em>well-known, universally recognized</em> libraries, approaches, and tools.
Ask your colleagues about it: some of them will give you different answers, the rest of them will start Googling it.</p>
<p>There is no common understanding of what performance testing should look like.
This situation exists because it’s really hard to develop a solution which solves <em>all problems</em> for <em>all kinds of projects</em>.
However, it doesn’t mean that we shouldn’t try.
And we should try, we should share our experience and discuss best practices.</p>Measuring Performance Improvements in .NET Core with BenchmarkDotNet (Part 1)https://aakinshin.net/posts/stephen-toub-benchmarks-part1/Fri, 09 Jun 2017 00:00:00 +0000https://aakinshin.net/posts/stephen-toub-benchmarks-part1/<p>A few days ago <a href="https://github.com/stephentoub">Stephen Toub</a> published a great post
at the <a href="https://blogs.msdn.microsoft.com/dotnet/">Microsoft .NET Blog</a>:
<a href="https://blogs.msdn.microsoft.com/dotnet/2017/06/07/performance-improvements-in-net-core/">Performance Improvements in .NET Core</a>.
He showed some significant performance changes in .NET Core 2.0 Preview 1 (compared with .NET Framework 4.7).
The .NET Core uses RyuJIT for generating assembly code.
When I first tried RyuJIT (e.g.,
<a href="https://blogs.msdn.microsoft.com/dotnet/2014/02/27/ryujit-ctp2-getting-ready-for-prime-time/">CTP2</a>,
<a href="https://blogs.msdn.microsoft.com/clrcodegeneration/2014/10/30/ryujit-ctp5-getting-closer-to-shipping-and-with-better-simd-support/">CTP5</a>, 2014),
I wasn’t excited about this: the preview versions had some bugs, and it worked slowly on my applications.
However, the idea of a rethought and open-source JIT-compiler was a huge step forward and investment in the future.
RyuJIT has been developed very actively in recent years: not only by Microsoft but also with the help of the community.
I’m still not happy about the generated assembly code in some methods, but I have to admit that the RyuJIT (as a part of .NET Core) works pretty well today:
it shows a good performance level not only on artificial benchmarks but also on real user code.
Also, there are a lot of changes
not only in <a href="https://github.com/dotnet/coreclr">dotnet/coreclr</a> (the .NET Core runtime),
but also in <a href="https://github.com/dotnet/corefx">dotnet/corefx</a> (the .NET Core foundational libraries).
It’s very nice to watch how the community helps to optimize well-used classes which have not changed for years.</p>
<p>Now let’s talk about benchmarks.
For the demonstration, Stephen wrote a set of handwritten benchmarks.
A few people (in
<a href="https://blogs.msdn.microsoft.com/dotnet/2017/06/07/performance-improvements-in-net-core/#comments">comments</a> and on <a href="https://news.ycombinator.com/item?id=14507936">HackerNews</a>)
asked about <a href="https://github.com/dotnet/BenchmarkDotNet">BenchmarkDotNet</a> regarding these samples (as a better tool for performance measurements).
So, I decided to try all these benchmarks on BenchmarkDotNet.</p>
<p>In this post, we will discuss
how BenchmarkDotNet can help in such performance investigations,
which benchmarking approaches are better to use (and when),
and how we can improve these measurements.</p>BenchmarkDotNet v0.10.7https://aakinshin.net/posts/bdn-v0_10_7/Mon, 05 Jun 2017 00:00:00 +0000https://aakinshin.net/posts/bdn-v0_10_7/<p>BenchmarkDotNet v0.10.7 has been released.
In this post, I will briefly cover the following features:</p>
<ul>
<li>LINQPad support</li>
<li>Filters and categories</li>
<li>Updated Setup/Cleanup attributes</li>
<li>Better Value Types support</li>
<li>Building Sources on Linux</li>
</ul>65535 interfaces ought to be enough for anybodyhttps://aakinshin.net/posts/mono-and-65535interfaces/Tue, 14 Feb 2017 00:00:00 +0000https://aakinshin.net/posts/mono-and-65535interfaces/<p>It was a bright, sunny morning.
There were no signs of trouble.
I came to work, opened Slack, and received many messages from my coworkers about failed tests.</p>
<div class="mx-auto">
<img class="mx-auto d-block" width="800" src="https://aakinshin.net/img/posts/dotnet/mono-and-65535interfaces/front.png" />
</div>
<p>After a few hours of investigation, the situation became clear:</p>
<ul>
<li>I’m responsible for the unit tests subsystem in <a href="https://www.jetbrains.com/rider/">Rider</a>, and only tests from this subsystem were failing.</li>
<li>I didn’t commit anything to the subsystem for a week because I worked with a local branch.
Other developers also didn’t touch this code.</li>
<li>The unit tests subsystem is completely independent.
It’s hard to imagine a situation when only the corresponding tests would fail, thousands of other tests pass, and there are no changes in the source code.</li>
<li><code>git blame</code> helped to find the “bad commit”: it didn’t include anything suspicious, only a few additional classes in other subsystems.</li>
<li>Only tests on Linux and MacOS were red.
On Windows, everything was ok.</li>
<li>Stacktraces in failed tests were completely random.
We had a new stack trace in each test from different subsystems.
There was no connection between these stack traces, unit tests source code, and the changes in the “bad commit.”
There was no clue where we should look for a problem.</li>
</ul>
<p>So, what was special about this “bad commit”? Spoiler: after these changes, we sometimes have more than 65535 interface implementations at runtime.</p>A bug story about named mutex on Monohttps://aakinshin.net/posts/namedmutex-on-mono/Mon, 13 Feb 2017 00:00:00 +0000https://aakinshin.net/posts/namedmutex-on-mono/<p>When you write some multithreading magic on .NET,
you can use a cool synchronization primitive called <a href="https://msdn.microsoft.com/en-us/library/system.threading.mutex(v=vs.110).aspx">Mutex</a>:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">var</span> <span class="n">mutex</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Mutex</span><span class="p">(</span><span class="k">false</span><span class="p">,</span> <span class="s">"Global\\MyNamedMutex"</span><span class="p">);</span>
</code></pre></div><p>You also can make it <a href="https://msdn.microsoft.com/en-us/library/f55ddskf(v=vs.110).aspx">named</a> (and share the mutex between processes)
which works perfectly on Windows:</p>
<div class="mx-auto">
<img class="mx-auto d-block" width="600" src="https://aakinshin.net/img/posts/dotnet/namedmutex-on-mono/front.png" />
</div>
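<p>For illustration, a typical way to guard a critical section with a named mutex might look like the sketch below. It is only a minimal example under stated assumptions: the <code>NamedMutexDemo</code> class and the <code>TryRunExclusive</code> helper are hypothetical names that do not come from the original code, and the five-second timeout is an arbitrary choice.</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs">using System;
using System.Threading;

public static class NamedMutexDemo
{
    // Hypothetical helper (an assumption, not from the original code):
    // runs the action while holding the named mutex and returns false
    // if the mutex could not be acquired within the timeout.
    public static bool TryRunExclusive(string name, TimeSpan timeout, Action action)
    {
        using (var mutex = new Mutex(false, name))
        {
            bool acquired;
            try
            {
                acquired = mutex.WaitOne(timeout);
            }
            catch (AbandonedMutexException)
            {
                // A previous owner exited without releasing the mutex;
                // the current thread owns it now.
                acquired = true;
            }

            if (!acquired)
                return false;

            try
            {
                action();
                return true;
            }
            finally
            {
                // Must be called on the same thread that acquired the mutex.
                mutex.ReleaseMutex();
            }
        }
    }

    public static void Main()
    {
        bool ok = TryRunExclusive("Global\\MyNamedMutex", TimeSpan.FromSeconds(5),
            () =&gt; Console.WriteLine("Inside the critical section"));
        Console.WriteLine(ok ? "Done" : "Timed out");
    }
}
</code></pre></div>
<p>Starting two instances of such a program at the same time is a simple way to check whether the mutex is actually shared between processes on a given platform.</p>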
<p>However, today .NET is cross-platform, so this code should work on any operating system.
What will happen if you use a named mutex on Linux or macOS with the help of Mono or CoreCLR?
Is it possible to create some tricky bug based on this case?
Of course, it is.
Today I want to tell you a story about such a bug in <a href="https://www.jetbrains.com/rider/">Rider</a> which was a headache for several weeks.</p>InvalidDataException in Process.GetProcesseshttps://aakinshin.net/posts/invaliddataexception-in-getprocesses/Fri, 10 Feb 2017 00:00:00 +0000https://aakinshin.net/posts/invaliddataexception-in-getprocesses/<p>Consider the following program:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">static</span> <span class="k">void</span> <span class="n">Main</span><span class="p">(</span><span class="kt">string</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">try</span>
<span class="p">{</span>
<span class="n">Process</span><span class="p">.</span><span class="n">GetProcesses</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">catch</span> <span class="p">(</span><span class="n">Exception</span> <span class="n">e</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">e</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>It seems that all exceptions should be caught.
However, <em>sometimes</em>, I had the following exception on Linux with <code>dotnet cli-1.0.0-preview2</code>:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="err">$</span> <span class="n">dotnet</span> <span class="n">run</span>
<span class="n">System</span><span class="p">.</span><span class="n">IO</span><span class="p">.</span><span class="n">InvalidDataException</span><span class="p">:</span> <span class="n">Found</span> <span class="n">invalid</span> <span class="n">data</span> <span class="k">while</span> <span class="n">decoding</span><span class="p">.</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">IO</span><span class="p">.</span><span class="n">StringParser</span><span class="p">.</span><span class="n">ParseNextChar</span><span class="p">()</span>
<span class="n">at</span> <span class="n">Interop</span><span class="p">.</span><span class="n">procfs</span><span class="p">.</span><span class="n">TryParseStatFile</span><span class="p">(</span><span class="n">String</span> <span class="n">statFilePath</span><span class="p">,</span> <span class="n">ParsedStat</span><span class="p">&</span> <span class="n">result</span><span class="p">,</span> <span class="n">ReusableTextReader</span> <span class="n">reusableReader</span><span class="p">)</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">Diagnostics</span><span class="p">.</span><span class="n">ProcessManager</span><span class="p">.</span><span class="n">CreateProcessInfo</span><span class="p">(</span><span class="n">ParsedStat</span> <span class="n">procFsStat</span><span class="p">,</span> <span class="n">ReusableTextReader</span> <span class="n">reusableReader</span><span class="p">)</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">Diagnostics</span><span class="p">.</span><span class="n">ProcessManager</span><span class="p">.</span><span class="n">CreateProcessInfo</span><span class="p">(</span><span class="n">Int32</span> <span class="n">pid</span><span class="p">,</span> <span class="n">ReusableTextReader</span> <span class="n">reusableReader</span><span class="p">)</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">Diagnostics</span><span class="p">.</span><span class="n">ProcessManager</span><span class="p">.</span><span class="n">GetProcessInfos</span><span class="p">(</span><span class="n">String</span> <span class="n">machineName</span><span class="p">)</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">Diagnostics</span><span class="p">.</span><span class="n">Process</span><span class="p">.</span><span class="n">GetProcesses</span><span class="p">(</span><span class="n">String</span> <span class="n">machineName</span><span class="p">)</span>
<span class="n">at</span> <span class="n">System</span><span class="p">.</span><span class="n">Diagnostics</span><span class="p">.</span><span class="n">Process</span><span class="p">.</span><span class="n">GetProcesses</span><span class="p">()</span>
<span class="n">at</span> <span class="n">DotNetCoreConsoleApplication</span><span class="p">.</span><span class="n">Program</span><span class="p">.</span><span class="n">Main</span><span class="p">(</span><span class="n">String</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="k">in</span> <span class="p">/</span><span class="n">home</span><span class="p">/</span><span class="n">akinshin</span><span class="p">/</span><span class="n">Program</span><span class="p">.</span><span class="n">cs</span><span class="p">:</span><span class="n">line</span> <span class="m">12</span>
</code></pre></div><p>How is that possible?</p>Why is NuGet search in Rider so fast?https://aakinshin.net/posts/rider-nuget-search/Wed, 08 Feb 2017 00:00:00 +0000https://aakinshin.net/posts/rider-nuget-search/<p>I’m the guy who develops the NuGet manager in <a href="https://www.jetbrains.com/rider/">Rider</a>.
It’s not ready yet, there are some bugs here and there, but it already works pretty well.
The feature which I am most proud of is smart and fast search:</p>
<div class="mx-auto">
<img class="mx-auto d-block" width="400" src="https://aakinshin.net/img/posts/dotnet/rider-nuget-search/front.gif" />
</div>
<p>Today I want to share with you some technical details about how it was implemented.</p>NuGet2 and a DirectorySeparatorChar bughttps://aakinshin.net/posts/nuget2-and-directoryseparatorchar/Mon, 06 Feb 2017 00:00:00 +0000https://aakinshin.net/posts/nuget2-and-directoryseparatorchar/<p>In <a href="https://www.jetbrains.com/rider/">Rider</a>, we care a lot about performance.
I like to improve the application responsiveness and do interesting optimizations all the time.
Rider is already well-optimized, and it’s often hard to make significant performance improvements, so usually I do micro-optimizations which do not have a very big impact on the whole application.
However, sometimes it’s possible to improve the speed of a feature 100 times with just a few lines of code.</p>
<p>Rider is based on <a href="https://www.jetbrains.com/resharper/">ReSharper</a>, so we have a lot of cool features out of the box.
One of these features is <a href="https://www.jetbrains.com/help/resharper/2016.3/Code_Analysis__Solution-Wide_Analysis.html">Solution-Wide Analysis</a>
which lets you constantly keep track of issues in your solution.
Sometimes, solution-wide analysis takes a lot of time to run because there are many files which should be analyzed.
Of course, it works super fast on small projects.</p>
<p>Let’s talk about a performance bug (<a href="https://youtrack.jetbrains.com/issue/RIDER-3742">#RIDER-3742</a>) that we recently had.</p>
<ul>
<li><em>Repro:</em> Open Rider, create a new “ASP .NET MVC Application”, enable solution-wide analysis.</li>
<li><em>Expected:</em> The analysis should take 1 second.</li>
<li><em>Actual:</em> The analysis takes 1 second on Windows and <strong>2 minutes</strong> on Linux and macOS.</li>
</ul>Performance exercise: Divisionhttps://aakinshin.net/posts/perfex-div/Mon, 26 Dec 2016 00:00:00 +0000https://aakinshin.net/posts/perfex-div/<p>In the previous post, we <a href="https://aakinshin.net/en/blog/dotnet/perfex-min/">discussed</a> the performance space of the minimum function
which was implemented via a simple ternary operator and with the help of bit magic.
Now we continue to talk about performance and bit hacks.
In particular, we will divide a positive number by three:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">uint</span> <span class="n">Div3Simple</span><span class="p">(</span><span class="kt">uint</span> <span class="n">n</span><span class="p">)</span> <span class="p">=></span> <span class="n">n</span> <span class="p">/</span> <span class="m">3</span><span class="p">;</span>
<span class="kt">uint</span> <span class="n">Div3BitHacks</span><span class="p">(</span><span class="kt">uint</span> <span class="n">n</span><span class="p">)</span> <span class="p">=></span> <span class="p">(</span><span class="kt">uint</span><span class="p">)((</span><span class="n">n</span> <span class="p">*</span> <span class="p">(</span><span class="kt">ulong</span><span class="p">)</span><span class="m">0</span><span class="n">xAAAAAAAB</span><span class="p">)</span> <span class="p">>></span> <span class="m">33</span><span class="p">);</span>
</code></pre></div><p>As usual, it’s hard to say which method is faster in advance because the performance depends on the environment.
Here are some interesting results:</p>
<table class="table table-sm">
<tr> <th></th> <th>Simple</th> <th>BitHacks</th> </tr>
<tr> <th>LegacyJIT-x86</th> <td class="norm">≈8.3ns</td> <td class="fast">≈2.6ns</td> </tr>
<tr> <th>LegacyJIT-x64</th> <td class="fast">≈2.6ns</td> <td class="fast">≈1.7ns</td> </tr>
<tr> <th>RyuJIT-x64 </th> <td class="norm">≈6.9ns</td> <td class="fast">≈1.5ns</td> </tr>
<tr> <th>Mono4.6.2-x86</th> <td class="norm">≈8.5ns</td> <td class="slow">≈14.4ns</td> </tr>
<tr> <th>Mono4.6.2-x64</th> <td class="norm">≈8.3ns</td> <td class="fast">≈2.8ns</td> </tr>
</table>Performance exercise: Minimumhttps://aakinshin.net/posts/perfex-min/Tue, 20 Dec 2016 00:00:00 +0000https://aakinshin.net/posts/perfex-min/<p>Performance is tricky, especially if you are working with very fast operations. In today’s benchmarking exercise, we will try to measure the performance of two simple methods which calculate the minimum of two numbers. Sounds easy? OK, let’s do it; here are our guinea pigs for today:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">int</span> <span class="n">MinTernary</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span> <span class="p">=></span> <span class="n">x</span> <span class="p"><</span> <span class="n">y</span> <span class="p">?</span> <span class="n">x</span> <span class="p">:</span> <span class="n">y</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">MinBitHacks</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">,</span> <span class="kt">int</span> <span class="n">y</span><span class="p">)</span> <span class="p">=></span> <span class="n">x</span> <span class="p">&</span> <span class="p">((</span><span class="n">x</span> <span class="p">-</span> <span class="n">y</span><span class="p">)</span> <span class="p">>></span> <span class="m">31</span><span class="p">)</span> <span class="p">|</span> <span class="n">y</span> <span class="p">&</span> <span class="p">(~(</span><span class="n">x</span> <span class="p">-</span> <span class="n">y</span><span class="p">)</span> <span class="p">>></span> <span class="m">31</span><span class="p">);</span>
</code></pre></div><p>And here are some results:</p>
<table class="table table-sm">
<style type="text/css" scoped>
td.slow { color: #ff4444; }
td.fast { color: #00C851; }
</style>
<tr> <th></th> <th colspan="2">Random</th> <th colspan="2">Const</th> </tr>
<tr> <th></th> <th>Ternary</th> <th>BitHacks</th> <th>Ternary</th> <th>BitHacks</th> </tr>
<tr> <th>LegacyJIT-x86</th>
<td class="slow">≈643µs</td>
<td class="fast">≈227µs</td>
<td class="fast">≈160µs</td>
<td class="slow">≈226µs</td>
</tr>
<tr> <th>LegacyJIT-x64</th>
<td class="slow">≈450µs</td>
<td class="fast">≈123µs</td>
<td class="fast">≈68µs</td>
<td class="slow">≈123µs</td>
</tr>
<tr> <th>RyuJIT-x64</th>
<td class="slow">≈594µs</td>
<td class="fast">≈241µs</td>
<td class="fast">≈180µs</td>
<td class="slow">≈241µs</td>
</tr>
<tr> <th>Mono-x64</th>
<td class="fast">≈203µs</td>
<td class="slow">≈283µs</td>
<td class="fast">≈204µs</td>
<td class="slow">≈282µs</td>
</tr>
</table>
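<p>As a side note on correctness rather than speed: the bit-hack version relies on <code>x - y</code> not overflowing. Here is a minimal sketch that compares it against <code>Math.Min</code>; the class name <code>MinSketch</code> and the chosen inputs are illustrative assumptions, not part of the benchmark:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs">using System;

public static class MinSketch
{
    // Same expression as MinBitHacks above:
    // (x - y) >> 31 is all ones when x - y is negative and zero otherwise.
    public static int MinBitHacks(int x, int y) => x & ((x - y) >> 31) | y & (~(x - y) >> 31);

    public static void Main()
    {
        // Typical inputs: the bit hack agrees with Math.Min.
        Console.WriteLine(MinBitHacks(3, 5) == Math.Min(3, 5));
        Console.WriteLine(MinBitHacks(-7, 2) == Math.Min(-7, 2));

        // int.MinValue - 1 wraps around to int.MaxValue in the default unchecked context,
        // so the sign bit no longer reflects the comparison and the results differ.
        Console.WriteLine(MinBitHacks(int.MinValue, 1) == Math.Min(int.MinValue, 1));
    }
}
</code></pre></div><p>With the default unchecked arithmetic, the first two comparisons print <code>True</code> and the last one prints <code>False</code>. This caveat is separate from the timings in the table above.</p>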
<p>What’s going on here? Let’s discuss it in detail.</p>Stopwatch under the hoodhttps://aakinshin.net/posts/stopwatch/Fri, 09 Sep 2016 00:00:00 +0000https://aakinshin.net/posts/stopwatch/<p><strong>Update:</strong>
You can find an updated and significantly improved version of this post in my book <a href="https://aakinshin.net/prodotnetbenchmarking/">“Pro .NET Benchmarking”</a>.</p>
<p>In <a href="https://aakinshin.net/en/blog/dotnet/datetime/">the previous post</a>, we discussed <code>DateTime</code>.
This structure can be used in situations when you don’t need a good level of precision.
If you want to do high-precision time measurements, you need a better tool because <code>DateTime</code> has a low resolution and a high latency.
Also, time is tricky: you can create wonderful bugs if you don’t understand how it works (see <a href="http://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time">Falsehoods programmers believe about time</a> and <a href="http://infiniteundo.com/post/25509354022/more-falsehoods-programmers-believe-about-time">More falsehoods programmers believe about time</a>).</p>
<p>In this post, we will briefly talk about the <a href="https://msdn.microsoft.com/library/system.diagnostics.stopwatch.aspx">Stopwatch</a> class:</p>
<ul>
<li>Which kind of hardware timers could be a base for <code>Stopwatch</code></li>
<li>High precision timestamp API on Windows and Linux</li>
<li>Latency and Resolution of <code>Stopwatch</code> in different environments</li>
<li>Common pitfalls: which kind of problems could we get trying to measure small time intervals</li>
</ul>
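<p>For orientation, here is a minimal sketch of the public members these topics revolve around. It uses only <code>Stopwatch.IsHighResolution</code>, <code>Stopwatch.Frequency</code>, and <code>Stopwatch.GetTimestamp()</code>; the class name <code>StopwatchSketch</code> and the helper <code>TicksToNanoseconds</code> are illustrative assumptions:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs">using System;
using System.Diagnostics;

public static class StopwatchSketch
{
    // Stopwatch.Frequency is the number of ticks per second,
    // so this converts a tick count to nanoseconds.
    public static double TicksToNanoseconds(long ticks) => ticks * 1e9 / Stopwatch.Frequency;

    public static void Main()
    {
        Console.WriteLine("IsHighResolution: " + Stopwatch.IsHighResolution);
        Console.WriteLine("Frequency: " + Stopwatch.Frequency + " ticks per second");
        Console.WriteLine("Nominal tick length: " + TicksToNanoseconds(1) + " ns");

        // Two back-to-back timestamps: the difference is only a rough,
        // single-sample hint at the overhead of the call itself.
        long start = Stopwatch.GetTimestamp();
        long end = Stopwatch.GetTimestamp();
        Console.WriteLine("Back-to-back delta: " + TicksToNanoseconds(end - start) + " ns");
    }
}
</code></pre></div><p>The printed values depend on the hardware, the operating system, and the runtime, and a single delta like this should not be treated as a proper latency measurement.</p>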
<p>If you are not a .NET developer, you can also find a lot of useful information in this post: mainly we will discuss low-level details of high-resolution timestamping (probably your favorite language also uses the same API).
As usual, you can also find useful links for further reading.</p>DateTime under the hoodhttps://aakinshin.net/posts/datetime/Fri, 19 Aug 2016 00:00:00 +0000https://aakinshin.net/posts/datetime/<p><strong>Update:</strong>
You can find an updated and significantly improved version of this post in my book <a href="https://aakinshin.net/prodotnetbenchmarking/">“Pro .NET Benchmarking”</a>.</p>
<p><a href="https://msdn.microsoft.com/library/system.datetime.aspx">DateTime</a> is a widely used .NET type. A lot of developers use it all the time, but not all of them really know how it works. In this post, I discuss <a href="https://msdn.microsoft.com/library/system.datetime.utcnow.aspx">DateTime.UtcNow</a>: how it’s implemented, what the latency and the resolution of <code>DateTime</code> are on Windows and Linux, how the resolution can be changed, and how it can affect your application. This post is an overview, so you probably will not see super detailed explanations of some topics, but you will find a lot of useful links for further reading.</p>LegacyJIT-x86 and first method callhttps://aakinshin.net/posts/legacyjitx86-and-first-method-call/Mon, 04 Apr 2016 00:00:00 +0000https://aakinshin.net/posts/legacyjitx86-and-first-method-call/<p>Today I will tell you about one of my favorite benchmarks (this method doesn’t return a useful value; we need it only as an example):</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="na">[Benchmark]</span>
<span class="k">public</span> <span class="kt">string</span> <span class="n">Sum</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">double</span> <span class="n">a</span> <span class="p">=</span> <span class="m">1</span><span class="p">,</span> <span class="n">b</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">sw</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Stopwatch</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="m">10001</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">a</span> <span class="p">=</span> <span class="n">a</span> <span class="p">+</span> <span class="n">b</span><span class="p">;</span>
<span class="k">return</span> <span class="kt">string</span><span class="p">.</span><span class="n">Format</span><span class="p">(</span><span class="s">"{0}{1}"</span><span class="p">,</span> <span class="n">a</span><span class="p">,</span> <span class="n">sw</span><span class="p">.</span><span class="n">ElapsedMilliseconds</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>An interesting fact: if you call <code>Stopwatch.GetTimestamp()</code> before the first call of the <code>Sum</code> method, you make <code>Sum</code> several times faster (works only with LegacyJIT-x86).</p>Visual Studio and ProjectTypeGuids.cshttps://aakinshin.net/posts/projecttypeguids/Sat, 27 Feb 2016 00:00:00 +0000https://aakinshin.net/posts/projecttypeguids/<p>It’s a story about how I tried to open a project in Visual Studio for a few hours. The other day, I was going to do some work. I pulled the latest commits from a repo, opened Visual Studio, and prepared to start coding. However, one of the projects in my solution failed to open with a strange message:</p>
<div class="highlight"><pre class="chroma"><code class="language-txt" data-lang="txt">error : The operation could not be completed.
</code></pre></div><p>In the Solution Explorer, I had <em>“load failed”</em> as a project status and the following message instead of the file tree: <em>“The project requires user input. Reload the project for more information.”</em> Hmm, ok, I reloaded the project and got a few more errors:</p>
<div class="highlight"><pre class="chroma"><code class="language-txt" data-lang="txt">error : The operation could not be completed.
error : The operation could not be completed.
</code></pre></div>Blittable typeshttps://aakinshin.net/posts/blittable/Thu, 26 Nov 2015 00:00:00 +0000https://aakinshin.net/posts/blittable/<p>Challenge of the day: what will the following code display?</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="na">[StructLayout(LayoutKind.Explicit)]</span>
<span class="k">public</span> <span class="k">struct</span> <span class="nc">UInt128</span>
<span class="p">{</span>
<span class="na"> [FieldOffset(0)]</span>
<span class="k">public</span> <span class="kt">ulong</span> <span class="n">Value1</span><span class="p">;</span>
<span class="na"> [FieldOffset(8)]</span>
<span class="k">public</span> <span class="kt">ulong</span> <span class="n">Value2</span><span class="p">;</span>
<span class="p">}</span>
<span class="na">[StructLayout(LayoutKind.Sequential)]</span>
<span class="k">public</span> <span class="k">struct</span> <span class="nc">MyStruct</span>
<span class="p">{</span>
<span class="k">public</span> <span class="n">UInt128</span> <span class="n">UInt128</span><span class="p">;</span>
<span class="k">public</span> <span class="kt">char</span> <span class="n">Char</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">class</span> <span class="nc">Program</span>
<span class="p">{</span>
<span class="k">public</span> <span class="k">static</span> <span class="k">unsafe</span> <span class="k">void</span> <span class="n">Main</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">myStruct</span> <span class="p">=</span> <span class="k">new</span> <span class="n">MyStruct</span><span class="p">();</span>
<span class="kt">var</span> <span class="n">baseAddress</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)&</span><span class="n">myStruct</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">uInt128Adress</span> <span class="p">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)&</span><span class="n">myStruct</span><span class="p">.</span><span class="n">UInt128</span><span class="p">;</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">uInt128Adress</span> <span class="p">-</span> <span class="n">baseAddress</span><span class="p">);</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">Marshal</span><span class="p">.</span><span class="n">OffsetOf</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">MyStruct</span><span class="p">),</span> <span class="s">"UInt128"</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>A hint: two zeros, or any other two identical values, are wrong answers in the general case. The following table shows the console output on different runtimes:</p>
<table>
<tr><th></th><th>MS.NET-x86</th><th>MS.NET-x64</th><th>Mono</th></tr>
<tr><td>uInt128Adress - baseAddress </td><td>4</td><td>8</td><td>0</td></tr>
<tr><td>Marshal.OffsetOf(typeof(MyStruct), "UInt128")</td><td>0</td><td>0</td><td>0</td></tr>
</table>
<p>If you want to know why this happens, you should probably learn a bit about blittable types.</p>RyuJIT RC and constant foldinghttps://aakinshin.net/posts/ryujit-rc-and-constant-folding/Tue, 12 May 2015 00:00:00 +0000https://aakinshin.net/posts/ryujit-rc-and-constant-folding/<p><strong>Update:</strong> The results below are valid for the release version of RyuJIT in .NET Framework 4.6 without updates.</p>
<p>The challenge of the day: which method is faster?</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="kt">double</span> <span class="n">Sqrt13</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">2</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">4</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span> <span class="p">+</span>
<span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">6</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">7</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">8</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">9</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">10</span><span class="p">)</span> <span class="p">+</span>
<span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">11</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">12</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">13</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">public</span> <span class="kt">double</span> <span class="n">Sqrt14</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">1</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">2</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">3</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">4</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span> <span class="p">+</span>
<span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">6</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">7</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">8</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">9</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">10</span><span class="p">)</span> <span class="p">+</span>
<span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">11</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">12</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">13</span><span class="p">)</span> <span class="p">+</span> <span class="n">Math</span><span class="p">.</span><span class="n">Sqrt</span><span class="p">(</span><span class="m">14</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>I measured the performance of both methods with the help of <a href="https://github.com/AndreyAkinshin/BenchmarkDotNet">BenchmarkDotNet</a> for RyuJIT RC (a part of .NET Framework 4.6 RC) and got the following results:</p>
<div class="highlight"><pre class="chroma"><code class="language-md" data-lang="md">// BenchmarkDotNet=v0.7.4.0
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4702MQ CPU ＠ 2.20GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.0, Arch=64-bit [RyuJIT]
Common: Type=Math_DoubleSqrtAvx Mode=Throughput Platform=X64 Jit=RyuJit .NET=Current
Method | AvrTime | StdDev | op/s |
------- |--------- |---------- |------------- |
Sqrt13 | 55.40 ns | 0.571 ns | 18050993.06 |
Sqrt14 | 1.43 ns | 0.0224 ns | 697125029.18 |
</code></pre></div><p>How so? If I add one more <code>Math.Sqrt</code> to the expression, the method starts working about 40 times faster! Let’s examine the situation.</p>Unrolling of small loops in different JIT versionshttps://aakinshin.net/posts/unrolling-of-small-loops-in-different-jit-versions/Mon, 02 Mar 2015 00:00:00 +0000https://aakinshin.net/posts/unrolling-of-small-loops-in-different-jit-versions/<p>Challenge of the day: what will the following code display?</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">struct</span> <span class="nc">Point</span>
<span class="p">{</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">X</span><span class="p">;</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">Y</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">void</span> <span class="n">Print</span><span class="p">(</span><span class="n">Point</span> <span class="n">p</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">X</span> <span class="p">+</span> <span class="s">" "</span> <span class="p">+</span> <span class="n">p</span><span class="p">.</span><span class="n">Y</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">void</span> <span class="n">Main</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">p</span> <span class="p">=</span> <span class="k">new</span> <span class="n">Point</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">X</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">p</span><span class="p">.</span><span class="n">X</span> <span class="p"><</span> <span class="m">2</span><span class="p">;</span> <span class="n">p</span><span class="p">.</span><span class="n">X</span><span class="p">++)</span>
<span class="n">Print</span><span class="p">(</span><span class="n">p</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>The right answer: it depends. There is a bug in CLR2 JIT-x86 which spoils this wonderful program. This story is about an optimization called unrolling of small loops. It is a very interesting topic; let’s discuss it in detail.</p>RyuJIT CTP5 and loop unrollinghttps://aakinshin.net/posts/ryujit-ctp5-and-loop-unrolling/Sun, 01 Mar 2015 00:00:00 +0000https://aakinshin.net/posts/ryujit-ctp5-and-loop-unrolling/<p>RyuJIT will be available soon. It is a next-generation JIT compiler for .NET applications. Microsoft likes to tell us about the benefits of SIMD support and reduced JIT compilation time. But what about the basic code optimizations that a compiler usually applies? Today we talk about the loop unrolling (unwinding) optimization. In general, with this type of optimization, the code</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="m">1024</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
</code></pre></div><p>transforms to</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="m">1024</span><span class="p">;</span> <span class="n">i</span> <span class="p">+=</span> <span class="m">4</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">i</span> <span class="p">+</span> <span class="m">1</span><span class="p">);</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">i</span> <span class="p">+</span> <span class="m">2</span><span class="p">);</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">i</span> <span class="p">+</span> <span class="m">3</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>Such an approach can significantly increase the performance of your code. So, what about loop unrolling in .NET?</p>JIT version determining in runtimehttps://aakinshin.net/posts/jit-version-determining-in-runtime/Sat, 28 Feb 2015 00:00:00 +0000https://aakinshin.net/posts/jit-version-determining-in-runtime/<p>Sometimes I want to know which JIT compiler version is used in my little C# experiments. Clearly, the version can be determined in advance based on the environment. However, sometimes I want to know it at runtime to execute code specific to the current JIT compiler. More formally, I want to get a value of the following enum:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">enum</span> <span class="n">JitVersion</span>
<span class="p">{</span>
<span class="n">Mono</span><span class="p">,</span> <span class="n">MsX86</span><span class="p">,</span> <span class="n">MsX64</span><span class="p">,</span> <span class="n">RyuJit</span>
<span class="p">}</span>
</code></pre></div><p>It is easy to detect Mono by the existence of the <code>Mono.Runtime</code> class. Otherwise, we can assume that we are working with a Microsoft JIT implementation. It is easy to detect JIT-x86 with the help of <code>IntPtr.Size == 4</code>. The challenge is to distinguish JIT-x64 from RyuJIT. Next, I will show how you can do it with the help of the bug from my <a href="http://aakinshin.net/en/blog/dotnet/subexpression-elimination-bug-in-jit-x64/">previous post</a>.</p>A bug story about JIT-x64https://aakinshin.net/posts/subexpression-elimination-bug-in-jit-x64/Fri, 27 Feb 2015 00:00:00 +0000https://aakinshin.net/posts/subexpression-elimination-bug-in-jit-x64/<p>Can you say what the following code will display for <code>step=1</code>?</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">void</span> <span class="n">Foo</span><span class="p">(</span><span class="kt">int</span> <span class="n">step</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">i</span> <span class="p"><</span> <span class="n">step</span><span class="p">;</span> <span class="n">i</span><span class="p">++)</span>
<span class="p">{</span>
<span class="n">bar</span> <span class="p">=</span> <span class="n">i</span> <span class="p">+</span> <span class="m">10</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">j</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span> <span class="n">j</span> <span class="p"><</span> <span class="m">2</span> <span class="p">*</span> <span class="n">step</span><span class="p">;</span> <span class="n">j</span> <span class="p">+=</span> <span class="n">step</span><span class="p">)</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">j</span> <span class="p">+</span> <span class="m">10</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>If you have specific numbers in mind, you are wrong. The right answer: it depends. As the post title suggests, the program can behave strangely on x64.</p>A story about JIT-x86 inlining and starghttps://aakinshin.net/posts/inlining-and-starg/Thu, 26 Feb 2015 00:00:00 +0000https://aakinshin.net/posts/inlining-and-starg/<p>Sometimes you can learn a lot by reading the .NET source code. Let’s open the source code of a <code>Decimal</code> constructor from .NET Reference Source (<a href="http://referencesource.microsoft.com/#mscorlib/system/decimal.cs,158">mscorlib/system/decimal.cs,158</a>):</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="c1">// Constructs a Decimal from an integer value.
</span><span class="c1">//
</span><span class="c1"></span><span class="k">public</span> <span class="n">Decimal</span><span class="p">(</span><span class="kt">int</span> <span class="k">value</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// JIT today can't inline methods that contains "starg" opcode.
</span><span class="c1"></span> <span class="c1">// For more details, see DevDiv Bugs 81184: x86 JIT CQ: Removing the inline striction of "starg".
</span><span class="c1"></span> <span class="kt">int</span> <span class="n">value_copy</span> <span class="p">=</span> <span class="k">value</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">value_copy</span> <span class="p">>=</span> <span class="m">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">flags</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span> <span class="p">{</span>
<span class="n">flags</span> <span class="p">=</span> <span class="n">SignMask</span><span class="p">;</span>
<span class="n">value_copy</span> <span class="p">=</span> <span class="p">-</span><span class="n">value_copy</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">lo</span> <span class="p">=</span> <span class="n">value_copy</span><span class="p">;</span>
<span class="n">mid</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="n">hi</span> <span class="p">=</span> <span class="m">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div><p>The comment states that JIT-x86 can’t apply the inlining optimization to a method that contains the <a href="https://msdn.microsoft.com/library/system.reflection.emit.opcodes.starg.aspx">starg</a> IL opcode. Curious, isn’t it?</p>About UTF-8 conversions in Monohttps://aakinshin.net/posts/mono-utf8-conversions/Mon, 10 Nov 2014 00:00:00 +0000https://aakinshin.net/posts/mono-utf8-conversions/<p>This post is a logical continuation of Jon Skeet’s blog post <a href="http://codeblog.jonskeet.uk/2014/11/07/when-is-a-string-not-a-string">“When is a string not a string?”</a>. Jon showed very interesting things about the behavior of ill-formed Unicode strings in .NET. I wondered how similar examples would work on Mono, and I got very interesting results.</p>
<h3 id="experiment-1-compilation">Experiment 1: Compilation</h3>
<p>Let’s take Jon’s code with a small modification: we will just add a null check for <code>text</code> in <code>DumpString</code>:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">using</span> <span class="nn">System</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">System.ComponentModel</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">System.Text</span><span class="p">;</span>
<span class="k">using</span> <span class="nn">System.Linq</span><span class="p">;</span>
<span class="na">[Description(Value)]</span>
<span class="k">class</span> <span class="nc">Test</span>
<span class="p">{</span>
<span class="k">const</span> <span class="kt">string</span> <span class="n">Value</span> <span class="p">=</span> <span class="s">"X\ud800Y"</span><span class="p">;</span>
<span class="k">static</span> <span class="k">void</span> <span class="n">Main</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">description</span> <span class="p">=</span> <span class="p">(</span><span class="n">DescriptionAttribute</span><span class="p">)</span><span class="k">typeof</span><span class="p">(</span><span class="n">Test</span><span class="p">).</span>
<span class="n">GetCustomAttributes</span><span class="p">(</span><span class="k">typeof</span><span class="p">(</span><span class="n">DescriptionAttribute</span><span class="p">),</span> <span class="k">true</span><span class="p">)[</span><span class="m">0</span><span class="p">];</span>
<span class="n">DumpString</span><span class="p">(</span><span class="s">"Attribute"</span><span class="p">,</span> <span class="n">description</span><span class="p">.</span><span class="n">Description</span><span class="p">);</span>
<span class="n">DumpString</span><span class="p">(</span><span class="s">"Constant"</span><span class="p">,</span> <span class="n">Value</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">static</span> <span class="k">void</span> <span class="n">DumpString</span><span class="p">(</span><span class="kt">string</span> <span class="n">name</span><span class="p">,</span> <span class="kt">string</span> <span class="n">text</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">Console</span><span class="p">.</span><span class="n">Write</span><span class="p">(</span><span class="s">"{0}: "</span><span class="p">,</span> <span class="n">name</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">text</span> <span class="p">!=</span> <span class="k">null</span><span class="p">)</span>
<span class="p">{</span>
<span class="kt">var</span> <span class="n">utf16</span> <span class="p">=</span> <span class="n">text</span><span class="p">.</span><span class="n">Select</span><span class="p">(</span><span class="n">c</span> <span class="p">=></span> <span class="p">((</span><span class="kt">uint</span><span class="p">)</span> <span class="n">c</span><span class="p">).</span><span class="n">ToString</span><span class="p">(</span><span class="s">"x4"</span><span class="p">));</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="kt">string</span><span class="p">.</span><span class="n">Join</span><span class="p">(</span><span class="s">" "</span><span class="p">,</span> <span class="n">utf16</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="s">"null"</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>Happy Monday!https://aakinshin.net/posts/happy-monday/Mon, 11 Aug 2014 00:00:00 +0000https://aakinshin.net/posts/happy-monday/<p>Today I’ll tell you a story about a tricky bug. It is tricky because it doesn’t allow me to debug my application on Mondays. I’m serious: the debug mode doesn’t work every Monday. Furthermore, the bug literally tells me: “Happy Monday!”.</p>
<p>So, the story. It was a wonderful Sunday evening, with no signs of trouble. We planned to release a new version of our software (a minor one, but it included some useful features). The clock struck midnight. Suddenly, it occurred to me that we had a minor bug that should be fixed. It required a few lines of code and 10 minutes of work, so I decided to write the needed logic before going to sleep. I opened Visual Studio, launched the build, and waited. But something went wrong, because I got the following error:</p>
<div class="highlight"><pre class="chroma"><code class="language-txt" data-lang="txt">Error connecting to the pipe server.
</code></pre></div><p>Hmm. It is a strange error.</p>To Refactor Or Not To Refactor?https://aakinshin.net/posts/refactoring/Sat, 19 Jul 2014 00:00:00 +0000https://aakinshin.net/posts/refactoring/<p>I like refactoring. No, I love refactoring. No, not even like this. I awfully love refactoring.</p>
<p>I hate bad code and bad architecture. I feel uneasy when I design a new feature and a nearby class is an absolute mess. I just can’t look at sad-looking variables. Sometimes before falling asleep I close my eyes and imagine what could be improved in the project. Sometimes I wake up at 3:00AM and go to my computer to improve something. I want to have not just code, but a masterpiece that is pleasant to look at and pleasant to work with at any stage of the project.</p>
<p>If you share my feelings even a little, we have something to talk about. The thing is that, over time, something inside me began to hint that it’s a bad idea to refactor all code, everywhere and all the time. Don’t get me wrong – code should be good (even better when it’s ideal), but in real life it’s not reasonable to improve code the instant you notice a flaw. I formed some rules about when refactoring is timely. If I am itching to improve something, I look at these rules and think “Is this the moment when I need to refactor the code?” So, let’s talk about when refactoring is necessary and when it’s inappropriate.</p>Strange behavior of FindElementsInHostCoordinates in WinRThttps://aakinshin.net/posts/findelementsinhostcoordinates/Tue, 29 Apr 2014 00:00:00 +0000https://aakinshin.net/posts/findelementsinhostcoordinates/<p>Silverlight features a splendid method: <a href="http://msdn.microsoft.com/en-us/library/system.windows.media.visualtreehelper.findelementsinhostcoordinates(v=vs.95).aspx">VisualTreeHelper.FindElementsInHostCoordinates</a>. It performs a <code>HitTest</code>: given a point or rectangle, it finds all objects in the visual subtree that intersect that point or rectangle. Formally, the same method <a href="http://msdn.microsoft.com/en-us/library/windows/apps/windows.ui.xaml.media.visualtreehelper.findelementsinhostcoordinates.aspx">VisualTreeHelper.FindElementsInHostCoordinates</a> is available in WinRT. The method looks the same, but there is a subtle nuance: it works differently in different versions of the platform. So, let’s see what’s going on.</p>About System.Drawing.Color and operator ==https://aakinshin.net/posts/system-drawing-color-equals/Fri, 21 Feb 2014 00:00:00 +0000https://aakinshin.net/posts/system-drawing-color-equals/<p>The <code>==</code> operator, which allows easy comparison of your objects, is overloaded for many standard structures in .NET. 
Unfortunately, not every developer knows what is actually compared when using this wonderful operator. This brief blog post shows the comparison logic using <code>System.Drawing.Color</code> as an example. What do you think the following code will print?</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">var</span> <span class="n">redName</span> <span class="p">=</span> <span class="n">Color</span><span class="p">.</span><span class="n">Red</span><span class="p">;</span>
<span class="kt">var</span> <span class="n">redArgb</span> <span class="p">=</span> <span class="n">Color</span><span class="p">.</span><span class="n">FromArgb</span><span class="p">(</span><span class="m">255</span><span class="p">,</span> <span class="m">255</span><span class="p">,</span> <span class="m">0</span><span class="p">,</span> <span class="m">0</span><span class="p">);</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">redName</span> <span class="p">==</span> <span class="n">redArgb</span><span class="p">);</span>
</code></pre></div>Setting up build configuration in .NEThttps://aakinshin.net/posts/msbuild-configurations/Sat, 08 Feb 2014 00:00:00 +0000https://aakinshin.net/posts/msbuild-configurations/<p>When you create a new project in Visual Studio, you get two default build configurations: Debug and Release. That’s enough for most small projects, but sometimes you need to extend them with additional configurations. It’s fine if you need to add just a couple of new settings, but what if there are dozens of them? And what if your solution contains 20 projects that all need these configurations set up? In this case, it becomes quite difficult to manage and modify build parameters.</p>
<p>In this article, we will review a way to make this process simpler by shortening the description of the build configurations.</p>Jon Skeet's Quizhttps://aakinshin.net/posts/jon-skeet-quiz/Sun, 03 Nov 2013 00:00:00 +0000https://aakinshin.net/posts/jon-skeet-quiz/<p>Jon Skeet was once asked to give three questions to check how well you know C#. He asked the <a href="http://www.dotnetcurry.com/magazine/jon-skeet-quiz.aspx">following questions</a>:</p>
<ul>
<li><strong>Q1.</strong> <em>What constructor call can you write such that this prints True (at least on the Microsoft .NET implementation)?</em></li>
</ul>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">object</span> <span class="n">x</span> <span class="p">=</span> <span class="k">new</span> <span class="cm">/* fill in code here */</span><span class="p">;</span>
<span class="kt">object</span> <span class="n">y</span> <span class="p">=</span> <span class="k">new</span> <span class="cm">/* fill in code here */</span><span class="p">;</span>
<span class="n">Console</span><span class="p">.</span><span class="n">WriteLine</span><span class="p">(</span><span class="n">x</span> <span class="p">==</span> <span class="n">y</span><span class="p">);</span>
</code></pre></div><p><em>Note that it’s just a constructor call, and you can’t change the type of the variables.</em></p>
<ul>
<li><strong>Q2.</strong> <em>How can you make this code compile such that it calls three different method overloads?</em></li>
</ul>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">void</span> <span class="n">Foo</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">EvilMethod</span><span class="p"><</span><span class="kt">string</span><span class="p">>();</span>
<span class="n">EvilMethod</span><span class="p"><</span><span class="kt">int</span><span class="p">>();</span>
<span class="n">EvilMethod</span><span class="p"><</span><span class="kt">int?</span><span class="p">>();</span>
<span class="p">}</span>
</code></pre></div><ul>
<li><strong>Q3.</strong> <em>With a local variable (so no changing the variable value cunningly), how can you make this code fail on the second line?</em></li>
</ul>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="kt">string</span> <span class="n">text</span> <span class="p">=</span> <span class="n">x</span><span class="p">.</span><span class="n">ToString</span><span class="p">();</span> <span class="c1">// No exception
</span><span class="c1"></span><span class="n">Type</span> <span class="n">type</span> <span class="p">=</span> <span class="n">x</span><span class="p">.</span><span class="n">GetType</span><span class="p">();</span> <span class="c1">// Bang!
</span></code></pre></div><p>These questions seemed interesting to me, so I decided to discuss the solutions.</p>Perfect code and real projectshttps://aakinshin.net/posts/perfect-code-and-real-projects/Wed, 28 Aug 2013 00:00:00 +0000https://aakinshin.net/posts/perfect-code-and-real-projects/I’ve got a problem. I am a perfectionist. I like perfect code. It is not only the correct way to develop applications but also a sign of real proficiency. I enjoy reading a good listing no less than reading a good book. Developing the architecture of a big project is no simpler than designing the architecture of a big building. When the work is done well, the result is no less beautiful. I am sometimes fascinated by how elegantly the patterns are entwined in a perfect software system. I am delighted by the attention to detail when every method is so simple and understandable that it could be a classic sample of perfect code.
But, unfortunately, this splendor is ruined by stern reality and real projects. When it comes to a production project, users don’t care how beautiful your code is or how wonderful your architecture is; they care about having a properly working product. I still think that you should always strive to write good code, but without getting stuck on this idea. After reading various holy-war discussions about the correct approaches to writing code, I noticed a trend: everyone tries to apply the mentioned approaches not to programming in general, but to their personal development experience, to their own projects. Many developers don’t understand that a good practice is not an absolute rule that should be followed in 100% of scenarios. It’s just advice on what to do in most cases. You can find a dozen scenarios where the practice won’t work at all. But that doesn’t mean the approach is bad; it’s just being used in the wrong environment.
There is another problem: some developers are not as good as they think. I often see the following situation: such a developer picks up some idea from a big article about perfect code (without getting deep into the details), starts using it everywhere, and the code becomes even worse.To Add Comments or Not to Add?https://aakinshin.net/posts/comments/Wed, 28 Aug 2013 00:00:00 +0000https://aakinshin.net/posts/comments/<p><em>A really good comment is the one you managed to avoid. (c) Uncle Bob</em></p>
<p>Lately, I’ve been feeling really tired of heated discussions about whether it’s necessary to add comments to the code. As a rule, on one side there are self-confident juniors with the indisputable argument: “Why not comment it? It will be unreadable without the comments!” On the other side are experienced seniors, who understand that if it’s possible to do without comments, then “You’d better, damn it, do it that way!” Probably, many developers picked up their craving for comments as students, when professors made them comment every line of code “to make the student better understand it”. Real projects shouldn’t contain a lot of comments that only clutter the code. I’m not arguing for avoiding comments altogether, but if you manage to write code that doesn’t need comments, you can consider it a small victory. I would like to refer you to some good books that helped form my position. I like and respect these authors and completely share their opinion.</p>
<ul>
<li><a href="http://www.amazon.com/Code-Complete-Practical-Handbook-Construction/dp/0735619670">Steven C. McConnell, Code Complete</a></li>
<li><a href="http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882">Robert Martin, Clean Code: A Handbook of Agile Software Craftsmanship</a></li>
<li><a href="http://www.amazon.com/The-Readable-Code-Theory-Practice/dp/0596802293">Dustin Boswell, Trevor Foucher, The Art of Readable Code (Theory in Practice)</a></li>
</ul>Unexpected area to collect garbage in .NEThttps://aakinshin.net/posts/gc-native/Thu, 08 Aug 2013 00:00:00 +0000https://aakinshin.net/posts/gc-native/<p>The .NET framework provides an intelligent garbage collector that saves us the trouble of manual memory management. In 95% of cases, you can forget about memory and related issues. But the remaining 5% involve specific aspects such as unmanaged resources, very large objects, etc. So it’s better to know how garbage is collected. Otherwise, you may be in for surprises.</p>
<p>Do you think the GC can collect an object before its last method has completed? It turns out it can. For this to happen, the application has to run in release mode without a debugger attached. In this case, the JIT compiler performs optimizations that make this situation possible. Of course, the JIT compiler does this only when the remaining method body doesn’t contain references to the object or its fields. It may seem like a harmless optimization. But it can lead to problems if you work with unmanaged resources: the object can be collected before the operation on the unmanaged resource is finished. And most likely, this will result in an application crash.</p>Unobviousness in use of C# closureshttps://aakinshin.net/posts/closures/Wed, 07 Aug 2013 00:00:00 +0000https://aakinshin.net/posts/closures/<p>C# gives us the ability to use closures. This is a powerful tool that allows anonymous methods and lambda functions to capture unbound variables in their lexical scope. Many programmers in the .NET world like using closures very much, but only a few of them understand how they really work. Let’s start with a simple sample:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">void</span> <span class="n">Run</span><span class="p">()</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">e</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">x</span> <span class="p">=></span> <span class="n">x</span> <span class="p">+</span> <span class="n">e</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div><p>Nothing complicated happens here: we just captured a local variable <code>e</code> in a lambda that is passed to some <code>Foo</code> method. Let’s see how the compiler will expand such a construction.*</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">void</span> <span class="n">Run</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">DisplayClass</span> <span class="n">c</span> <span class="p">=</span> <span class="k">new</span> <span class="n">DisplayClass</span><span class="p">();</span>
<span class="n">c</span><span class="p">.</span><span class="n">e</span> <span class="p">=</span> <span class="m">1</span><span class="p">;</span>
<span class="n">Foo</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">Action</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">private</span> <span class="k">sealed</span> <span class="k">class</span> <span class="nc">DisplayClass</span>
<span class="p">{</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">e</span><span class="p">;</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">Action</span><span class="p">(</span><span class="kt">int</span> <span class="n">x</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">x</span> <span class="p">+</span> <span class="n">e</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div>Wrapping C# class for use in COMhttps://aakinshin.net/posts/wrap-cs-in-com/Mon, 03 Jun 2013 00:00:00 +0000https://aakinshin.net/posts/wrap-cs-in-com/<p>Suppose we have a C# class that does something useful, for example:</p>
<div class="highlight"><pre class="chroma"><code class="language-cs" data-lang="cs"><span class="k">public</span> <span class="k">class</span> <span class="nc">Calculator</span>
<span class="p">{</span>
<span class="k">public</span> <span class="kt">int</span> <span class="n">Sum</span><span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">a</span> <span class="p">+</span> <span class="n">b</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div><p>Let’s create a <a href="http://ru.wikipedia.org/wiki/Component_Object_Model">COM</a> interface for this class so that its functionality can be used from other environments. At the end, we will see how this class is used in the Delphi environment.</p>
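<p>As a rough illustration of what such a wrapper might look like, here is a minimal sketch that exposes <code>Calculator</code> through an explicit COM-visible interface. The interface name <code>ICalculator</code> and both GUIDs are assumptions made up for this example (generate your own GUIDs), and registering the assembly for COM, for instance with <code>regasm</code>, is a separate step that is not shown. This is not necessarily how the full post does it.</p>

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical interface name; the GUID is a placeholder, generate your own.
[ComVisible(true)]
[Guid("0A1B2C3D-4E5F-4A6B-8C7D-9E0F1A2B3C4D")]
[InterfaceType(ComInterfaceType.InterfaceIsDual)]
public interface ICalculator
{
    int Sum(int a, int b);
}

// ClassInterfaceType.None means COM clients see only ICalculator,
// not an auto-generated class interface. The GUID is a placeholder.
[ComVisible(true)]
[Guid("1B2C3D4E-5F6A-4B7C-8D9E-0F1A2B3C4D5E")]
[ClassInterface(ClassInterfaceType.None)]
public class Calculator : ICalculator
{
    public int Sum(int a, int b)
    {
        return a + b;
    }
}

public static class Program
{
    public static void Main()
    {
        // Managed callers can use the same interface that COM clients would see.
        ICalculator calculator = new Calculator();
        Console.WriteLine(calculator.Sum(2, 3)); // prints 5
    }
}
```

<p>Declaring the interface explicitly, instead of relying on an auto-generated class interface, keeps the COM contract stable when the class later gains new members.</p>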