Discrete performance distributions

Andrey Akinshin · 2021-06-15

When we collect software performance measurements, we get a bunch of time intervals. Typically, we tend to interpret time values as continuous values. However, the obtained values are actually discrete due to the limited resolution of our measurement tool. In simple cases, we can treat these discrete values as continuous and get meaningful results. Unfortunately, discretization may produce strange phenomena like pseudo-multimodality or zero dispersion. If we want to set up a reliable system that automatically analyzes such distributions, we should be aware of such problems so we could correctly handle them.

In this post, I want to share a few of discretization problems in real-life performance data sets (based on the Rider performance tests).

General discretization

The most simple way to get discretization phenomena is to limit the resolution of your performance measurements to milliseconds. It’s a perfectly legitimate approach when you don’t care about better precision. As a result, we can obtain a timeline plot like this:

And here is the corresponding histogram (binwidth = 1):

This histogram can be treated as an estimation of the probability mass function of the underlying discrete distribution. However, if we try to treat this distribution as a continuous one and build a density estimation, we can get a corrupted visualization. This problem can be resolved for different kinds of density estimations using jittering (I already blogged about how to do it for the kernel density estimation (KDE) and quantile-respectful density estimation (QRDE)).

Zero median and zero MAD

Here is another timeline plot based on real-life performance measurements:

And here is the corresponding histogram:

It’s hard to work with such a histogram because of the scale, so here is its raw data:

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
1996   92   45   23   20   21   14    5   11    5    8    8    3    3    9    5 
  16   17   18   19   20   21   22   23   25   26   27   31   33   34   35   36 
   3    5    1    1    3    5    4    1    4    1    1    1    1    1    1    2 
  37   41   46   49   56   62   63   65   71   72   73   85   91   94   95   97 
   1    2    1    2    1    1    2    1    1    1    1    2    1    1    1    1 
  98  100  102  103  107  109  114  117  119  124  125  126  132  136  138  140 
   1    1    1    1    1    1    1    1    1    2    2    1    1    1    1    2 
 143  146  147  148  152  153  158  160  161  162  163  164  165  166  167  168 
   1    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
 172  173  175  177  178  179  183  184  185  186  187  188  189  190  196  199 
   1    1    1    1    1    1    1    1    1    2    1    1    1    1    2    1 
 201  203  204  206  209  211  215  217  218  223  224  231  238  242  243  246 
   1    1    1    1    1    2    1    1    1    1    1    1    1    1    1    1 
 260  261  262  263  264  265  273  288  289  295  297  298  303  309  313  320 
   1    1    1    1    1    1    1    1    1    1    2    1    2    1    1    1 
 347  350 
   1    1

These numbers mean the follow: we observed 0ms 1996 times, 1ms 92 times, 2ms 45 times, and so on. And here are the same set of numbers scaled to percents:

    0     1     2     3     4     5     6     7     8     9    10    11    12 
82.68  3.81  1.86  0.95  0.83  0.87  0.58  0.21  0.46  0.21  0.33  0.33  0.12 
   13    14    15    16    17    18    19    20    21    22    23    25    26 
 0.12  0.37  0.21  0.12  0.21  0.04  0.04  0.12  0.21  0.17  0.04  0.17  0.04 
   27    31    33    34    35    36    37    41    46    49    56    62    63 
 0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.08  0.04  0.08  0.04  0.04  0.08 
   65    71    72    73    85    91    94    95    97    98   100   102   103 
 0.04  0.04  0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  107   109   114   117   119   124   125   126   132   136   138   140   143 
 0.04  0.04  0.04  0.04  0.04  0.08  0.08  0.04  0.04  0.04  0.04  0.08  0.04 
  146   147   148   152   153   158   160   161   162   163   164   165   166 
 0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  167   168   172   173   175   177   178   179   183   184   185   186   187 
 0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.08  0.04 
  188   189   190   196   199   201   203   204   206   209   211   215   217 
 0.04  0.04  0.04  0.08  0.04  0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.04 
  218   223   224   231   238   242   243   246   260   261   262   263   264 
 0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04 
  265   273   288   289   295   297   298   303   309   313   320   347   350 
 0.04  0.04  0.04  0.04  0.04  0.08  0.04  0.08  0.04  0.04  0.04  0.04  0.04

Thus, we observed 0ms in the 92.69% cases, 1ms in the 3.81% cases, 2ms in 1.86% cases, and so on. In this data set, both the median and the median absolute deviations are zero. It produces some complications if you use the median as the primary summary metric.

There are two main ways to resolve this problem. The first one is to work with higher quantiles instead of the median. In this case, we can describe the dispersion using the quantile absolute deviation (QAD) instead of the median absolute deviation (MAD). The second way is to build a model that describes the underlying distribution. In the above example, the observed data could be perfectly described using the discrete Weibull distribution. However, this approach is much more complicated in the general case and it requires decent skills in the extreme value theory.

Dirac delta function

And here is one more timeline plot:

All the observed values are equal, so the corresponding histogram contains a single bin:

This constant distribution can be described using the Dirac delta function. Non-zero constant distribution is not so typical for time measurements, but it’s a common pattern for advanced performance metrics like the number of garbage collections or the final memory footprint.

Conclusion

The real-life samples are not always easy to analyze. There is always a risk of getting a “corner case” distribution (e.g., with zero dispersion) that will completely break your “general case” equations. If you want to set up an automatic system that works with real-life data, it’s always a good idea to think about such cases in advance and implement corresponding checks.