Fence-based outlier detectors, Part 1

by Andrey Akinshin · 2022-03-29

In previous posts, I discussed the properties of Tukey’s fences and of an asymmetric decile-based outlier detector (Part 1, Part 2). In this post, I discuss a generalization of fence-based outlier detectors.

Notation

A Symmetric Fence-based outlier detector could be defined using the following range:

$$ SF(p, k) = [Q_p - k(Q_{1-p} - Q_p), Q_{1-p} + k(Q_{1-p} - Q_p)] $$

where $Q_s$ is an estimate of the $s^\textrm{th}$ quantile and $p \in [0, 0.5]$.

All the sample elements outside this range are marked as outliers. Using this notation, Tukey’s fences could be defined as $SF(0.25, k)$.
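As a quick worked example (assuming the standard normal distribution and using its true quartiles $Q_{0.25} \approx -0.6745$ and $Q_{0.75} \approx 0.6745$ in place of sample estimates), Tukey’s fences with the conventional $k = 1.5$ evaluate to:

$$ SF(0.25, 1.5) \approx [-0.6745 - 1.5 \cdot 1.3490,\; 0.6745 + 1.5 \cdot 1.3490] \approx [-2.698,\; 2.698]. $$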

An Asymmetric Fence-based outlier detector is defined using the following range:

$$ AF(p, k) = [Q_p - 2k(Q_{0.5} - Q_{p}), Q_{1-p} + 2k(Q_{1-p} - Q_{0.5})]. $$

An asymmetric decile-based outlier detector could be defined as $AF(0.1, k)$.
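The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names (`quantile`, `symmetric_fences`, `asymmetric_fences`, `outliers`) are hypothetical, and the linear-interpolation quantile estimator is just one possible choice for $Q_s$; any other estimator could be plugged in.

```python
import math


def quantile(sorted_xs, s):
    """Linear-interpolation quantile estimate (an assumed choice for Q_s)."""
    h = (len(sorted_xs) - 1) * s
    lo = math.floor(h)
    hi = min(lo + 1, len(sorted_xs) - 1)
    return sorted_xs[lo] + (h - lo) * (sorted_xs[hi] - sorted_xs[lo])


def symmetric_fences(xs, p, k):
    """SF(p, k): both fences are pushed out by k * (Q_{1-p} - Q_p)."""
    xs = sorted(xs)
    q_lo, q_hi = quantile(xs, p), quantile(xs, 1 - p)
    spread = q_hi - q_lo
    return q_lo - k * spread, q_hi + k * spread


def asymmetric_fences(xs, p, k):
    """AF(p, k): each fence uses its own distance to the median."""
    xs = sorted(xs)
    q_lo, q_mid, q_hi = quantile(xs, p), quantile(xs, 0.5), quantile(xs, 1 - p)
    return q_lo - 2 * k * (q_mid - q_lo), q_hi + 2 * k * (q_hi - q_mid)


def outliers(xs, fences):
    """Elements that fall outside the [lower, upper] range."""
    lower, upper = fences
    return [x for x in xs if x < lower or x > upper]


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data for illustration
print(outliers(sample, symmetric_fences(sample, 0.25, 1.5)))
print(outliers(sample, asymmetric_fences(sample, 0.10, 1.5)))
```

For symmetric data, the two half-widths in `asymmetric_fences` are equal and the result coincides with `symmetric_fences`; the detectors only differ on skewed samples.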

Simulation 1

Let’s perform the following experiment:

  • Enumerate two types of fence-based outlier detectors: asymmetric and symmetric.
  • Enumerate different $p$ values: $0.1$ (deciles) and $0.25$ (quartiles).
  • Enumerate different $k$ values: $1.0$, $1.5$, $2.0$, $2.5$, $3.0$, $3.5$, $4.0$.
  • Enumerate different distributions: the normal distribution, the exponential distribution, the Gumbel distribution.
  • For each combination of the above parameters, compute the fence values, taking $Q_s$ to be the true $s^\textrm{th}$ quantile of the distribution. Next, calculate the portion of the distribution that lies outside the fences. This gives the probability that a single observation is classified as an outlier.
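The experiment described above can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the names `DISTRIBUTIONS`, `fences`, and `outlier_probability` are hypothetical, and standard parameters are assumed for each distribution (the fences are built from quantiles, so the result does not change under shifting or rescaling).

```python
import math
from statistics import NormalDist

SQRT2 = math.sqrt(2.0)

# (cdf, survival function, quantile function) for the standard member of each
# family. Separate survival functions keep tiny upper-tail values accurate.
DISTRIBUTIONS = {
    "normal": (
        lambda x: 0.5 * math.erfc(-x / SQRT2),
        lambda x: 0.5 * math.erfc(x / SQRT2),
        NormalDist().inv_cdf,
    ),
    "exponential": (
        lambda x: -math.expm1(-x) if x > 0 else 0.0,
        lambda x: math.exp(-x) if x > 0 else 1.0,
        lambda s: -math.log(1 - s),
    ),
    "gumbel": (
        lambda x: math.exp(-math.exp(-x)),
        lambda x: -math.expm1(-math.exp(-x)),
        lambda s: -math.log(-math.log(s)),
    ),
}


def fences(kind, q, p, k):
    """Fence range for kind 'SF' or 'AF', given a true quantile function q."""
    if kind == "SF":
        spread = q(1 - p) - q(p)
        return q(p) - k * spread, q(1 - p) + k * spread
    if kind == "AF":
        return (q(p) - 2 * k * (q(0.5) - q(p)),
                q(1 - p) + 2 * k * (q(1 - p) - q(0.5)))
    raise ValueError(f"unknown detector type: {kind}")


def outlier_probability(kind, dist, p, k):
    """Probability mass of the distribution outside the fences."""
    cdf, sf, q = DISTRIBUTIONS[dist]
    lower, upper = fences(kind, q, p, k)
    return cdf(lower) + sf(upper)


for kind in ("AF", "SF"):
    for p in (0.10, 0.25):
        value = outlier_probability(kind, "exponential", p, 1.0)
        print(f"{kind} p={p:.2f} k=1.0 exponential: {value:.11f}")
```

For the exponential distribution the values have a closed form: for example, `AF` with $p = 0.25$ and $k = 1$ puts the upper fence at $\ln 16$, so the outlier probability is exactly $1/16 = 0.0625$.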

The results are below.

Normal distribution:

| type | p    | k   | outliers               |
|------|------|-----|------------------------|
| AF   | 0.10 | 1.0 | 0.00012072230941067211 |
| AF   | 0.10 | 1.5 | 0.00000029563871592760 |
| AF   | 0.10 | 2.0 | 0.00000000014767521496 |
| AF   | 0.10 | 2.5 | 0.00000000000001483506 |
| AF   | 0.10 | 3.0 | 0.00000000000000000015 |
| AF   | 0.10 | 3.5 | 0.00000000000000000000 |
| AF   | 0.10 | 4.0 | 0.00000000000000000000 |
| AF   | 0.25 | 1.0 | 0.04302479073838957196 |
| AF   | 0.25 | 1.5 | 0.00697660323928020812 |
| AF   | 0.25 | 2.0 | 0.00074502950319118666 |
| AF   | 0.25 | 2.5 | 0.00005189186759204938 |
| AF   | 0.25 | 3.0 | 0.00000234194246287498 |
| AF   | 0.25 | 3.5 | 0.00000006817407812073 |
| AF   | 0.25 | 4.0 | 0.00000000127585892633 |
| SF   | 0.10 | 1.0 | 0.00012072230941067211 |
| SF   | 0.10 | 1.5 | 0.00000029563871592760 |
| SF   | 0.10 | 2.0 | 0.00000000014767521496 |
| SF   | 0.10 | 2.5 | 0.00000000000001483506 |
| SF   | 0.10 | 3.0 | 0.00000000000000000015 |
| SF   | 0.10 | 3.5 | 0.00000000000000000000 |
| SF   | 0.10 | 4.0 | 0.00000000000000000000 |
| SF   | 0.25 | 1.0 | 0.04302479073838957196 |
| SF   | 0.25 | 1.5 | 0.00697660323928020812 |
| SF   | 0.25 | 2.0 | 0.00074502950319118666 |
| SF   | 0.25 | 2.5 | 0.00005189186759204938 |
| SF   | 0.25 | 3.0 | 0.00000234194246287498 |
| SF   | 0.25 | 3.5 | 0.00000006817407812073 |
| SF   | 0.25 | 4.0 | 0.00000000127585892633 |

Exponential distribution:

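| type | p    | k   | outliers      |
|------|------|-----|---------------|
| AF   | 0.10 | 1.0 | 0.00400000000 |
| AF   | 0.10 | 1.5 | 0.00080000000 |
| AF   | 0.10 | 2.0 | 0.00016000000 |
| AF   | 0.10 | 2.5 | 0.00003200000 |
| AF   | 0.10 | 3.0 | 0.00000640000 |
| AF   | 0.10 | 3.5 | 0.00000128000 |
| AF   | 0.10 | 4.0 | 0.00000025600 |
| AF   | 0.25 | 1.0 | 0.06250000000 |
| AF   | 0.25 | 1.5 | 0.03125000000 |
| AF   | 0.25 | 2.0 | 0.01562500000 |
| AF   | 0.25 | 2.5 | 0.00781250000 |
| AF   | 0.25 | 3.0 | 0.00390625000 |
| AF   | 0.25 | 3.5 | 0.00195312500 |
| AF   | 0.25 | 4.0 | 0.00097656250 |
| SF   | 0.10 | 1.0 | 0.01111111111 |
| SF   | 0.10 | 1.5 | 0.00370370370 |
| SF   | 0.10 | 2.0 | 0.00123456790 |
| SF   | 0.10 | 2.5 | 0.00041152263 |
| SF   | 0.10 | 3.0 | 0.00013717421 |
| SF   | 0.10 | 3.5 | 0.00004572474 |
| SF   | 0.10 | 4.0 | 0.00001524158 |
| SF   | 0.25 | 1.0 | 0.08333333333 |
| SF   | 0.25 | 1.5 | 0.04811252243 |
| SF   | 0.25 | 2.0 | 0.02777777778 |
| SF   | 0.25 | 2.5 | 0.01603750748 |
| SF   | 0.25 | 3.0 | 0.00925925926 |
| SF   | 0.25 | 3.5 | 0.00534583583 |
| SF   | 0.25 | 4.0 | 0.00308641975 |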

Gumbel distribution:

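| type | p    | k   | outliers         |
|------|------|-----|------------------|
| AF   | 0.10 | 1.0 | 0.00243138782251 |
| AF   | 0.10 | 1.5 | 0.00036996004078 |
| AF   | 0.10 | 2.0 | 0.00005624389383 |
| AF   | 0.10 | 2.5 | 0.00000854944973 |
| AF   | 0.10 | 3.0 | 0.00000129954752 |
| AF   | 0.10 | 3.5 | 0.00000019753535 |
| AF   | 0.10 | 4.0 | 0.00000003002599 |
| AF   | 0.25 | 1.0 | 0.05225343350821 |
| AF   | 0.25 | 1.5 | 0.02037237984519 |
| AF   | 0.25 | 2.0 | 0.00849982291839 |
| AF   | 0.25 | 2.5 | 0.00353655486394 |
| AF   | 0.25 | 3.0 | 0.00146932399013 |
| AF   | 0.25 | 3.5 | 0.00061008683006 |
| AF   | 0.25 | 4.0 | 0.00025325410919 |
| SF   | 0.10 | 1.0 | 0.00480943027492 |
| SF   | 0.10 | 1.5 | 0.00103073558090 |
| SF   | 0.10 | 2.0 | 0.00022057403284 |
| SF   | 0.10 | 2.5 | 0.00004718708372 |
| SF   | 0.10 | 3.0 | 0.00001009397656 |
| SF   | 0.10 | 3.5 | 0.00000215921115 |
| SF   | 0.10 | 4.0 | 0.00000046187726 |
| SF   | 0.25 | 1.0 | 0.05920771183234 |
| SF   | 0.25 | 1.5 | 0.02682956730034 |
| SF   | 0.25 | 2.0 | 0.01231232518267 |
| SF   | 0.25 | 2.5 | 0.00562770388698 |
| SF   | 0.25 | 3.0 | 0.00256759594377 |
| SF   | 0.25 | 3.5 | 0.00117046709217 |
| SF   | 0.25 | 4.0 | 0.00053336722084 |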

Simulation 2

Let’s perform the following experiment:

  • Enumerate two types of fence-based outlier detectors: asymmetric and symmetric.
  • Enumerate different $p$ values: $0.1$ (deciles) and $0.25$ (quartiles).
  • Enumerate different $k$ values: $1.0$, $1.5$, $2.0$, $2.5$, $3.0$, $3.5$, $4.0$.
  • Enumerate different distributions: the normal distribution, the exponential distribution, the Gumbel distribution.
  • Enumerate different sample sizes $n$: $5$, $10$, $50$, $100$, $500$, $1000$.
  • For each combination of the above parameters, compute the fence values, taking $Q_s$ to be the true $s^\textrm{th}$ quantile of the distribution. Next, calculate the probability of observing at least one outlier in a sample of size $n$.
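The last step can be sketched as follows, assuming the $n$ observations are independent, so that the probability of at least one outlier is $1 - (1 - q)^n$, where $q$ is the single-observation outlier probability from Simulation 1. The function name `prob_any_outlier` is hypothetical, and the value of `q` is copied from the Simulation 1 table (normal distribution, SF, $p = 0.25$, $k = 1.0$).

```python
import math


def prob_any_outlier(q, n):
    """P(at least one of n independent observations lies outside the fences).

    Computes 1 - (1 - q)^n via log1p/expm1 so that very small q stays accurate.
    """
    return -math.expm1(n * math.log1p(-q))


# Single-observation probability taken from the Simulation 1 table
# (normal distribution, SF, p = 0.25, k = 1.0).
q = 0.04302479073838957

for n in (5, 10, 50, 100, 500, 1000):
    print(n, f"{prob_any_outlier(q, n):.5f}")
```

With this $q$, the sketch gives roughly $0.197$ for $n = 5$ and $0.356$ for $n = 10$, which is in line with the first row of the corresponding table below.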

The results are below.

Normal distribution, SF, p=0.1:

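| k/n | 5      | 10      | 50      | 100     | 500     | 1000    |
|-----|--------|---------|---------|---------|---------|---------|
| 1   | 0.0006 | 0.00121 | 0.00602 | 0.01200 | 0.05858 | 0.11373 |
| 1.5 | 0.0000 | 0.00000 | 0.00001 | 0.00003 | 0.00015 | 0.00030 |
| 2   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 2.5 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 3   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 3.5 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 4   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |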

Normal distribution, SF, p=0.25:

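| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.19739 | 0.35582 | 0.88907 | 0.98770 | 1.00000 | 1.00000 |
| 1.5 | 0.03440 | 0.06762 | 0.29535 | 0.50347 | 0.96982 | 0.99909 |
| 2   | 0.00372 | 0.00743 | 0.03658 | 0.07182 | 0.31110 | 0.52541 |
| 2.5 | 0.00026 | 0.00052 | 0.00259 | 0.00518 | 0.02561 | 0.05057 |
| 3   | 0.00001 | 0.00002 | 0.00012 | 0.00023 | 0.00117 | 0.00234 |
| 3.5 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00003 | 0.00007 |
| 4   | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |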

Normal distribution, AF, p=0.1:

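| k/n | 5      | 10      | 50      | 100     | 500     | 1000    |
|-----|--------|---------|---------|---------|---------|---------|
| 1   | 0.0006 | 0.00121 | 0.00602 | 0.01200 | 0.05858 | 0.11373 |
| 1.5 | 0.0000 | 0.00000 | 0.00001 | 0.00003 | 0.00015 | 0.00030 |
| 2   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 2.5 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 3   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 3.5 | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 4   | 0.0000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |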

Normal distribution, AF, p=0.25:

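| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.19739 | 0.35582 | 0.88907 | 0.98770 | 1.00000 | 1.00000 |
| 1.5 | 0.03440 | 0.06762 | 0.29535 | 0.50347 | 0.96982 | 0.99909 |
| 2   | 0.00372 | 0.00743 | 0.03658 | 0.07182 | 0.31110 | 0.52541 |
| 2.5 | 0.00026 | 0.00052 | 0.00259 | 0.00518 | 0.02561 | 0.05057 |
| 3   | 0.00001 | 0.00002 | 0.00012 | 0.00023 | 0.00117 | 0.00234 |
| 3.5 | 0.00000 | 0.00000 | 0.00000 | 0.00001 | 0.00003 | 0.00007 |
| 4   | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |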

Exponential distribution, SF, p=0.1:

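| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.05433 | 0.10572 | 0.42803 | 0.67285 | 0.99625 | 0.99999 |
| 1.5 | 0.01838 | 0.03643 | 0.16934 | 0.31000 | 0.84359 | 0.97554 |
| 2   | 0.00616 | 0.01228 | 0.05990 | 0.11621 | 0.46080 | 0.70926 |
| 2.5 | 0.00206 | 0.00411 | 0.02037 | 0.04033 | 0.18601 | 0.33742 |
| 3   | 0.00069 | 0.00137 | 0.00684 | 0.01362 | 0.06629 | 0.12819 |
| 3.5 | 0.00023 | 0.00046 | 0.00228 | 0.00456 | 0.02260 | 0.04470 |
| 4   | 0.00008 | 0.00015 | 0.00076 | 0.00152 | 0.00759 | 0.01513 |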

Exponential distribution, SF, p=0.25:

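| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.35277 | 0.58110 | 0.98710 | 0.99983 | 1.00000 | 1.00000 |
| 1.5 | 0.21850 | 0.38926 | 0.91503 | 0.99278 | 1.00000 | 1.00000 |
| 2   | 0.13138 | 0.24551 | 0.75550 | 0.94022 | 1.00000 | 1.00000 |
| 2.5 | 0.07766 | 0.14928 | 0.55442 | 0.80146 | 0.99969 | 1.00000 |
| 3   | 0.04545 | 0.08883 | 0.37194 | 0.60554 | 0.99045 | 0.99991 |
| 3.5 | 0.02644 | 0.05219 | 0.23510 | 0.41493 | 0.93144 | 0.99530 |
| 4   | 0.01534 | 0.03044 | 0.14321 | 0.26591 | 0.78682 | 0.95455 |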

Exponential distribution, AF, p=0.1:

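| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.01984 | 0.03929 | 0.18160 | 0.33022 | 0.86521 | 0.98183 |
| 1.5 | 0.00399 | 0.00797 | 0.03923 | 0.07691 | 0.32979 | 0.55081 |
| 2   | 0.00080 | 0.00160 | 0.00797 | 0.01587 | 0.07689 | 0.14787 |
| 2.5 | 0.00016 | 0.00032 | 0.00160 | 0.00319 | 0.01587 | 0.03149 |
| 3   | 0.00003 | 0.00006 | 0.00032 | 0.00064 | 0.00319 | 0.00638 |
| 3.5 | 0.00001 | 0.00001 | 0.00006 | 0.00013 | 0.00064 | 0.00128 |
| 4   | 0.00000 | 0.00000 | 0.00001 | 0.00003 | 0.00013 | 0.00026 |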

Exponential distribution, AF, p=0.25:

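| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.27580 | 0.47554 | 0.96032 | 0.99843 | 1.00000 | 1.00000 |
| 1.5 | 0.14678 | 0.27202 | 0.79555 | 0.95820 | 1.00000 | 1.00000 |
| 2   | 0.07572 | 0.14571 | 0.54498 | 0.79296 | 0.99962 | 1.00000 |
| 2.5 | 0.03846 | 0.07543 | 0.32440 | 0.54357 | 0.98019 | 0.99961 |
| 3   | 0.01938 | 0.03838 | 0.17774 | 0.32388 | 0.85871 | 0.98004 |
| 3.5 | 0.00973 | 0.01936 | 0.09313 | 0.17758 | 0.62376 | 0.85844 |
| 4   | 0.00487 | 0.00972 | 0.04768 | 0.09308 | 0.38647 | 0.62358 |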

Gumbel distribution, SF, p=0.1:

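| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.02382 | 0.04707 | 0.21420 | 0.38252 | 0.91023 | 0.99194 |
| 1.5 | 0.00514 | 0.01026 | 0.05026 | 0.09799 | 0.40288 | 0.64345 |
| 2   | 0.00110 | 0.00220 | 0.01097 | 0.02182 | 0.10443 | 0.19796 |
| 2.5 | 0.00024 | 0.00047 | 0.00236 | 0.00471 | 0.02332 | 0.04609 |
| 3   | 0.00005 | 0.00010 | 0.00050 | 0.00101 | 0.00503 | 0.01004 |
| 3.5 | 0.00001 | 0.00002 | 0.00011 | 0.00022 | 0.00108 | 0.00216 |
| 4   | 0.00000 | 0.00000 | 0.00002 | 0.00005 | 0.00023 | 0.00046 |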

Gumbel distribution, SF, p=0.25:

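| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.26300 | 0.45683 | 0.95272 | 0.99776 | 1.00000 | 1.00000 |
| 1.5 | 0.12714 | 0.23812 | 0.74329 | 0.93410 | 1.00000 | 1.00000 |
| 2   | 0.06006 | 0.11652 | 0.46175 | 0.71029 | 0.99796 | 1.00000 |
| 2.5 | 0.02782 | 0.05487 | 0.24586 | 0.43128 | 0.94050 | 0.99646 |
| 3   | 0.01277 | 0.02538 | 0.12063 | 0.22670 | 0.72347 | 0.92353 |
| 3.5 | 0.00584 | 0.01164 | 0.05688 | 0.11052 | 0.44322 | 0.68999 |
| 4   | 0.00266 | 0.00532 | 0.02632 | 0.05195 | 0.23414 | 0.41346 |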

Gumbel distribution, AF, p=0.1:

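| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.01210 | 0.02405 | 0.11460 | 0.21607 | 0.70393 | 0.91235 |
| 1.5 | 0.00185 | 0.00369 | 0.01833 | 0.03633 | 0.16891 | 0.30929 |
| 2   | 0.00028 | 0.00056 | 0.00281 | 0.00561 | 0.02773 | 0.05469 |
| 2.5 | 0.00004 | 0.00009 | 0.00043 | 0.00085 | 0.00427 | 0.00851 |
| 3   | 0.00001 | 0.00001 | 0.00006 | 0.00013 | 0.00065 | 0.00130 |
| 3.5 | 0.00000 | 0.00000 | 0.00001 | 0.00002 | 0.00010 | 0.00020 |
| 4   | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00002 | 0.00003 |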

Gumbel distribution, AF, p=0.25:

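| k/n | 5       | 10      | 50      | 100     | 500     | 1000    |
|-----|---------|---------|---------|---------|---------|---------|
| 1   | 0.23535 | 0.41531 | 0.93167 | 0.99533 | 1.00000 | 1.00000 |
| 1.5 | 0.09780 | 0.18603 | 0.64269 | 0.87233 | 0.99997 | 1.00000 |
| 2   | 0.04178 | 0.08182 | 0.34741 | 0.57413 | 0.98599 | 0.99980 |
| 2.5 | 0.01756 | 0.03481 | 0.16234 | 0.29832 | 0.82991 | 0.97107 |
| 3   | 0.00733 | 0.01460 | 0.07088 | 0.13674 | 0.52059 | 0.77017 |
| 3.5 | 0.00305 | 0.00608 | 0.03005 | 0.05920 | 0.26298 | 0.45680 |
| 4   | 0.00127 | 0.00253 | 0.01258 | 0.02501 | 0.11895 | 0.22375 |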