Fast Computation of the Shamos Estimator via Monahan-Style Selection
The Shamos scale estimator (Shamos 1976) is the median of all pairwise absolute differences.
Given a sample $\mathbf{x}=(x_1,\dots,x_n)$, define
$$ \operatorname{Shamos}(\mathbf{x}) = \operatorname{median}_{i<j}\,|x_i-x_j|. $$
It is robust to outliers and easy to interpret. The usual Gaussian-consistent version multiplies $\operatorname{Shamos}$ by
$$ c_\infty=\frac{1}{\sqrt{2}\,\Phi^{-1}(0.75)}\approx 1.048358, $$so that $c_\infty\cdot\operatorname{Shamos}$ estimates the standard deviation under normal data.
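Numerically (a minimal sketch; the value of $\Phi^{-1}(0.75)$ is hard-coded because the BCL ships no inverse normal CDF):

```csharp
using System;

public static class ShamosConstants
{
    // Φ⁻¹(0.75); hard-coded since the BCL has no inverse normal CDF.
    public const double Z75 = 0.6744897501960817;

    // Gaussian-consistency factor c∞ = 1 / (√2 · Φ⁻¹(0.75)) ≈ 1.048358.
    public static double CInf() => 1.0 / (Math.Sqrt(2.0) * Z75);
}
```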
A naïve algorithm enumerates all $N=\tfrac{n(n-1)}{2}$ pairs, stores their absolute differences, and sorts them: $O(n^2)$ memory and $O(n^2\log n)$ time. For $n=10{,}000$, that’s $\sim50$ million differences — wasteful.
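The quadratic baseline is only a few lines and makes a handy test oracle for the fast algorithm developed below (a sketch; the class and method names are illustrative):

```csharp
using System;
using System.Collections.Generic;

public static class ShamosNaive
{
    // O(n^2 log n) oracle: materialize all pairwise |x_i - x_j|,
    // sort them, and take the median.
    public static double Shamos(IReadOnlyList<double> x)
    {
        int n = x.Count;
        if (n < 2) throw new ArgumentException("Need at least two observations.");
        var d = new List<double>(n * (n - 1) / 2);
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                d.Add(Math.Abs(x[i] - x[j]));
        d.Sort();
        int N = d.Count;
        return N % 2 == 1 ? d[N / 2] : 0.5 * (d[N / 2 - 1] + d[N / 2]);
    }
}
```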
We can do better. The same structural trick that accelerates Hodges–Lehmann (Monahan 1984; Akinshin 2025) also yields an exact $O(n\log n)$ algorithm for Shamos.
The Problem Structure
Sort once: $y_1\le \cdots \le y_n$. Then every $|x_i-x_j|$ with $i\ne j$ equals $y_{j'}-y_{i'}$ for some $i'<j'$, so it suffices to consider the differences $y_j-y_i$ with $i<j$.
Consider the implicit upper-triangular matrix
$$ D_{i,j} = y_j - y_i,\qquad 1\le i<j\le n. $$
- Row monotonicity. For fixed $i$, $D_{i,i+1} \le D_{i,i+2} \le \cdots \le D_{i,n}$.
- Column monotonicity (in $i$). For fixed $j$, as $i$ increases, $y_j-y_i$ decreases.
- No need to materialize $D$. We can count how many entries are below a pivot in $O(n)$ by a single two-pointer sweep across rows.
The median of all pairwise absolute differences is simply the median of this implicit matrix.
Selection Instead of Sorting
We seek the $k$-th order statistic among the $N=\tfrac{n(n-1)}{2}$ differences. Let
$$ k_\text{low}=\left\lfloor\frac{N+1}{2}\right\rfloor,\qquad k_\text{high}=\left\lfloor\frac{N+2}{2}\right\rfloor. $$For even $N$ the answer is the average of the $k_\text{low}$-th and $k_\text{high}$-th differences.
We adapt Monahan’s selection scheme:
- Per-row bounds. For every row $i$, maintain an active interval $[L_i,R_i]$ of columns $j$ (initially $L_i=i+1$, $R_i=n-1$ in $0$-based indices), representing candidates still in play.
- Pivot choice. Pick a “plausible” difference as pivot (e.g., the median element of a random active row). Randomized selection keeps the expected number of iterations logarithmic.
- Partition by counting. In a single sweep, count how many $D_{i,j}$ are strictly less than the pivot. Thanks to row monotonicity, this costs $O(n)$.
- Compare to target rank(s).
- If fewer than $k_\text{low}$ are below the pivot, the median is at or above the pivot.
- If more are below, the median is below the pivot.
- If the count lands exactly on $k_\text{low}$ or $k_\text{high}-1$, the neighboring values around the pivot give the answer directly.
- Shrink bounds. Update $[L_i,R_i]$ so that only entries consistent with the new half-space remain.
Repeat until the active set is tiny (then resolve by midrange of the remaining candidates).
Counting in Linear Time
With $y$ sorted, the predicate $y_j-y_i<p$ is monotone: for fixed $i$ it can only flip from true to false as $j$ grows, and for fixed $j$ it can only flip from false to true as $i$ grows.
Maintain a pointer $j$ that never moves left:
- For each row $i=1,\dots,n-1$, advance $j$ while $y_j-y_i<p$; row $i$ then contributes $j-(i+1)$ differences below the pivot.
- Track the largest difference $<p$ and the smallest difference $\ge p$ while sweeping; these are the immediate neighbors around the pivot and yield the exact median when the rank boundary is hit.
This delivers a full partition (and the boundary neighbors) in $O(n)$ time and $O(1)$ extra memory.
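The counting sweep in isolation looks like this (a sketch over a sorted array `y` and pivot `p`; the class and method names are illustrative):

```csharp
using System;

public static class PairwiseCount
{
    // Counts pairs (i, j), i < j, with y[j] - y[i] < p in O(n) using one
    // non-decreasing pointer j; y must be sorted ascending.
    // Also reports the boundary neighbors: the largest difference < p
    // and the smallest difference >= p.
    public static long CountBelow(double[] y, double p,
        out double largestBelow, out double smallestAtOrAbove)
    {
        int n = y.Length;
        long count = 0;
        largestBelow = double.NegativeInfinity;
        smallestAtOrAbove = double.PositiveInfinity;
        int j = 1; // never moves left across rows
        for (int i = 0; i < n - 1; i++)
        {
            if (j < i + 1) j = i + 1;
            while (j < n && y[j] - y[i] < p) j++;
            count += j - (i + 1);
            // last below-pivot entry of row i is at column j-1 (if any)
            if (j > i + 1) largestBelow = Math.Max(largestBelow, y[j - 1] - y[i]);
            // first at-or-above entry of row i is at column j (if any)
            if (j < n) smallestAtOrAbove = Math.Min(smallestAtOrAbove, y[j] - y[i]);
        }
        return count;
    }
}
```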
Handling Ties and Stalling
Discrete or tied data can cause the below-count to repeat across iterations.
Use midrange tie-breaking in the active set:
- Compute the smallest and largest active differences implied by current bounds $[L_i,R_i]$.
- Set the next pivot to their average.
- If min = max, all remaining candidates are identical — return it.
This guarantees convergence without sacrificing exactness.
Randomized Pivot Strategy
Weighted medians of row medians work but add overhead. Following Hoare’s FIND
and Monahan’s practice, pick:
- A random active row (weighted by its current length), then
- The median element of that row as the pivot.
This choice yields $O(\log n)$ expected iterations, so overall cost is sorting $O(n\log n)$ plus a handful of linear partitions.
Even and Odd Cases
For odd $N$, the median is a single matrix element.
For even $N$, return the average of the two middle elements, which are exactly the largest difference below the rank boundary and the smallest difference at/above it. The linear sweep computes both for free.
Complexity
- Sort once: $O(n\log n)$.
- Each partition: $O(n)$.
- Expected iterations: $O(\log n)$.
- Total: $O(n\log n)$ time, $O(n)$ extra space.
No quadratic arrays, no approximations.
Practical Notes
- Exactness. The algorithm returns the exact median of all $\binom{n}{2}$ differences. Randomness affects only runtime, not correctness.
- Duplicates & zeros. Fully supported — long flat regions do not break selection; midrange pivoting handles stalls.
- Scaling. If you need a normal-consistent scale, multiply by $c_\infty\approx 1.048358$ as shown above. This keeps the robust breakdown while matching $\sigma$ under Gaussian data.
- Extensions. The same machinery computes any quantile of pairwise differences (useful for robust confidence intervals or dispersion bands).
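For an arbitrary quantile $q$ instead of the median, only the two target ranks (and an interpolation weight) change. A sketch, assuming the Hyndman–Fan type-7 convention, which reduces to the median formulas above at $q=0.5$; pick whichever convention your application needs:

```csharp
using System;

public static class QuantileExt
{
    // Hypothetical helper: 1-based ranks of the order statistics to select
    // among the N = n(n-1)/2 pairwise differences for quantile q in [0, 1].
    // Type-7 (Hyndman–Fan) convention assumed: h = q*(N-1) + 1; the result
    // is (1 - weight)*v_kLow + weight*v_kHigh.
    public static (long kLow, long kHigh, double weight) QuantileRanks(long N, double q)
    {
        double h = q * (N - 1) + 1.0;
        long kLow = (long)Math.Floor(h);
        long kHigh = Math.Min(kLow + 1, N);
        double w = h - kLow; // interpolation weight toward the kHigh-th value
        return (kLow, kHigh, w);
    }
}
```

At $q=0.5$ with $N=10$ this yields ranks 5 and 6 with weight $0.5$, i.e. the familiar average of the two middle differences.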
Conclusion
The Shamos estimator can be computed exactly in $O(n\log n)$ time by treating pairwise differences as a monotone implicit matrix and applying a Monahan-style selection procedure. You pay for one sort, then a small number of linear passes — no $n^2$ arrays, no lossy approximations.
Reference Implementation
public static double ShamosFast(IReadOnlyList<double> values, Random? rng = null, bool copyAndSort = true)
{
int n = values.Count;
if (n < 2) throw new ArgumentException("Shamos Spread requires at least two observations.", nameof(values));
if (n == 2) return Math.Abs(values[1] - values[0]);
// Prepare a sorted working copy.
double[] a = copyAndSort ? CopySorted(values) : EnsureSorted(values);
// Total number of pairwise differences with i < j
long N = (long)n * (n - 1) / 2;
long kLow = (N + 1) / 2; // 1-based rank of lower middle
long kHigh = (N + 2) / 2; // 1-based rank of upper middle
// Per-row active bounds over columns j (0-based indices).
// Row i allows j in [i+1, n-1] initially.
int[] L = new int[n];
int[] R = new int[n];
long[] rowCounts = new long[n]; // # of elements in row i that are < pivot (for current partition)
for (int i = 0; i < n; i++)
{
L[i] = Math.Min(i + 1, n); // n means empty
R[i] = n - 1; // inclusive
if (L[i] > R[i])
{
L[i] = 1;
R[i] = 0;
} // mark empty
}
rng ??= new Random(0xC0FFEE);
// A reasonable initial pivot: a central gap (works for both parities of n)
double pivot = a[(n + 1) / 2] - a[(n - 1) / 2];
long prevCountBelow = -1;
while (true)
{
// === PARTITION: count how many differences are < pivot; also track boundary neighbors ===
long countBelow = 0;
double largestBelow = double.NegativeInfinity; // max difference < pivot
double smallestAtOrAbove = double.PositiveInfinity; // min difference >= pivot
int j = 1; // global two-pointer (non-decreasing across rows)
for (int i = 0; i < n - 1; i++)
{
if (j < i + 1) j = i + 1;
while (j < n && a[j] - a[i] < pivot) j++;
long cntRow = j - (i + 1);
if (cntRow < 0) cntRow = 0;
rowCounts[i] = cntRow;
countBelow += cntRow;
// boundary elements for this row
if (cntRow > 0)
{
// last < pivot in this row is (j-1)
double candBelow = a[j - 1] - a[i];
if (candBelow > largestBelow) largestBelow = candBelow;
}
if (j < n)
{
double candAtOrAbove = a[j] - a[i];
if (candAtOrAbove < smallestAtOrAbove) smallestAtOrAbove = candAtOrAbove;
}
}
// === TARGET CHECK ===
// If we've split exactly at the middle, we can return using the boundaries we just found.
bool atTarget =
(countBelow == kLow) || // lower middle is the largest < pivot
(countBelow == (kHigh - 1)); // upper middle is the smallest >= pivot
if (atTarget)
{
if (kLow < kHigh)
{
// Even N: average the two central order stats.
return 0.5 * (largestBelow + smallestAtOrAbove);
}
else
{
// Odd N: pick the single middle depending on which side we hit.
bool needLargest = (countBelow == kLow);
return needLargest ? largestBelow : smallestAtOrAbove;
}
}
// === STALL HANDLING (ties / no progress) ===
if (countBelow == prevCountBelow)
{
// Compute min/max remaining difference in the ACTIVE set and pivot to their midrange.
double minActive = double.PositiveInfinity;
double maxActive = double.NegativeInfinity;
long active = 0;
for (int i = 0; i < n - 1; i++)
{
int Li = L[i], Ri = R[i];
if (Li > Ri) continue;
double rowMin = a[Li] - a[i];
double rowMax = a[Ri] - a[i];
if (rowMin < minActive) minActive = rowMin;
if (rowMax > maxActive) maxActive = rowMax;
active += (Ri - Li + 1);
}
if (active <= 0)
{
// No active candidates left: the only consistent answer is the boundary implied by counts.
// Fall back to neighbors from this partition.
if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
}
if (maxActive <= minActive) return minActive; // all remaining equal
double mid = 0.5 * (minActive + maxActive);
pivot = (mid > minActive && mid <= maxActive) ? mid : maxActive;
prevCountBelow = countBelow;
continue;
}
// === SHRINK ACTIVE WINDOW ===
if (countBelow < kLow)
{
// Need larger differences: discard all strictly below pivot.
for (int i = 0; i < n - 1; i++)
{
// First j with a[j] - a[i] >= pivot is j = i + 1 + cntRow (may be n => empty row)
int newL = i + 1 + (int)rowCounts[i];
if (newL > L[i]) L[i] = newL; // do NOT clamp; allow L[i] == n to mean empty
if (L[i] > R[i])
{
L[i] = 1;
R[i] = 0;
} // mark empty
}
}
else
{
// Too many below: keep only those strictly below pivot.
for (int i = 0; i < n - 1; i++)
{
// Last j with a[j] - a[i] < pivot is j = i + cntRow (not cntRow-1!)
int newR = i + (int)rowCounts[i];
if (newR < R[i]) R[i] = newR; // shrink downward to the true last-below
if (R[i] < i + 1)
{
L[i] = 1;
R[i] = 0;
} // empty row if none remain
}
}
prevCountBelow = countBelow;
// === CHOOSE NEXT PIVOT FROM ACTIVE SET (weighted random row, then row median) ===
long activeSize = 0;
for (int i = 0; i < n - 1; i++)
{
if (L[i] <= R[i]) activeSize += (R[i] - L[i] + 1);
}
if (activeSize <= 2)
{
// Few candidates left: return midrange of remaining exactly.
double minRem = double.PositiveInfinity, maxRem = double.NegativeInfinity;
for (int i = 0; i < n - 1; i++)
{
if (L[i] > R[i]) continue;
double lo = a[L[i]] - a[i];
double hi = a[R[i]] - a[i];
if (lo < minRem) minRem = lo;
if (hi > maxRem) maxRem = hi;
}
if (activeSize <= 0) // safety net; fall back to boundary from last partition
{
if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
}
if (kLow < kHigh) return 0.5 * (minRem + maxRem);
// Odd N with at most two candidate values left: decide exactly by
// counting how many differences lie strictly below the larger candidate.
long belowMax = 0;
for (int i = 0, jj = 1; i < n - 1; i++)
{
if (jj < i + 1) jj = i + 1;
while (jj < n && a[jj] - a[i] < maxRem) jj++;
belowMax += jj - (i + 1);
}
return (belowMax >= kLow) ? minRem : maxRem;
}
else
{
long t = NextIndex(rng, activeSize); // 0..activeSize-1
long acc = 0;
int row = 0;
for (; row < n - 1; row++)
{
if (L[row] > R[row]) continue;
long size = R[row] - L[row] + 1;
if (t < acc + size) break;
acc += size;
}
// Median column of the selected row
int col = (L[row] + R[row]) >> 1;
pivot = a[col] - a[row];
}
}
}
// --- Helpers ---
private static double[] CopySorted(IReadOnlyList<double> values)
{
var a = new double[values.Count];
for (int i = 0; i < a.Length; i++)
{
double v = values[i];
if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
a[i] = v;
}
Array.Sort(a);
return a;
}
private static double[] EnsureSorted(IReadOnlyList<double> values)
{
// Trust caller; still copy to array for fast indexed access.
var a = new double[values.Count];
for (int i = 0; i < a.Length; i++)
{
double v = values[i];
if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
a[i] = v;
}
return a;
}
private static long NextIndex(Random rng, long limitExclusive)
{
// Uniform 0..limitExclusive-1. Random.NextInt64 (.NET 6+) handles ranges
// beyond int.MaxValue without modulo bias. (Composing two rng.Next() calls
// would leave bits 31 and 63 always zero, since Next() yields only 31 bits.)
return rng.NextInt64(limitExclusive);
}
References
- Shamos, M. I. (1976). Geometry and Statistics: Problems at the Interface.
- Monahan, J. F. (1984). Algorithm 616: Fast Computation of the Hodges–Lehmann Location Estimator.
- Akinshin, A. (2025). Fast Computation of the Hodges–Lehmann Estimator with the Monahan Algorithm.