Fast Computation of the Shamos Estimator via Monahan-Style Selection

Andrey Akinshin · 2025-09-30

The Shamos scale estimator (Shamos, 1976) is the median of all pairwise absolute differences.
Given a sample $\mathbf{x}=(x_1,\dots,x_n)$, define

$$ \operatorname{Shamos}(\mathbf{x}) = \underset{1 \leq i < j \leq n}{\operatorname{Median}} |x_i - x_j| $$

It is robust to outliers and easy to interpret. The usual Gaussian-consistent version multiplies $\operatorname{Shamos}$ by

$$ c_\infty=\frac{1}{\sqrt{2}\,\Phi^{-1}(0.75)}\approx 1.048358, $$

so that $c_\infty\cdot\operatorname{Shamos}$ estimates the standard deviation under normal data.
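
Numerically, $\Phi^{-1}(0.75)\approx 0.674490$, so

$$ c_\infty \approx \frac{1}{1.414214 \cdot 0.674490} \approx \frac{1}{0.953873} \approx 1.048358. $$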

A naïve algorithm enumerates all $N=\tfrac{n(n-1)}{2}$ pairs, stores their absolute differences, and sorts them: $O(n^2)$ memory and $O(n^2\log n)$ time. For $n=10{,}000$, that’s $\sim50$ million differences — wasteful.
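
For small samples, the naïve approach still makes a handy correctness oracle. A minimal sketch (the name ShamosNaive is ours, not part of the reference implementation below):

public static double ShamosNaive(IReadOnlyList<double> values)
{
    // O(n^2) memory and O(n^2 log n) time: materialize every pairwise |x_i - x_j|,
    // sort, and take the median. Intended only for small n (the capacity below is an int).
    int n = values.Count;
    if (n < 2) throw new ArgumentException("Need at least two observations.", nameof(values));
    var diffs = new List<double>(n * (n - 1) / 2);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            diffs.Add(Math.Abs(values[i] - values[j]));
    diffs.Sort();
    int m = diffs.Count;
    return m % 2 == 1 ? diffs[m / 2] : 0.5 * (diffs[m / 2 - 1] + diffs[m / 2]);
}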

We can do better. The same structural trick that accelerates the Hodges–Lehmann estimator (Monahan, 1984; Akinshin, 2025) also yields an exact $O(n\log n)$ algorithm for Shamos.

The Problem Structure

Sort once: $y_1\le \cdots \le y_n$. Then every $|x_i-x_j|$ with $i<j$ corresponds to a difference $y_j-y_i\ge 0$ of the sorted sample.

Consider the implicit upper-triangular matrix

$$ D_{i,j} = y_j - y_i,\qquad 1\le i < j \le n $$

Key properties:

  • Row monotonicity. For fixed $i$, $D_{i,i+1} \le D_{i,i+2} \le \cdots \le D_{i,n}$.
  • Column monotonicity (in $i$). For fixed $j$, as $i$ increases, $y_j-y_i$ decreases.
  • No need to materialize $D$. We can count how many entries are below a pivot in $O(n)$ by a single two-pointer sweep across rows.

The median of all pairwise absolute differences is simply the median of this implicit matrix.
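
For a tiny example, take $y=(1,2,4,8)$: the entries of $D$ are $1,3,7$ (row 1), $2,6$ (row 2), and $4$ (row 3). Sorted, they read $1,2,3,4,6,7$, so the Shamos estimate is $\tfrac{3+4}{2}=3.5$.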

Selection Instead of Sorting

We seek the $k$-th order statistic among the $N=\tfrac{n(n-1)}{2}$ differences. Let

$$ k_\text{low}=\left\lfloor\frac{N+1}{2}\right\rfloor,\qquad k_\text{high}=\left\lfloor\frac{N+2}{2}\right\rfloor. $$

For even $N$ the answer is the average of the $k_\text{low}$-th and $k_\text{high}$-th differences.
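
For example, $n=5$ gives $N=10$, $k_\text{low}=5$, $k_\text{high}=6$; $n=6$ gives $N=15$ and $k_\text{low}=k_\text{high}=8$.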

We adapt Monahan’s selection scheme:

  1. Per-row bounds. For every row $i$, maintain an active interval $[L_i,R_i]$ of columns $j$ (initially $L_i=i+1$, $R_i=n-1$ in $0$-based indices), representing candidates still in play.
  2. Pivot choice. Pick a “plausible” difference as pivot (e.g., the median element of a random active row). Randomized selection keeps the expected number of iterations logarithmic.
  3. Partition by counting. In a single sweep, count how many $D_{i,j}$ are strictly less than the pivot. Thanks to row monotonicity, this costs $O(n)$.
  4. Compare to target rank(s).
    • If fewer than $k_\text{low}$ are below the pivot, the median is at or above the pivot.
    • If more are below, the median is below the pivot.
    • If the count lands exactly on $k_\text{low}$ or $k_\text{high}-1$, the neighboring values around the pivot give the answer directly.
  5. Shrink bounds. Update $[L_i,R_i]$ so that only entries consistent with the new half-space remain.

Repeat until the active set is tiny (then resolve by midrange of the remaining candidates).

Counting in Linear Time

With $y$ sorted, the predicate $y_j-y_i < p$ is monotone along each row: once it fails at some $j$, it fails for every larger $j$. Maintain a pointer $j$ that never moves left:

  • For each row $i=1,\dots,n-1$, advance $j$ while $y_j-y_i < p$; the number of entries of row $i$ below $p$ is then $j-i-1$.
  • Track the largest difference $< p$ and the smallest difference $\ge p$ while sweeping; these are the immediate neighbors around the pivot and yield the exact median when the rank boundary is hit.

This delivers a full partition (and the boundary neighbors) in $O(n)$ time and $O(1)$ extra memory.
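
As a self-contained sketch of just the counting step (a hypothetical CountBelow helper; the reference implementation below inlines the same sweep and additionally records the boundary neighbors):

// Count pairs (i, j) with i < j whose difference a[j] - a[i] is strictly below p.
// a must be sorted ascending; the pointer j never moves left, so the sweep is O(n).
private static long CountBelow(double[] a, double p)
{
    long count = 0;
    int j = 1;
    for (int i = 0; i < a.Length - 1; i++)
    {
        if (j < i + 1) j = i + 1;
        while (j < a.Length && a[j] - a[i] < p) j++;
        count += j - (i + 1);
    }
    return count;
}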

Handling Ties and Stalling

Discrete or tied data can cause the below-count to repeat across iterations.
Use midrange tie-breaking in the active set:

  • Compute the smallest and largest active differences implied by current bounds $[L_i,R_i]$.
  • Set the next pivot to their average.
  • If min = max, all remaining candidates are identical — return it.

This guarantees convergence without sacrificing exactness.
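
For instance, if the remaining active differences are all either $0$ or $2$ (heavily tied data), the midrange pivot $1$ separates the two groups in a single partition.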

Randomized Pivot Strategy

Weighted medians of row medians work but add overhead. Following Hoare’s FIND and Monahan’s practice, pick:

  • A random active row (weighted by its current length), then
  • The median element of that row as the pivot.

This choice yields $O(\log n)$ expected iterations, so overall cost is sorting $O(n\log n)$ plus a handful of linear partitions.
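
A sketch of that choice (a hypothetical PickPivot helper; a is the sorted sample and L, R are the per-row active bounds used in the reference implementation below):

// Choose an active row with probability proportional to its current length,
// then use the median column of that row as the next pivot.
private static double PickPivot(double[] a, int[] L, int[] R, long activeSize, Random rng)
{
    long t = Math.Min((long)(rng.NextDouble() * activeSize), activeSize - 1); // flattened index
    for (int row = 0; row < a.Length - 1; row++)
    {
        if (L[row] > R[row]) continue;       // skip empty rows
        long size = R[row] - L[row] + 1;
        if (t < size)
        {
            int col = (L[row] + R[row]) / 2; // median column of the chosen row
            return a[col] - a[row];
        }
        t -= size;
    }
    throw new InvalidOperationException("activeSize is inconsistent with the bounds.");
}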

Even and Odd Cases

For odd $N$, the median is a single matrix element.
For even $N$, return the average of the two middle elements, which are exactly the largest difference below the rank boundary and the smallest difference at/above it. The linear sweep computes both for free.

Complexity

  • Sort once: $O(n\log n)$.
  • Each partition: $O(n)$.
  • Expected iterations: $O(\log n)$.
  • Total: $O(n\log n)$ time, $O(n)$ extra space.
    No quadratic arrays, no approximations.

Practical Notes

  • Exactness. The algorithm returns the exact median of all $\binom{n}{2}$ differences. Randomness affects only runtime, not correctness.
  • Duplicates & zeros. Fully supported — long flat regions do not break selection; midrange pivoting handles stalls.
  • Scaling. If you need a normal-consistent scale, multiply by $c_\infty\approx 1.048358$ as shown above (see the usage snippet after this list). This keeps the robust breakdown while matching $\sigma$ under Gaussian data.
  • Extensions. The same machinery computes any quantile of pairwise differences (useful for robust confidence intervals or dispersion bands).
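
For example, combining the reference implementation below with the consistency factor from the Scaling note (the data values here are arbitrary illustration numbers):

double[] data = { 1.2, 3.4, 2.2, 9.9, 2.8, 3.1 };
double spread = ShamosFast(data);      // median of all pairwise |x_i - x_j|
double sigmaHat = 1.048358 * spread;   // Gaussian-consistent scale estimate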

Conclusion

The Shamos estimator can be computed exactly in $O(n\log n)$ time by treating pairwise differences as a monotone implicit matrix and applying a Monahan-style selection procedure. You pay for one sort, then a small number of linear passes — no $n^2$ arrays, no lossy approximations.

Reference Implementation

public static double ShamosFast(IReadOnlyList<double> values, Random? rng = null, bool copyAndSort = true)
{
    int n = values.Count;
    if (n < 2) throw new ArgumentException("Shamos Spread requires at least two observations.", nameof(values));
    if (n == 2) return Math.Abs(values[1] - values[0]);

    // Prepare a sorted working copy.
    double[] a = copyAndSort ? CopySorted(values) : EnsureSorted(values);

    // Total number of pairwise differences with i < j
    long N = (long)n * (n - 1) / 2;
    long kLow = (N + 1) / 2; // 1-based rank of lower middle
    long kHigh = (N + 2) / 2; // 1-based rank of upper middle

    // Per-row active bounds over columns j (0-based indices).
    // Row i allows j in [i+1, n-1] initially.
    int[] L = new int[n];
    int[] R = new int[n];
    long[] rowCounts = new long[n]; // # of elements in row i that are < pivot (for current partition)

    for (int i = 0; i < n; i++)
    {
        L[i] = Math.Min(i + 1, n); // n means empty
        R[i] = n - 1; // inclusive
        if (L[i] > R[i])
        {
            L[i] = 1;
            R[i] = 0;
        } // mark empty
    }

    rng ??= new Random(0xC0FFEE);

    // A reasonable initial pivot: a central gap of the sorted sample
    double pivot = a[(n + 1) / 2] - a[(n - 1) / 2];

    long prevCountBelow = -1;

    while (true)
    {
        // === PARTITION: count how many differences are < pivot; also track boundary neighbors ===
        long countBelow = 0;
        double largestBelow = double.NegativeInfinity; // max difference < pivot
        double smallestAtOrAbove = double.PositiveInfinity; // min difference >= pivot

        int j = 1; // global two-pointer (non-decreasing across rows)
        for (int i = 0; i < n - 1; i++)
        {
            if (j < i + 1) j = i + 1;
            while (j < n && a[j] - a[i] < pivot) j++;

            long cntRow = j - (i + 1);
            if (cntRow < 0) cntRow = 0;
            rowCounts[i] = cntRow;
            countBelow += cntRow;

            // boundary elements for this row
            if (cntRow > 0)
            {
                // last < pivot in this row is (j-1)
                double candBelow = a[j - 1] - a[i];
                if (candBelow > largestBelow) largestBelow = candBelow;
            }

            if (j < n)
            {
                double candAtOrAbove = a[j] - a[i];
                if (candAtOrAbove < smallestAtOrAbove) smallestAtOrAbove = candAtOrAbove;
            }
        }

        // === TARGET CHECK ===
        // If we've split exactly at the middle, we can return using the boundaries we just found.
        bool atTarget =
            (countBelow == kLow) || // lower middle is the largest < pivot
            (countBelow == (kHigh - 1)); // upper middle is the smallest >= pivot

        if (atTarget)
        {
            if (kLow < kHigh)
            {
                // Even N: average the two central order stats.
                return 0.5 * (largestBelow + smallestAtOrAbove);
            }
            else
            {
                // Odd N: pick the single middle depending on which side we hit.
                bool needLargest = (countBelow == kLow);
                return needLargest ? largestBelow : smallestAtOrAbove;
            }
        }

        // === STALL HANDLING (ties / no progress) ===
        if (countBelow == prevCountBelow)
        {
            // Compute min/max remaining difference in the ACTIVE set and pivot to their midrange.
            double minActive = double.PositiveInfinity;
            double maxActive = double.NegativeInfinity;
            long active = 0;

            for (int i = 0; i < n - 1; i++)
            {
                int Li = L[i], Ri = R[i];
                if (Li > Ri) continue;

                double rowMin = a[Li] - a[i];
                double rowMax = a[Ri] - a[i];
                if (rowMin < minActive) minActive = rowMin;
                if (rowMax > maxActive) maxActive = rowMax;
                active += (Ri - Li + 1);
            }

            if (active <= 0)
            {
                // No active candidates left: the only consistent answer is the boundary implied by counts.
                // Fall back to neighbors from this partition.
                if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
                return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
            }

            if (maxActive <= minActive) return minActive; // all remaining equal

            double mid = 0.5 * (minActive + maxActive);
            pivot = (mid > minActive && mid <= maxActive) ? mid : maxActive;
            prevCountBelow = countBelow;
            continue;
        }

        // === SHRINK ACTIVE WINDOW ===
        if (countBelow < kLow)
        {
            // Need larger differences: discard all strictly below pivot.
            for (int i = 0; i < n - 1; i++)
            {
                // First j with a[j] - a[i] >= pivot is j = i + 1 + cntRow (may be n => empty row)
                int newL = i + 1 + (int)rowCounts[i];
                if (newL > L[i]) L[i] = newL; // do NOT clamp; allow L[i] == n to mean empty
                if (L[i] > R[i])
                {
                    L[i] = 1;
                    R[i] = 0;
                } // mark empty
            }
        }
        else
        {
            // Too many below: keep only those strictly below pivot.
            for (int i = 0; i < n - 1; i++)
            {
                // Last j with a[j] - a[i] < pivot is j = i + cntRow  (not cntRow-1!)
                int newR = i + (int)rowCounts[i];
                if (newR < R[i]) R[i] = newR; // shrink downward to the true last-below
                if (R[i] < i + 1)
                {
                    L[i] = 1;
                    R[i] = 0;
                } // empty row if none remain
            }
        }

        prevCountBelow = countBelow;

        // === CHOOSE NEXT PIVOT FROM ACTIVE SET (weighted random row, then row median) ===
        long activeSize = 0;
        for (int i = 0; i < n - 1; i++)
        {
            if (L[i] <= R[i]) activeSize += (R[i] - L[i] + 1);
        }

        if (activeSize <= 2)
        {
            // Few candidates left: return midrange of remaining exactly.
            double minRem = double.PositiveInfinity, maxRem = double.NegativeInfinity;
            for (int i = 0; i < n - 1; i++)
            {
                if (L[i] > R[i]) continue;
                double lo = a[L[i]] - a[i];
                double hi = a[R[i]] - a[i];
                if (lo < minRem) minRem = lo;
                if (hi > maxRem) maxRem = hi;
            }

            if (activeSize <= 0) // safety net; fall back to boundary from last partition
            {
                if (kLow < kHigh) return 0.5 * (largestBelow + smallestAtOrAbove);
                return (countBelow >= kLow) ? largestBelow : smallestAtOrAbove;
            }

            if (kLow < kHigh) return 0.5 * (minRem + maxRem);

            // Odd N: decide between the remaining candidates by an exact rank check.
            // Count the differences strictly below maxRem with one more two-pointer sweep;
            // if at least kLow of them exist, the median lies below maxRem, hence equals minRem.
            long belowMax = 0;
            for (int i = 0, jj = 1; i < n - 1; i++)
            {
                if (jj < i + 1) jj = i + 1;
                while (jj < n && a[jj] - a[i] < maxRem) jj++;
                belowMax += jj - (i + 1);
            }
            return (belowMax >= kLow) ? minRem : maxRem;
        }
        else
        {
            long t = NextIndex(rng, activeSize); // 0..activeSize-1
            long acc = 0;
            int row = 0;
            for (; row < n - 1; row++)
            {
                if (L[row] > R[row]) continue;
                long size = R[row] - L[row] + 1;
                if (t < acc + size) break;
                acc += size;
            }

            // Median column of the selected row
            int col = (L[row] + R[row]) >> 1;
            pivot = a[col] - a[row];
        }
    }
}

// --- Helpers ---

private static double[] CopySorted(IReadOnlyList<double> values)
{
    var a = new double[values.Count];
    for (int i = 0; i < a.Length; i++)
    {
        double v = values[i];
        if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
        a[i] = v;
    }

    Array.Sort(a);
    return a;
}

private static double[] EnsureSorted(IReadOnlyList<double> values)
{
    // Copy to an array for fast indexed access and verify the caller-supplied order;
    // the selection logic silently produces wrong answers on unsorted input.
    var a = new double[values.Count];
    for (int i = 0; i < a.Length; i++)
    {
        double v = values[i];
        if (double.IsNaN(v)) throw new ArgumentException("NaN not allowed.", nameof(values));
        if (i > 0 && v < a[i - 1]) throw new ArgumentException("Input must be sorted when copyAndSort is false.", nameof(values));
        a[i] = v;
    }

    return a;
}

private static long NextIndex(Random rng, long limitExclusive)
{
    // Uniform 0..limitExclusive-1 even for large ranges.
    if (limitExclusive <= int.MaxValue)
    {
        return rng.Next((int)limitExclusive);
    }

    // Random.Next() yields 31-bit values in [0, int.MaxValue); two draws compose
    // a uniform value in [0, int.MaxValue^2), which is rejection-sampled down.
    // limitExclusive <= n(n-1)/2 < int.MaxValue^2, so the bound below is positive.
    const long range = (long)int.MaxValue * int.MaxValue;
    long bound = range - range % limitExclusive; // largest multiple of limitExclusive <= range
    while (true)
    {
        long u = (long)rng.Next() * int.MaxValue + rng.Next();
        if (u < bound) return u % limitExclusive;
    }
}

References

Geometry and Statistics: Problems at the Interface · 1976 · Michael Ian Shamos
Algorithm 616: fast computation of the Hodges-Lehmann location estimator · 1984-08-28 · John F Monahan
Fast Computation of the Hodges–Lehmann Estimator with the Monahan Algorithm · 2025-09-23 · Andrey Akinshin