Matching quantile sets using likelihood based on the binomial coefficients


Let’s say we have a distribution \(X\) that is given by its \(s\)-quantile values:

\[q_{X_1} = Q_X(p_1),\; q_{X_2} = Q_X(p_2),\; \ldots,\; q_{X_{s-1}} = Q_X(p_{s-1}) \]

where \(Q_X\) is the quantile function of \(X\), \(p_j = j / s\).

We also have a sample \(y = \{y_1, y_2, \ldots, y_n \}\) that is given by its \(s\)-quantile estimations:

\[q_{y_1} = Q_y(p_1),\; q_{y_2} = Q_y(p_2),\; \ldots,\; q_{y_{s-1}} = Q_y(p_{s-1}), \]

where \(Q_y\) is the quantile estimation function for sample \(y\). We also assume that \(q_{y_0} = \min(y)\), \(q_{y_s} = \max(y)\).

We want to know the likelihood of “\(y\) is drawn from \(X\)”. In this post, I want to suggest a nice way to do this using the binomial coefficients.

The \(q_{X_j}\) quantile values split all the real numbers into \(s\) intervals:

\[(-\infty; q_{X_1}],\; (q_{X_1}; q_{X_2}],\; \ldots,\; (q_{X_{s-2}}; q_{X_{s-1}}],\; (q_{X_{s-1}}; \infty) \]

If we introduce \(q_{X_0} = -\infty\), \(q_{X_s}=\infty\), we can describe the \(j^\textrm{th}\) interval as \((q_{X_{j-1}}; q_{X_{j}}]\). Each of such intervals contains exactly \(1/s\) portion of the whole distribution:

\[F_X(q_{X_j}) - F_X(q_{X_{j-1}}) = 1/s, \]

where \(F_X\) is the CDF of \(X\).

If we knew the elements of sample \(y\), we would be able to match \(y_i\) to the corresponding intervals. Let’s consider another sample \(z = \{ z_1, z_2, \ldots, z_n \}\) where \(z_i\) is the index of the interval that contains \(z_i\):

\[z_i = \sum_{j=1}^{s} j \cdot \mathbf{1} \{ q_{X_{j-1}} < y_i \leq q_{X_j} \} \quad (i \in \{ 1..n\}). \]

Let \(k_j\) be the number of \(y\) elements that belong to the \(j^\textrm{th}\) segment:

\[k_j = \sum_{i=1}^n \mathbf{1} \{ z_i = j \} \quad (j \in \{ 1..s\}). \]

Unfortunately, we don’t know the exact values of \(y_i\), we know only \(s\)-quantile estimations \(q_{y_j}\). For this case, we could estimate the \(k_j\) values using linear interpolation between known quantiles:

\[k_j = n \cdot \sum_{l=1}^{s} \frac{1}{s} \frac{\max(0, \min(q_{y_l}, q_{X_j}) - \max(q_{y_{l-1}}, q_{X_{j-1}}) )}{q_{y_l} - q_{y_{l-1}}} \quad (j \in \{ 1..s\}). \]

(Here we assume that \(q_{y_l} \neq q_{y_{l-1}}\), such cases should be handled separately.)

Now we transform the original problem into the new one: what’s the likelihood of observing sample \(\{ z_i \}\) described by \(\{ k_j \}\) given that \(z_i\) is a random number from \(\{ 1, 2, \ldots, s \}\).

This is a simple combinatorial problem. The total number of different \(z\) samples is \(s^n\) (we have \(n\) elements, each element is one of \(s\) values). The number of ways to choose \(k_1\) elements that equal \(1\) is \(C_n^{k_1}\). Once we remove these elements from consideration, the number of ways to choose \(k_2\) elements that equal \(2\) is \(C_{n-k_1}^{k_2}\). If we continue this process, we will get the following equation for the likelihood:

\[\mathcal{L} = \frac{ C_n^{k_1} \cdot C_{n-k_1}^{k_2} \cdot C_{n-k_1-k_2}^{k_3} \cdot \ldots \cdot C_{n-k_1-k_2-\ldots-k_{s-1}}^{k_s}}{s^n} \]

Note that \(C_n^k\) usually assumes that \(n\) and \(k\) are integers. However, in our approach with the linear interpolation, \(k_j\) are real numbers. To work around this limitation, we could consider a generalization of \(C_n^k\) on real numbers using the gamma function \(\Gamma(n) = (n-1)!\):

\[C_n^k = \frac{n!}{k!(n-k)!} = \frac{\Gamma(n+1)}{\Gamma(k+1)\Gamma(n-k+1)} \]

In practice, it’s pretty hard to “honestly” calculate the likelihood using the above formulate so we switch to the log-likelihood notation:

\[\log\mathcal{L} = \log C_n^{k_1} + \log C_{n-k_1}^{k_2} + \log C_{n-k_1-k_2}^{k_3} + \ldots + \log C_{n-k_1-k_2-\ldots-k_{s-1}}^{k_s} - n \log s \]

It’s easy to see that \(\log C_n^k\) could be easily expressed via the log-gamma function:

\[\log C_n^k = \log \frac{\Gamma(n+1)}{\Gamma(k+1)\Gamma(n-k+1)} = \log\Gamma(n+1) - \log\Gamma(k+1) - \log\Gamma(n-k+1). \]

This log-likelihood could be used in various applications where we need a way to match sets of quantiles between each other. In one of the future posts, we will learn to apply this approach to change point detection.

The source code of this post and all the relevant files are available on GitHub.