\(\newcommand\u{\boldsymbol{u}}\newcommand\v{\boldsymbol{v}}\newcommand\A{\boldsymbol{A}}\newcommand\inner[1]{\langle #1 \rangle}\newcommand\R{\mathbb{R}}\newcommand\P{\mathbf{P}}\newcommand\E{\mathbf{E}}\newcommand\TV{\mathsf{TV}}\newcommand\e{\boldsymbol{e}}\newcommand\Bernoulli{\operatorname{Bernoulli}}\newcommand\a{\boldsymbol{a}}\newcommand\argmin{\operatorname*{arg\,min}}\)In the Maximum Inner Product Search (\(\mathsf{MIPS}\)) problem, we are given a collection of vectors \(\u_1,\u_2,\ldots,\u_n\in\R^d\) and a vector \(\v\in\R^d\), and asked to return an index \(i_\star\in[n]\) so that \(\inner{\v,\u_{i_\star}} = \max_{i\in[n]}\inner{\v,\u_i}\). Naively computing each inner product \(\inner{\v,\u_i}\) and reporting the maximizer takes \(\mathcal{O}(nd)\) floating point operations, which can be excessively large in certain circumstances, and so there has been much work in trying to reduce the runtime of this problem by either preprocessing or sub-sampling coordinates to get an approximate solution. For instance, Liu et al. [1] gave a randomized method which samples values \(\v_j\u_{ij}\) adaptively to give a solution \(\hat i_\star\) in \(\mathcal{O}(n\sqrt{d})\) time satisfying \(\inner{\v,\u_{\hat i_\star}} \geq \max_{i\in[n]}\inner{\v,\u_{i}} - d\epsilon\) with high probability. One might hope to give an approximate solution \(\hat i_\star\) satisfying the more reasonable relative error bound \(\inner{\v,\u_{\hat i_\star}} \geq (1-\epsilon)\max_{i\in[n]}\inner{\v,\u_i}\) without dimensional dependence in the error, but this post shows that is impossible: any potentially randomized and adaptive algorithm to solve the \(\mathsf{MIPS}\) problem to \(\epsilon\)-relative error without pre-processing and failure probability \(\delta < 1/20\) must query \(\Omega(nd)\) coordinates \(\u_{ij}\). This lower bound holds even if the input vectors \(\u_1,\u_2,\ldots,\u_n,\v \in\{0,1\}^d\) are all binary.
Definition: [\(\mathsf{MIPS}\) Solver] A \(\mathsf{MIPS}\) solver is a randomized algorithm for computing $$i_\star \in \argmin_{i\in[n]}\inner{\v,\u_i}$$ via access to the oracle \(\u_{ij}\). The algorithm \(h = (f, g, q)\) is specified by two functions and a number of queries to make. The algorithm maintains as state a set \(Q \subseteq [d]\times[n]\times\R\times\R\) of listed queries and their associated responses. Upon initialization, \(Q = \varnothing\) is empty. At each step \(k=1,2,\ldots,q\) of the algorithm we sample indices \((i,j) = f(Q)\) whose distribution is determined based on our state and update \(Q = Q \cup (i, j, \v_j, \u_{ij})\). After step \(q\) of the algorithm, we output \(\hat i_\star = h(\u_1,\ldots,\u_n,\v) = g(Q)\).
A 'good' \(\mathsf{MIPS}\) solver is one which outputs small relative error with high probability after few queries \(q\).Definition: An \((\epsilon,\delta)\)-\(\mathsf{MIPS}\) solver is a \(\mathsf{MIPS}\) solver \(h\) which satisfies \[ \P\bigl(\inner{\v,\u_{\hat i_\star}} \geq (1-\epsilon) \max_{i\in[n]} \inner{\v,\u_i}\bigr) \geq 1-\delta \] for any input vectors \(\u_1,\u_2,\ldots,\u_n,\v\in\R^d\) in some family. Usually we will take the family to be the set of all input vectors whose components are bounded in magnitude by some constant \(K\). The minimal number of queries \(q\) used to achieve this with-high-probability relative error condition is called the sample complexity of the \(\mathsf{MIPS}\) solver \(h\).
Definition: [Total Variation Distance] Let \(P\) and \(Q\) be probability measures defined on a sigma algebra \(\Sigma\subseteq2^\Omega\). The total variation distance between \(P\) and \(Q\) is defined to be \[ d_{\TV}(P,Q) = \sup_{A\in \Sigma} |P(A) - Q(A)|, \] the maximal distance between the measures of a single event. We can also write \[ d_{\TV}(P,Q) = \frac{1}{2}\|dP-dQ\|_1 \] whenever the Radon-Nikodym theorem gives us density functions \(dP\) and \(dQ\) for the respective measures.
The upcoming Lemma gives us the desired lower bound on the failure probability of a decision algorithm trying to distinguish a sample from \(P\) or \(Q\). We highlight that this result is not new, just that I could not find a proof explicly given anywhere. For a statement of this result elsewhere, see [2, Prop. 1].Lemma 1: Let \(P\) and \(Q\) be probability measures defined on a sigma algebra \(\Sigma\subseteq2^\Omega\). Suppose we are given a sample \(x\in\Omega\) which is taken from either \(P\) or \(Q\). Which distribution \(x\) comes from is chosen randomly with equal odds. The failure probability of any algorithm which takes in \(x\) and distinguishes whether it came from \(P\) or \(Q\) is at least \(\tfrac{1}{2} - \tfrac{1}{2}d_{\TV}(P,Q)\).
Note that this Lemma is trivial in the degenerate case that \(P=Q\), since then the sample doesn't give any information and the best we can do is guess. This has an error probability of \(1/2\) since the underlying distribution is chosen equiprobably. Similarly, if \(P\) and \(Q\) have disjoint support, \(d_{\TV}(P,Q) = 1\) and the sample \(x\) comes almost surely from \(P\) if \(x\in\operatorname{supp}(P)\) and similarly for \(Q\). Thus the failure probability is \(0 = \tfrac{1}{2} - \tfrac{1}{2}d_{\TV}(P,Q)\) and the bound is tight from both directions.Proof: [Proof of Lemma 1] First assume that the decision algorithm is deterministic. Any such decision algorithm which takes in a sample \(x\) and decides whether it came from \(P\) or \(Q\) can be represented as a function \(f : \Omega \to \{0,1\}\), where \(f(x) = 1\) characterizes the algorithm asserting that the sample \(x\) came from the distribution \(P\). Let \(A_P = \{x\in\Omega : f(x) = 1\}\) and \(A_Q = \Omega - A_P\). The probability our algorithm succeeds is \begin{align*} p = \frac{1}{2}\P(x \in A_P | x\sim P) + \frac{1}{2}\P(x\in A_Q | x\sim Q) &= \frac{1}{2}\int_{A_P} dP + \frac{1}{2}\int_{A_Q}dQ\\ &= \frac{1}{2}\int_{A_P} dP - \frac{1}{2}\int_{A_P}dQ + \frac{1}{2}\int_{A_Q}dQ + \frac{1}{2}\int_{A_P}dQ\\ &= \frac{1}{2} + \frac{1}{2} \int_{A_P}\bigl(dP - dQ\bigr). \end{align*} By two applications of the triangle inequality, \[ p \leq \frac{1}{2} + \frac{1}{2}\int_{A_P} \bigl|dP - dQ\bigr| \leq \frac{1}{2} + \frac{1}{2}\int_{\Omega}\bigl|dP - dQ\bigr| = \frac{1}{2} + \frac{1}{2}d_{\TV}(P,Q) \] as desired. If, on the other hand, our decision algorithm is random, we can identify it with a family \(\{f_\alpha\}_{\alpha\in\Gamma}\) of deterministic decision functions and an underlying measure \(\mu\) on \(\Gamma\). The algorithm samples a function \(f_\alpha\) according to \(\mu\) before returning the decision corresponding to \(f_\alpha(x)\). The success probability of this algorithm is \[ q = \E_{\alpha} \P(f_\alpha(x)\text{ is correct}| \alpha) \leq \E_{\alpha}\bigl(\frac{1}{2} + \frac{1}{2}d_{\TV}(P,Q)\bigr) = \frac{1}{2} + \frac{1}{2}d_{\TV}(P,Q) \] as in the deterministic case. Thus the failure probability \[ 1-q \geq \frac{1}{2} - \frac{1}{2}d_{\TV}(P,Q) \] and the proof is complete.
Definition: [Distinguishing Matrices] In this setting we are given two distributions over matrices of dimension \(d\times n\) where \(n\) is even:
Lemma 2: Let \(\epsilon < 1\), \(k \geq \log\tfrac{4}{\delta}\), and \(nd\) sufficiently large. Any \((\epsilon,\delta/2)\)-\(\mathsf{MIPS}\) solver using at most \(q\) queries provides an algorithm for solving \(k\)-\(\mathsf{DIST}\) with failure probability at most \(\delta\) in the same number of samples.
Proof: Let \(\A\) be our sample from the \(\mathsf{DIST}\) problem, and and let \(\hat i_\star\) be the result of our \(\mathsf{MIPS}\) solver applied with \(\u_i\) being the \(i\)-th column of \(\A\) and \(\v=\e\) being the vector of all ones. \(\A\) will have a value of \(1\) somewhere in the matrix with probability \(1 - (1 - \tfrac{2k}{nd})^{nd/2} \geq 1 - 2e^{-k} \geq 1-\delta/2\) for sufficiently large \(nd\). In this case that a one exists, the returned index \(\hat i_\star\) will satisfy $$\inner{\v,\u_{\hat i_\star}} \geq (1-\epsilon)\max_{i\in[n]}\inner{\v,\u_i} \geq (1-\epsilon) > 0,$$ with probability at least \(1-\delta/2\). In other words, the \(\hat i_\star\)-th column of \(\A\) has a one in it except with probability at most \(\delta/2 + \delta/2 = \delta\). In the case that \(\A\sim P\), the only ones in \(\A\) can occur for columns \(i\) with \(i \leq n/2\). On the other hand, if \(\A\sim Q\) then the only ones in \(\A\) can occur at column \(i\) with \(i > n/2\). Thus, the function \[ D(\A) = \begin{cases} P & \text{if \(h(\A_{:,1},\A_{:,2},\ldots,\A_{:,n},\e) \leq n/2\)}\\ Q & \text{otherwise} \end{cases} \] will return the correct answer to \(k\)-\(\mathsf{DIST}\) with probability at least \(1-\delta\), as desired.
Thus, to prove that an \((\epsilon,\delta)\)-\(\mathsf{MIPS}\) solver needs \(\Omega(nd)\) queries we can show that this many queries are needed to solve \(k\)-\(\mathsf{DIST}\) in the family. This relies on Lemma 1.Theorem 3: Any algorithm used to solve \(k\)-\(\mathsf{DIST}\) with failure probability at most \(\delta\) needs to use at least $$q \geq \frac{\log\tfrac{1}{2\delta}}{\log\tfrac{1}{1-\tfrac{2k}{nd}}}$$ queries to the oracle. For \(k = o(nd)\) and sufficiently large \(nd\), this implies the number of queries $$q \geq \frac{nd}{4k}\log\frac{1}{2\delta}.$$
Proof: Suppose we have an algorithm which can solve \(k\)-\(\mathsf{DIST}\) with failure probability at most \(\delta\) using no more than \(q\) queries to the oracle. Conditional on the query sequence \((i_1,j_1),\ldots,(i_q,j_q)\) made by the algorithm on an input, the set of oracle responses are equivalent in distribution to a sample from the distribution \(\Bernoulli(0)^{\otimes q_1} \otimes \Bernoulli(\tfrac{2k}{nd})^{\otimes q_2}\) where \(q_1\leq q\) is the number of queries with \(i_\ell \leq n/2\) if \(\A\sim P\) or \(i_\ell > n/2\) is \(\A\sim Q\), and \(q_2 = q - q_1\). In particular, \(\max\{q_1,q_2\} \leq q\), and so unconditionally on the query sequence, any algorithm for this problem must fail with probability at least as high as the smallest failure probability of an algorithm which distinguishes from the following two distributions over length \(2q\) bit vectors via one sample chosen uniformly at random from \(P'\) or \(Q'\):
Theorem 4: Suppose \(\epsilon < 1\) and fix \(\delta < 1/20\). Any \((\epsilon,\delta)\)-\(\mathsf{MIPS}\) solver requires greater than \(\tfrac{nd}{16}\) queries to the sampling oracle if \(nd\) is sufficiently large.
Proof: By Lemma 2, an \((\epsilon,\delta/2)\)-\(\mathsf{MIPS}\) solver using at most \(q\) sampling oracle queries can be used to generate an algorithm for \(k\)-\(\mathsf{DIST}\) which uses at most \(q\) oracle queries and fails with probability at most \(\delta\), so long as \(k \geq \log\tfrac{4}{\delta}\) and \(nd\) is large enough. In particular, the algorithm solves \(\lceil \log \tfrac{4}{\delta}\rceil\)-\(\mathsf{DIST}\) with these properties for large enough \(nd\). Since \(\delta\) is a constant, Theorem 3 ensures that this \((\epsilon,\delta/2)\)-\(\mathsf{MIPS}\) solver requires \[ q \geq \frac{nd}{4k}\log\frac{1}{2\delta} = \frac{nd}{4}\frac{\log\tfrac{1}{\delta} + \log\tfrac{1}{2}}{\lceil\log\tfrac{1}{\delta} + \log 4\rceil} \geq \frac{nd}{16} \] so long as \(nd\) is large enough and \(\log\tfrac{1}{\delta}\geq 2\), or \(\delta \leq e^{-2}\). Observing that \(1/20 < e^{-2}/2\) and replacing \(\delta/2\) with \(\delta\) in the bound gives the reported result.
The work of Liu et al. [1] showed that one could get an additive \(d\epsilon\)-approximate solution to the \(\mathsf{MIPS}\) problem with high probability via \(\mathcal{O}(n\sqrt{d})\) samples to coordinates \(\u_{ij}\). One might hope to get a sublinear time preprocessing-free \(\mathsf{MIPS}\) solver whose error does not scale with the underlying dimension. This work shows that this is impossible, at least for \(\epsilon\)-relative error: any \(\mathsf{MIPS}\) solver must sample at least \(\Omega(nd)\) of the \(\mathcal{O}(nd)\) available coordinates \(\u_{ij}\), even if the coordinates are all bounded in the interval \([0,1]\).
The question of whether an \(\mathcal{O}(n\sqrt{d})\) time solution to \(\mathsf{MIPS}\) with an additive error \(d\epsilon\) is optimal is still open, however. This is left for future work.
Date: August 11, 2019