It sounds like an inverted index would still be useful to you. It's a map from words to ordered lists of the documents containing them. The documents need a common, total ordering, and it is in that order that they appear in their per-word buckets.
Assuming your `n` is the total length of the corpus in word occurrences (not the vocabulary size), the index can be constructed in O(n log n) time and linear space.
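For concreteness, here's a minimal sketch of building such an index in Python (the `build_index` name and the dict-of-lists representation are just illustrative choices):

```python
from collections import defaultdict

def build_index(docs):
    # docs: a list of token lists; a document's ID is its position in
    # the list, which supplies the common total ordering.
    index = defaultdict(list)
    for doc_id, words in enumerate(docs):
        for word in set(words):         # record each document once per word
            index[word].append(doc_id)  # IDs arrive in increasing order, so
    return index                        # every postings list stays sorted
```

(With a hash map as here, construction is expected linear time; the O(n log n) bound corresponds to keeping the vocabulary in a sorted structure instead.)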
Given `P1` and `P2`, you make two separate queries to get the documents containing each term. Since the two lists share a common ordering, you can run a linear, merge-like pass and select just those documents containing `P1` but not `P2`:
```python
def p1_and_not_p2(p1_docs, p2_docs):
    """Return the documents containing P1 but not P2.

    Both postings lists must be sorted by the common total document ordering.
    """
    c1, c2 = 0, 0  # cursors into the two postings lists
    results = []   # our return value
    while c1 < len(p1_docs):
        if c2 == len(p2_docs) or p1_docs[c1] < p2_docs[c2]:
            results.append(p1_docs[c1])  # this doc cannot appear in p2_docs
            c1 += 1
        elif p1_docs[c1] == p2_docs[c2]:
            c1 += 1  # doc contains both terms: skip it in both lists
            c2 += 1
        else:  # p1_docs[c1] > p2_docs[c2]
            c2 += 1
    return results
```
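For example, with made-up doc IDs:

```python
>>> p1_and_not_p2([0, 2, 4, 5], [1, 2, 5])
[0, 4]
```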
Notice that every pass of the loop advances at least one cursor, so it runs in time linear in the sum of the sizes of the two initial queries. Since only entries seen through the `c1` cursor enter `results`, we know all results contain `P1`.
Finally, note that we only ever advance the "lagging" cursor: this (together with the total document ordering) guarantees that if a document appears in both initial queries, there will be a loop iteration where both cursors point at it. When that iteration occurs, the middle clause necessarily kicks in and the document is skipped by advancing both cursors. Thus documents containing `P2` never get added to `results`.
This query is an example of a general class called Boolean queries; it's possible to extend this algorithm to cover almost any Boolean expression (the conjunctive case is sketched below). Certain queries break the efficiency of the algorithm (by forcing it to walk over the entire vocabulary space), but so long as you don't negate every term (i.e. don't ask for `not P1 and not P2`) you're fine. See this for an in-depth treatment.
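To illustrate that extensibility, here is the same merge skeleton adapted to a conjunctive query, `P1 and P2`: a document is kept only when both cursors agree (again just a sketch; the name is illustrative).

```python
def p1_and_p2(p1_docs, p2_docs):
    # Same linear merge over two sorted postings lists, different branch actions.
    c1, c2, results = 0, 0, []
    while c1 < len(p1_docs) and c2 < len(p2_docs):
        if p1_docs[c1] == p2_docs[c2]:
            results.append(p1_docs[c1])  # doc contains both terms: keep it
            c1 += 1
            c2 += 1
        elif p1_docs[c1] < p2_docs[c2]:
            c1 += 1  # advance whichever cursor lags
        else:
            c2 += 1
    return results
```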