Fast and efficient computation on arrays

Question

I want to count the number of occurrences of a particular phrase in a document, for example "stackoverflow forums". Suppose D represents the set of documents that contain both terms.

Now, suppose I have the following data structure:

A[numTerms][numMatchedDocuments][numOccurInADocument]

where numMatchedDocuments is the size of D and numOccurInADocument is the number of times a particular term occurs in a particular document. For example:

A[stackoverflow][document1][occurrence1]=3;

means that the term "stackoverflow" occurs in document "document1" and its first occurrence is at position 3.

Then I pick the term that occurs the least and loop over all its positions to check whether "forums" occurs at the position immediately after the current "stackoverflow" position. In other words, if "stackoverflow" is at position 3 and I find "forums" at position 4, then that is a phrase and I've found a match for it.

The matching is straightforward per document and runs reasonably fast, but when the number of documents exceeds 2,000,000 it gets very slow. I've distributed it over cores and it gets faster, of course, but I wonder if there is an algorithmically better way of doing this.

Thanks,

Pseudo-code:

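int numOfTerms = 2; // 0 for "stackoverflow" and 1 for "forums"
int phraseCount = 0; // total number of phrase matches over all documents
for (int d = 0; d < D.size(); d++) { // D is a set containing the matched documents
  int minId = getTheLeastOccuringTerm();
  for (int i = 0; i < A[minId][d].length; i++) { // for every position of the least occurring term
    boolean docPhrase = true; // reset for every candidate position
    for (int t = 0; t < numOfTerms && docPhrase; t++) { // for every term, stop at the first miss
      // term t has to sit at offset (t - minId) from the current position
      int id = BinarySearch(A[t][d], A[minId][d][i] - minId + t);
      if (id < 0) docPhrase = false;
    }
    if (docPhrase) phraseCount++; // every term was found at its expected position
  }
}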

Maybe post your current implementation in code just for reference. — OmniOwl, Dec 18 '12 at 00:07
@MelNicholson ... but wonder if there is algorithmically better way of doing this. — DotNet, Dec 18 '12 at 00:10
Do you need to store this all beforehand? Or can you populate the structure in real time (eg. as people search)? — sdasdadas, Dec 18 '12 at 00:11
@sdasdadas I'm not sure what you mean by "store". The array is not stored but fetched from an index, and that is fast and not a problem. Counting is. — DotNet, Dec 18 '12 at 00:13
Sounds like the problem that Suffix Arrays solve. http://en.wikipedia.org/wiki/Suffix_array This answer I gave to a slightly different question shows a simple implementation of a Suffix Array: http://stackoverflow.com/questions/10606728/fastest-way-to-search-a-list-of-names-in-c-sharp There are a fair number of implementations floating around here on SO and on the web. — hatchet - done with SOverflow, Dec 18 '12 at 00:14
The nice thing about the Suffix Array is that you can search directly for any particular phrase, and it will find all occurrences. A java version of what I wrote for the answer I linked to above should be very close to what you're looking for, except your documents are what I'm saying are strings. — hatchet - done with SOverflow, Dec 18 '12 at 00:27
@hatchet The size is about 1200 words. Thanks for pointing out the Suffix Array, but I need the array structure. Can a Suffix Array pinpoint at which positions the phrase is found? — DotNet, Dec 18 '12 at 00:39
What are you trying to achieve? NoSQL databases like lucene make this question moot IMHO. They perform extremely well and can deal with the same type of problem that this question hints at. — Bohemian, Dec 18 '12 at 01:02
@DotNet - in the code I linked to, it returns the indexes of all documents having the phrase, and all positions within each of those documents where the phrase was found. — hatchet - done with SOverflow, Dec 18 '12 at 01:07
@Bohemian I am trying to achieve exactly what I've stated in the question :) a fast and efficient method of doing this. I'm sure Lucene does that but the question is how ;) — DotNet, Dec 18 '12 at 01:09

score 2 · Accepted Answer · edited May 23 '17 at 12:27

As I mentioned in the comments, a Suffix Array can solve this sort of problem. I answered a similar question (Fastest way to search a list of names in C#) with a simple C# implementation of a Suffix Array.

The basic idea is that you have an array of index pairs, each holding a document index and a position within that document. An index pair represents the string that starts at that point in the document and continues to the end of the document. The actual documents and their contents exist only once, in your original store. The Suffix Array is just an array of these index pairs, with one pair for every position in every document. You then sort the Suffix Array by the text the pairs point to. Once it is sorted, you can very quickly find any phrase in any of the documents by doing a simple binary search on the Suffix Array. Constructing (mainly sorting) the Suffix Array can be time-consuming, but once constructed it is very fast to search. It's fairly easy on memory, since the actual document contents exist only once.
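To make the idea concrete, here is a minimal Java sketch. It is not the C# code from the linked answer: it assumes each document is already split into terms, so a suffix starts at a word position rather than a character position, and it builds the array with a plain comparison sort instead of a specialised construction algorithm. The names `DocSuffixArray`, `find` and `count` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Word-level suffix array over several documents (illustrative sketch, hypothetical names). */
final class DocSuffixArray {
    private final String[][] docs;  // docs[d][p] = term at position p of document d
    private final int[][] suffixes; // each entry is {docIndex, position}, sorted by the text it points to

    DocSuffixArray(String[][] docs) {
        this.docs = docs;
        List<int[]> all = new ArrayList<>();
        for (int d = 0; d < docs.length; d++) {
            for (int p = 0; p < docs[d].length; p++) {
                all.add(new int[] {d, p}); // one index pair per position in every document
            }
        }
        suffixes = all.toArray(new int[0][]);
        // Naive construction: an ordinary comparison sort over all suffixes.
        Arrays.sort(suffixes, this::compareSuffixes);
    }

    /** Orders two suffixes term by term; a suffix that is a prefix of the other sorts first. */
    private int compareSuffixes(int[] a, int[] b) {
        String[] x = docs[a[0]], y = docs[b[0]];
        int i = a[1], j = b[1];
        while (i < x.length && j < y.length) {
            int c = x[i++].compareTo(y[j++]);
            if (c != 0) return c;
        }
        return Integer.compare(x.length - i, y.length - j);
    }

    /** Negative if the suffix sorts before the phrase, 0 if it starts with the phrase, positive otherwise. */
    private int compareToPhrase(int[] s, String[] phrase) {
        String[] doc = docs[s[0]];
        for (int k = 0; k < phrase.length; k++) {
            int p = s[1] + k;
            if (p >= doc.length) return -1; // suffix ended early, so it sorts before the phrase
            int c = doc[p].compareTo(phrase[k]);
            if (c != 0) return c;
        }
        return 0;
    }

    /** Binary search for the first suffix that compares >= 0 (or > 0 if strict) to the phrase. */
    private int bound(String[] phrase, boolean strict) {
        int lo = 0, hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareToPhrase(suffixes[mid], phrase);
            if (c < 0 || (strict && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** All {docIndex, position} pairs at which the phrase starts. */
    List<int[]> find(String[] phrase) {
        List<int[]> hits = new ArrayList<>();
        for (int i = bound(phrase, false), end = bound(phrase, true); i < end; i++) {
            hits.add(suffixes[i]);
        }
        return hits;
    }

    /** Total number of occurrences of the phrase across all documents. */
    int count(String[] phrase) {
        return bound(phrase, true) - bound(phrase, false);
    }
}
```

All suffixes that start with the phrase sit next to each other in the sorted array, so two binary searches give the matching range, and the total count is just the size of that range. Each comparison during a search looks at no more terms than the phrase contains.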

It would be trivial to extend it to return counts of phrase matches within each document.
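As a rough sketch of that extension, assume the search hands back its matches as {docIndex, position} pairs (a hypothetical shape, and `PhraseCounts` is a made-up name). The per-document counts are then a simple tally over the first element of each pair:

```java
import java.util.Map;
import java.util.TreeMap;

/** Tallies phrase matches per document (illustrative sketch, hypothetical names). */
final class PhraseCounts {
    /** hits[i] = {docIndex, position} of one phrase match, in any order. */
    static Map<Integer, Integer> perDocument(int[][] hits) {
        Map<Integer, Integer> counts = new TreeMap<>(); // docIndex -> number of matches
        for (int[] hit : hits) {
            counts.merge(hit[0], 1, Integer::sum);
        }
        return counts;
    }
}
```

Documents without any match simply do not appear in the map.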

This is a little different from the classic description of a Suffix Array, which usually has the Suffix Array operating over one single, very large string. The changes needed to make it work for an array of strings/documents are not that large, although they can increase the amount of memory consumed by the Suffix Array, depending on the maximum number of documents, the maximum document length, and how you encode the index pairs.
