
I previously asked a similar question on this topic. I ended up deriving several solutions that worked: one based on Bloom filters + n-grams, the other based on hash tables + n-grams. Both perform fine on small data sets (<1000 texts, usually tweets), but computation time grows roughly exponentially, so processing 10,000 texts can take hours.
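
For reference, the hash-table + n-gram approach I mentioned looks roughly like this (a simplified sketch, not my exact code):

    # Simplified sketch: tokenize each text, convert words to symbols,
    # and count every n-gram in a hash table.
    def ngram_counts(texts, n = 3)
      counts = Hash.new(0)
      texts.each do |text|
        words = text.downcase.scan(/[a-z']+/).map(&:to_sym)
        words.each_cons(n) { |gram| counts[gram] += 1 }
      end
      counts
    end

    # repeated = ngram_counts(tweets).select { |_gram, count| count > 1 }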

I am currently working in Ruby, and perhaps that is the problem, but are there any other solutions or approaches I could try to solve this problem?

benmcredmond
  • How do you store n-grams in Ruby? – Mladen Jablanović Jul 27 '10 at 20:02
    As an array of words? You might save lots of memory (perhaps gain some speed as well) by converting them to symbols beforehand. – Mladen Jablanović Jul 27 '10 at 20:24
  • I was assuming that you were tokenizing the words initially. And then the problem is essentially the same as a compression problem, which I wish I knew more about. But there are a fair number of compression algorithms around. – Yellowfog Jul 28 '10 at 08:55

2 Answers


If you are looking to do text searching in large sets of data, you might have to look into something like Solr. There is a really easy-to-set-up Solr gem called Sunspot: http://outoftime.github.com/sunspot/
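
For example, a rough sketch assuming sunspot_rails and an ActiveRecord model with a body column (adapt the model and field names to your data):

    # Rough sketch: declare which field Solr should index, then query it.
    class Tweet < ActiveRecord::Base
      searchable do
        text :body     # index the tweet text for full-text search
      end
    end

    Sunspot.index!(Tweet.all)                                  # push records to Solr
    results = Tweet.search { fulltext "some phrase" }.results  # matching records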

smnirven

Your problem can be solved by following the steps below (a rough Ruby sketch is given after the list):

  • (Optional, for performance) Run through all the documents and build a mapping from each unique word to an integer. It is also better to create a special mapping for sentence terminators (. ! ? etc.); this makes it easy to check that phrases do not cross sentence boundaries.
  • Concatenate all the documents into one huge array of the mapped integers from the previous step. This can be done online (to save space) as we go through the next steps.
  • Construct a suffix array of the string from the previous step, augmented with the longest-common-prefix (LCP) array. The fastest known implementation is SA-IS, which runs in O(n) worst-case time. See here. Some special handling is required to make sure that a common prefix does not cross a sentence boundary.
  • The LCP array is basically the result you need. You can do whatever you want with it, such as sorting it to find the longest repeated phrases among the documents, or finding all 5-word, 4-word, 3-word, etc. phrases. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP array and the suffix array.
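
A rough Ruby sketch of these steps (helper names are illustrative; the suffix array/LCP construction here is a naive O(n^2 log n) version just to show the idea, so use a proper SA-IS implementation for real data):

    # Steps 1-2: map each unique word (and sentence terminators) to an integer
    # and concatenate all documents into one big integer array.
    def build_corpus(documents)
      ids  = Hash.new { |h, w| h[w] = h.size }   # word -> unique integer
      stop = ids[:sentence_end]                  # special id for . ! ?
      corpus = []
      documents.each do |doc|
        doc.downcase.scan(/[a-z']+|[.!?]/) do |token|
          corpus << (token =~ /[.!?]/ ? stop : ids[token.to_sym])
        end
        corpus << stop                           # a document also ends a sentence
      end
      [corpus, stop]
    end

    # Steps 3-4 (naive version): build the suffix array by sorting suffixes,
    # then compute the LCP of neighbouring suffixes, never crossing a sentence boundary.
    def suffix_and_lcp(corpus, stop)
      sa  = (0...corpus.size).sort_by { |i| corpus[i..-1] }
      lcp = [0]
      (1...sa.size).each do |k|
        a, b, len = sa[k - 1], sa[k], 0
        while corpus[a + len] && corpus[a + len] == corpus[b + len] && corpus[a + len] != stop
          len += 1
        end
        lcp << len
      end
      [sa, lcp]
    end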

A quick Google search shows that this library contains a Ruby suffix array implementation. You can generate the LCP array from it in O(n); see this reference.
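
For instance, a minimal sketch of Kasai's O(n) algorithm for building the LCP array from an already-built suffix array (argument names are illustrative, not that library's actual API):

    # Kasai's algorithm: build the LCP array in O(n) from the text and its suffix array.
    def lcp_from_suffix_array(corpus, sa)
      n    = corpus.size
      rank = Array.new(n)
      sa.each_with_index { |suffix, i| rank[suffix] = i }
      lcp = Array.new(n, 0)
      h = 0
      (0...n).each do |i|
        if rank[i] > 0
          j = sa[rank[i] - 1]                    # suffix just before i in sorted order
          h += 1 while i + h < n && j + h < n && corpus[i + h] == corpus[j + h]
          lcp[rank[i]] = h
          h -= 1 if h > 0
        else
          h = 0
        end
      end
      lcp
    end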

Thanh DK