
I previously asked a similar question on this topic. I ended up deriving several solutions that worked: one based on Bloom filters + n-grams, the other based on hash tables + n-grams. Both perform fine on small data sets (<1000 texts, usually tweets), but computation time grows roughly exponentially, so processing 10,000 texts can take hours.
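
For reference, the hash-table + n-gram approach I mentioned looks roughly like this (a simplified sketch, not my exact code):

    # Simplified sketch: tokenize each text, convert words to symbols,
    # and count every n-gram in a hash table.
    def ngram_counts(texts, n = 3)
      counts = Hash.new(0)
      texts.each do |text|
        words = text.downcase.scan(/[a-z']+/).map(&:to_sym)
        words.each_cons(n) { |gram| counts[gram] += 1 }
      end
      counts
    end

    # repeated = ngram_counts(tweets).select { |_gram, count| count > 1 }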

I am currently working in Ruby, and perhaps that is the problem, but are there any other solutions or approaches I could try to solve this problem?

benmcredmond
  • How do you store n-grams in Ruby? – Mladen Jablanović Jul 27 '10 at 20:02
    As an array of words? You might save lots of memory (perhaps gain some speed as well) by converting them to symbols beforehand. – Mladen Jablanović Jul 27 '10 at 20:24
  • I was assuming that you were tokenizing the words initially. And then the problem is essentially the same as a compression problem, which I wish I knew more about. But there are a fair number of compression algorithms around. – Yellowfog Jul 28 '10 at 08:55

2 Answers


If you are looking to do text searching in large sets of data, you might have to look into something like Solr. There is a really easy-to-set-up Solr gem called Sunspot: http://outoftime.github.com/sunspot/
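
For example, a rough sketch assuming sunspot_rails and an ActiveRecord model with a body column (adapt the model and field names to your data):

    # Rough sketch: declare which field Solr should index, then query it.
    class Tweet < ActiveRecord::Base
      searchable do
        text :body     # index the tweet text for full-text search
      end
    end

    Sunspot.index!(Tweet.all)                                  # push records to Solr
    results = Tweet.search { fulltext "some phrase" }.results  # matching records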

smnirven

Your problem can be solved by following the steps below (a rough Ruby sketch is given after the list):

  • (Optional, for performance) Run through all the documents and build a mapping from each unique word to an integer. It is also better to create a special mapping for sentence terminators (. ! ? etc.); this makes it easy to check that phrases do not cross sentence boundaries.
  • Concatenate all the documents into one huge array of the mapped integers from the previous step. This can be done online (to save space) as we go through the next steps.
  • Construct a suffix array of the string from the previous step, augmented with the longest-common-prefix (LCP) array. The fastest known implementation is SA-IS, which runs in O(n) worst-case time. See here. Some special handling is required to make sure that a common prefix does not cross a sentence boundary.
  • The LCP array is basically the result you need. You can do whatever you want with it, such as sorting it to find the longest repeated phrases among the documents, or finding all 5-word, 4-word, 3-word, etc. phrases. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP array and the suffix array.
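
A rough Ruby sketch of these steps (helper names are illustrative; the suffix array/LCP construction here is a naive O(n^2 log n) version just to show the idea, so use a proper SA-IS implementation for real data):

    # Steps 1-2: map each unique word (and sentence terminators) to an integer
    # and concatenate all documents into one big integer array.
    def build_corpus(documents)
      ids  = Hash.new { |h, w| h[w] = h.size }   # word -> unique integer
      stop = ids[:sentence_end]                  # special id for . ! ?
      corpus = []
      documents.each do |doc|
        doc.downcase.scan(/[a-z']+|[.!?]/) do |token|
          corpus << (token =~ /[.!?]/ ? stop : ids[token.to_sym])
        end
        corpus << stop                           # a document also ends a sentence
      end
      [corpus, stop]
    end

    # Steps 3-4 (naive version): build the suffix array by sorting suffixes,
    # then compute the LCP of neighbouring suffixes, never crossing a sentence boundary.
    def suffix_and_lcp(corpus, stop)
      sa  = (0...corpus.size).sort_by { |i| corpus[i..-1] }
      lcp = [0]
      (1...sa.size).each do |k|
        a, b, len = sa[k - 1], sa[k], 0
        while corpus[a + len] && corpus[a + len] == corpus[b + len] && corpus[a + len] != stop
          len += 1
        end
        lcp << len
      end
      [sa, lcp]
    end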

A quick Google search shows that this library contains a Ruby suffix array implementation. You can generate the LCP array from it in O(n); see this reference.
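
For instance, a minimal sketch of Kasai's O(n) algorithm for building the LCP array from an already-built suffix array (argument names are illustrative, not that library's actual API):

    # Kasai's algorithm: build the LCP array in O(n) from the text and its suffix array.
    def lcp_from_suffix_array(corpus, sa)
      n    = corpus.size
      rank = Array.new(n)
      sa.each_with_index { |suffix, i| rank[suffix] = i }
      lcp = Array.new(n, 0)
      h = 0
      (0...n).each do |i|
        if rank[i] > 0
          j = sa[rank[i] - 1]                    # suffix just before i in sorted order
          h += 1 while i + h < n && j + h < n && corpus[i + h] == corpus[j + h]
          lcp[rank[i]] = h
          h -= 1 if h > 0
        else
          h = 0
        end
      end
      lcp
    end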

Thanh DK