I need to create a phrase frequency table that counts every phrase in a very large collection of a few million words. The end result would be a table like the one shown here: http://www.hermetic.ch/wfca/phrases.htm
What would be an efficient algorithm for this? It would be even better to see it implemented in Ruby, if you can show some specifics. I'm also open to using Xapian or Lucene, but I don't see an immediate way to get either of them to produce a frequency table like this.
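To make the question concrete, here is a minimal sketch of the naive baseline I have in mind. It assumes a "phrase" means a run of consecutive words (an n-gram) between a minimum and maximum length, and that tokenizing on letters, digits and apostrophes is good enough. The method names `phrase_counts` and `frequency_table` and their parameters are made up for illustration, and this is not necessarily how the linked tool works.

```ruby
# Naive baseline: count every run of min_len..max_len consecutive words.
# All names and defaults here are hypothetical, chosen only for illustration.
def phrase_counts(text, min_len: 2, max_len: 3)
  # Assumption: lowercase everything and treat letters, digits and
  # apostrophes as word characters; everything else is a separator.
  words = text.downcase.scan(/[a-z0-9']+/)
  counts = Hash.new(0)
  (min_len..max_len).each do |n|
    # each_cons yields every window of n consecutive words.
    words.each_cons(n) { |phrase| counts[phrase.join(' ')] += 1 }
  end
  counts
end

# Keep phrases seen at least min_count times, most frequent first,
# with ties broken alphabetically.
def frequency_table(counts, min_count: 2)
  counts.select { |_, c| c >= min_count }
        .sort_by { |phrase, c| [-c, phrase] }
end

table = frequency_table(phrase_counts("the cat sat on the mat the cat sat"))
table.each { |phrase, count| puts "#{count}\t#{phrase}" }
```

This keeps one hash entry per distinct phrase, so memory grows with the number of distinct n-grams. That is my main worry at a few million words, and it is why I'm asking whether there is something more efficient.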