
I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX, is suitable for my purposes.

What's the best and most cost-efficient way to do it in the next day or so?

My thoughts are:

  • PostgreSQL using regex to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have regex to do the splitting (major accuracy isn't required here)
  • MapReduce with Hadoop
  • MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about

My experience with Hadoop consists of calculating Pi on three EC2 instances very very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce. PostgreSQL I fear will take a long, long time, as it's not easily parallelisable.

Any other ways to do it? What's my best bet for getting it done in the next couple days?

Max

2 Answers


MapReduce will work just fine, and you could probably do most of the input-output shuffling with Pig.

See http://arxiv.org/abs/1207.4371 for some algorithms.

Of course, to get a running start, you don't actually need MapReduce for this task: just split the input yourself, write the simplest fast program that counts the ngrams of a single input file, and aggregate the ngram frequencies afterwards.
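If you go that route, a minimal single-file counter in plain Java (no Hadoop) could look something like the sketch below; the class name, the trigram size and the regex tokeniser are illustrative assumptions on my part, not anything prescribed by the question:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Counts word trigrams in one plain-text file; run one copy per input chunk
// and merge the per-chunk "ngram<TAB>count" outputs afterwards.
public class SimpleNgramCounter {
    public static void main(String[] args) throws IOException {
        final int n = 3; // ngram size
        Map<String, Long> counts = new HashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            lines.forEach(line -> {
                // Crude regex tokenisation; the question says accuracy is not critical.
                String[] words = line.toLowerCase().split("\\W+");
                for (int i = 0; i + n <= words.length; i++) {
                    String ngram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                    counts.merge(ngram, 1L, Long::sum);
                }
            });
        }
        // Tab-separated output makes the later merge/sum pass trivial.
        counts.forEach((ngram, count) -> System.out.println(ngram + "\t" + count));
    }
}
```

You would run one copy per chunk of the WEX dump and combine the tab-separated outputs with a final sort-and-sum pass.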

tjltjl

Hadoop gives you two things that are, in my opinion, the main ones here: parallel task execution (map-only jobs) and a distributed sort (the shuffle between map and reduce).
For ngrams, it looks like you need both: parallel tasks (mappers) to emit the ngrams, and the shuffle to count the occurrences of each ngram.
So I think Hadoop is an ideal solution here.
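To make that mapper/shuffle split concrete, here is a rough sketch in Java modelled on the standard Hadoop word-count skeleton (the org.apache.hadoop.mapreduce API); the class names, the trigram size and the regex tokeniser are my own assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCount {
    // Map: emit (ngram, 1) for every trigram in the input line.
    public static class NgramMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final int N = 3;
        private static final LongWritable ONE = new LongWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] words = value.toString().toLowerCase().split("\\W+");
            for (int i = 0; i + N <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = 1; j < N; j++) sb.append(' ').append(words[i + j]);
                ngram.set(sb.toString());
                context.write(ngram, ONE);
            }
        }
    }

    // Reduce: the shuffle groups identical ngrams, so just sum their counts.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) sum += v.get();
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCount.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class); // combiner cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner pre-aggregates counts on each map node, which should cut the shuffle volume considerably given how skewed ngram frequencies tend to be.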

David Gruzman