I'd like to generate n-gram frequencies for a large dataset. Wikipedia, or more specifically, Freebase's WEX, is suitable for my purposes.
What's the best and most cost-efficient way to do it in the next day or so?
My thoughts are:
- PostgreSQL, using regexes to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have the regexes to do the splitting (high accuracy isn't required here)
- MapReduce with Hadoop (see the sketch after this list)
- MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about
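For either MapReduce option, the job is essentially word count with a sliding window over the tokens. Here's a minimal sketch of what I have in mind, assuming a reasonably recent Hadoop with the `org.apache.hadoop.mapreduce` API and plain-text input (one line of article text per record via the default TextInputFormat); the class names, the trigram size, and the crude tokenising regex are placeholders rather than anything tuned for WEX's actual layout:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCount {

    public static class NgramMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final int N = 3;                      // trigrams; change as needed
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Crude tokenisation: lower-case and split on anything that isn't a letter.
            String cleaned = value.toString().toLowerCase().replaceAll("[^a-z]+", " ").trim();
            if (cleaned.isEmpty()) {
                return;
            }
            String[] tokens = cleaned.split(" ");
            // Emit every window of N consecutive tokens with a count of 1.
            for (int i = 0; i + N <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < N; j++) {
                    sb.append(' ').append(tokens[i + j]);
                }
                ngram.set(sb.toString());
                context.write(ngram, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            total.set(sum);
            context.write(key, total);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCount.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregate locally to shrink the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory of plain text
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As far as I understand, the same jar should run unmodified on Elastic MapReduce, so the Hadoop and EMR options really only differ in who runs the cluster.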
My experience with Hadoop so far consists of calculating Pi on three EC2 instances, very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce. I fear PostgreSQL will take a long, long time, as it isn't easily parallelisable.
Are there other ways to do it? What's my best bet for getting this done in the next couple of days?