
I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words, for each target word, from the Google Books n-gram corpus.

Here is the link to the full dataset:

Google Ngram Viewer

As is evident, the dataset is approximately 2.2 TB and contains a few hundred billion rows. To compute the co-occurrence statistics I need to process the whole dataset for every possible pair of target and context words. I am currently considering Hadoop with Hive for batch processing. What other viable options are there, given that this is an academic project with a one-semester time constraint and limited computational resources?

Note that real-time querying of the data is not required.
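To make the task concrete, here is a minimal local sketch of the computation I have in mind, assuming the Google Books v2 file layout (tab-separated rows of `ngram`, `year`, `match_count`, `volume_count`); the word lists and sample rows below are purely illustrative:

```python
from collections import defaultdict

def cooccurrence_counts(lines, targets, contexts):
    """Sum match_count for each (target, context) pair that co-occurs
    inside an n-gram window.  `lines` are rows in the assumed format:
    ngram TAB year TAB match_count TAB volume_count."""
    cooc = defaultdict(int)
    for line in lines:
        ngram, _year, match_count, _volumes = line.rstrip("\n").split("\t")
        words = ngram.lower().split()
        count = int(match_count)
        for i, w in enumerate(words):
            if w not in targets:
                continue
            # every context word in the same window contributes the
            # n-gram's match_count to the (target, context) cell
            for j, c in enumerate(words):
                if j != i and c in contexts:
                    cooc[(w, c)] += count
    return dict(cooc)

# Toy rows standing in for one shard of the corpus:
sample = [
    "the bank of the river\t1990\t7\t3",
    "interest rate at the bank\t1991\t5\t2",
]
print(cooccurrence_counts(sample, {"bank", "interest"}, {"river", "rate"}))
```

The real job is the same loop distributed over all shards, with the partial count tables summed at the end, which is why a batch framework like Hadoop seems like a natural fit.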

anshuman

1 Answer


Hive has built-in UDFs for n-gram frequency estimation, `ngrams()` and `context_ngrams()`: https://cwiki.apache.org/Hive/statisticsanddatamining.html#StatisticsAndDataMining-ngrams%2528%2529andcontextngrams%2528%2529%253ANgramfrequencyestimation

cran1um