I need to calculate word co-occurrence statistics for roughly 10,000 target words, each paired against a few hundred context words, from the Google Books n-gram corpus.
Below is the link to the full dataset:
As is evident, the dataset is approximately 2.2 TB and contains a few hundred billion rows. To compute the co-occurrence statistics I need to process the entire dataset once for every possible pair of target and context words. I am currently considering Hadoop with Hive for batch processing. What other viable options are there, given that this is an academic project with a time constraint of one semester and limited computational resources?
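For concreteness, here is a minimal single-machine sketch of the counting step I have in mind. It assumes the v2 tab-separated file layout (ngram, year, match_count, volume_count) and treats "co-occurrence" as target and context appearing in the same n-gram; the word lists here are hypothetical placeholders, as the real ones would be loaded from files:

    import gzip
    import sys
    from collections import defaultdict
    from itertools import product

    # Placeholder word lists for illustration; in practice load the
    # ~10,000 targets and few hundred context words from files.
    TARGETS = {"science", "language"}
    CONTEXTS = {"computer", "natural"}

    def count_cooccurrences(path, counts):
        """Stream one gzipped n-gram file, assuming the v2 layout:
        ngram TAB year TAB match_count TAB volume_count."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    continue  # skip malformed lines
                tokens = fields[0].split()
                match_count = int(fields[2])
                # Count a co-occurrence whenever a target and a context
                # word appear together in the same n-gram, summing the
                # match_count over all years.
                for t, c in product(TARGETS.intersection(tokens),
                                    CONTEXTS.intersection(tokens)):
                    if t != c:
                        counts[(t, c)] += match_count

    if __name__ == "__main__":
        counts = defaultdict(int)
        for path in sys.argv[1:]:
            count_cooccurrences(path, counts)
        for (t, c), n in sorted(counts.items()):
            print(f"{t}\t{c}\t{n}")

Since each line is processed independently, the same logic maps directly onto a Hadoop mapper, or the files could simply be partitioned across a few machines and the resulting count tables merged at the end.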
Note that real-time querying of the data is not required.