Effectively derive term co-occurrence matrix from Google Ngrams

Question

I need to use the lexical data from Google Books N-grams to construct a (sparse!) matrix of term co-occurrences (where rows are words and columns are the same words, and the cells reflect how many times they appear in the same context window). The resulting tcm would then be used to measure a bunch of lexical statistics and serve as input to vector semantics methods (GloVe, LSA, LDA).

For reference, the Google Books (v2) dataset is formatted as follows (tab-separated):

ngram      year    match_count    volume_count
some word  1999    32             12            # example bigram

The problem, of course, is that these data are huge. However, I only need a subset of the data from certain decades (about 20 years' worth of ngrams), and I am happy with a context window of up to 2 (i.e., using the trigram corpus). I have a few ideas, but none seem particularly good.
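As a rough sketch of the year-filtering step: the lines can be streamed once and match_count summed per ngram over the years of interest, so nothing has to be held in memory except the running totals. This is Python rather than R and uses only the standard library. The function name `aggregate_counts` and the sample lines are made up for illustration, and the input is assumed to follow the four-column tab-separated layout shown above.

```python
from collections import Counter

def aggregate_counts(lines, first_year, last_year):
    """Sum match_count per ngram over the years of interest (hypothetical helper)."""
    totals = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip lines that do not have the expected four columns
        ngram, year, match_count, _volume_count = parts
        if first_year <= int(year) <= last_year:
            totals[ngram] += int(match_count)
    return totals

# Made-up sample lines in the ngram / year / match_count / volume_count layout
sample = [
    "as we know\t1985\t10\t3\n",
    "as we know\t1999\t22\t9\n",
    "as we know\t2005\t7\t2\n",   # outside the year range, ignored
    "i know you\t1990\t54\t20\n",
]
counts = aggregate_counts(sample, 1980, 1999)
print(counts)
```

The function accepts any iterable of lines, so if the downloaded files are gzip-compressed, a file object from `gzip.open(path, "rt")` could be passed in directly.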

-Idea 1- My initial plan was more or less this:

# preprocessing (pseudo)
for file in trigram-files:
    download $file
    filter $lines where 'year' tag matches one of years of interest
    find the frequency of each of those ngrams (match_count)
    cat those $lines * $match_count >> file2
    # (write the same line x times according to the match_count tag)
    remove $file

# tcm construction (using R)
grams <- readLines("file2") # read lines from file2, one ngram per element
library(text2vec)
# treat lines (ngrams) as documents to avoid unrelated ngram overlap
it         <- itoken(grams)
vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2)
tcm        <- create_tcm(it, vectorizer) # nice and sparse

However, I have a hunch this might not be the best solution. The ngram data files already contain the co-occurrence data in the form of n-grams, and there is a tag that gives the frequency. I have a feeling there should be a more direct way.

-Idea 2- I was also thinking of writing each filtered ngram only once into the new file (instead of replicating it match_count times), then creating an empty tcm, looping over the whole (year-filtered) ngram dataset, and recording every instance where any two words co-occur (using the match_count tag) to populate the tcm. But, again, the data is big, and this kind of looping would probably take ages.
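To make Idea 2 concrete, here is a minimal sketch of that loop, in Python with the standard library only. It assumes the ngrams have already been reduced to one total count each, and it stores only the nonzero cells in a dictionary keyed by integer word ids, so the result is sparse by construction. The name `pair_counts` and the two example trigrams are illustrative, and word order within the ngram is kept (left word as row, right word as column), which is an assumption about how the cells should be defined.

```python
from collections import defaultdict
from itertools import combinations

def pair_counts(ngram_counts):
    """Accumulate co-occurrence counts from {ngram: total match_count} (hypothetical helper)."""
    vocab = {}                # word -> integer id, assigned in order of first appearance
    cells = defaultdict(int)  # (row id, col id) -> count; only nonzero cells are stored
    for ngram, match_count in ngram_counts.items():
        ids = [vocab.setdefault(word, len(vocab)) for word in ngram.split(" ")]
        # every ordered pair of positions within the ngram counts as one co-occurrence
        for left, right in combinations(ids, 2):
            cells[(left, right)] += match_count
    return vocab, cells

vocab, cells = pair_counts({"as we know": 32, "i know you": 54})
for (i, j), count in sorted(cells.items()):
    print(i, j, count)
```

The work here grows with the number of distinct ngrams rather than with the total number of occurrences, since each ngram contributes its match_count in one step.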

-Idea 3- I found a Python library called google-ngram-downloader that apparently has a co-occurrence matrix creation function, but looking at the code, it would create a regular (not sparse) matrix (which would be massive, given that most entries are 0s), and (if I got it right) it simply loops through everything (and I assume a Python loop over this much data would be very slow), so it seems to be aimed at much smaller subsets of the data.

edit -Idea 4- I came across this old SO question asking about using Hadoop and Hive for a similar task, with a short answer containing a broken link and a comment about MapReduce (I am not familiar with any of these, so I would not know where to start).

But I'm thinking I can't be the first one with the need to tackle such a task, given the popularity of the Ngram dataset, and the popularity of (non-word2vec) distributed semantics methods that operate on a tcm or dtm input; hence ->

...the question: what would be a more reasonable/effective way of constructing a term-term co-occurrence matrix from Google Books Ngram data? (be it a variation of the proposed ideas or something completely different; R preferred but not necessary)

Can you give an example of how you would count co-occurrences for tri-grams? What should it look like? — Dmitriy Selivanov, Jan 25 '17 at 16:56
Well, using the (possibly naive) ngrams-as-documents approach, something like `x <- list(c("this", "is", "example"), c("example", "it", "is")); it <- itoken(x); vocab <- create_vocabulary(it); vectorizer <- vocab_vectorizer(vocab, skip_grams_window = 2); tcm <- create_tcm(it, vectorizer); print(vocab); print(tcm)` But this kind of feels like taking the long way around (books/docs -> to ngrams -> import ngrams as docs -> create skipgrams from ngrams -> create_tcm), while an ngram essentially states the fact of co-occurrence already, and the data gives a number of how many times any ngram occurs — user3554004, Jan 25 '17 at 17:37

Dmitriy Selivanov · Accepted Answer · 2017-01-25T18:49:26.490

I will give an idea of how you can do this, though it can be improved in several places. I deliberately wrote it in "spaghetti style" for better interpretability, but it can be generalized to more than tri-grams.

library(data.table)
library(magrittr) # provides %>%

ngram_dt = data.table(ngram = c("as we know", "i know you"), match_count = c(32, 54))
# here we split tri-grams to obtain words (one column per tri-gram)
tokens_matrix = strsplit(ngram_dt$ngram, " ", fixed = TRUE) %>% simplify2array()

# vocab here is vocabulary from chunk, but you can be interested first 
# to create vocabulary from whole corpus of ngrams and filter non 
# interesting/rare words

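# as.vector() is needed: unique() on a matrix drops duplicate rows instead of returning unique words
vocab = unique(as.vector(tokens_matrix))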
# convert char matrix to integer matrix for faster downstream calculations 
tokens_matrix_int = match(tokens_matrix, vocab)
dim(tokens_matrix_int) = dim(tokens_matrix)

ngram_dt[, token_1 := tokens_matrix_int[1, ]]
ngram_dt[, token_2 := tokens_matrix_int[2, ]]
ngram_dt[, token_3 := tokens_matrix_int[3, ]]

dt_12 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_1, token_2)]
dt_23 = ngram_dt[, .(cnt = sum(match_count)), keyby = .(token_2, token_3)]
# note here 0.5 - discount for more distant word - we follow text2vec discount of 1 / distance
dt_13 = ngram_dt[, .(cnt = 0.5 * sum(match_count)), keyby = .(token_1, token_3)]

# bind by position: the column names differ between the three tables
dt = rbindlist(list(dt_12, dt_13, dt_23), use.names = FALSE)
# "reduce" by word indices again - sum pair co-occurences which were in different tri-grams
dt = dt[, .(cnt = sum(cnt)), keyby = .(token_1, token_2)]

# triplet form; newer versions of Matrix deprecate giveCsparse in favor of repr = "T"
tcm = Matrix::sparseMatrix(i = dt$token_1, j = dt$token_2, x = dt$cnt, dims = rep(length(vocab), 2), index1 = TRUE,
                   giveCsparse = FALSE, check = FALSE, dimnames = list(vocab, vocab))
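As for generalizing beyond tri-grams, the same logic can be sketched for ngrams of any length by weighting each pair by 1 / distance between the two positions. The sketch below is a Python illustration using only the standard library, not a drop-in replacement for the R code above. The name `weighted_triplets` is made up, and the output is three parallel lists in (row, column, value) triplet form with 0-based word ids.

```python
from collections import defaultdict

def weighted_triplets(ngrams, match_counts):
    """Distance-discounted co-occurrence triplets for ngrams of any length (hypothetical helper)."""
    vocab = {}                  # word -> 0-based integer id
    cells = defaultdict(float)  # (row id, col id) -> weighted count
    for ngram, count in zip(ngrams, match_counts):
        ids = [vocab.setdefault(word, len(vocab)) for word in ngram.split(" ")]
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                # discount of 1 / distance: adjacent words get the full count,
                # words two positions apart get half of it, and so on
                cells[(ids[a], ids[b])] += count / (b - a)
    rows = [i for i, _ in cells]
    cols = [j for _, j in cells]
    vals = list(cells.values())
    return vocab, rows, cols, vals

vocab, rows, cols, vals = weighted_triplets(["as we know", "i know you"], [32, 54])
for i, j, x in zip(rows, cols, vals):
    print(i, j, x)
```

The three lists correspond to the `i`, `j` and `x` arguments of a triplet-style sparse matrix constructor. Because the ids here are 0-based, they would need `index1 = FALSE` (or a shift by one) if passed to `Matrix::sparseMatrix`.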
