I am looking for an efficient way to create a term co-occurrence matrix (tcm) for a given target word (ideally for each target word) in a corpus, such that each occurrence of the word constitutes its own vector (row) in the tcm, and the columns are the context words (i.e., a token-based model of co-occurrence). This is in contrast with the more common approach used in vector semantics, where each term (type) gets a row and a column in a symmetric tcm, and the values are aggregated across the (co-)occurrences of the tokens of each type.
Obviously this could be done from scratch using base R functionality, or hacked by filtering a tcm generated by one of the existing packages that build them, but the corpus data I'm dealing with is rather big (millions of words), and there are already nice corpus/NLP packages available for R that do this sort of task efficiently and store the results in memory-friendly sparse matrices, such as text2vec (function create_tcm), quanteda (fcm) and tidytext (cast_dtm). Therefore it does not seem to make sense to try to reinvent the wheel (in terms of iterators, hashing and whatnot). But I cannot spot a straightforward way to create a token-based tcm with any of these either; hence this question.
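To make the target structure concrete, the from-scratch route could be sketched roughly as below. This is only an illustration under my own assumptions: `token_tcm` is a hypothetical helper (not a function from text2vec, quanteda or tidytext), the window is symmetric and unweighted, and other occurrences of the target word are dropped from the context columns. The plain R loop is not tuned for a corpus of millions of words, which is exactly why I'd prefer a package-based solution.

```r
library(Matrix)

# Hypothetical helper, not part of any package: builds one row per occurrence
# of `target` and one column per context type, counting context words that
# fall within +/- `window` tokens of that occurrence.
token_tcm <- function(docs, target, window = 2L) {
  rows <- list()
  words <- list()
  n <- 0L
  for (doc in docs) {
    for (p in which(doc == target)) {
      n <- n + 1L
      # positions inside the window, excluding the target position itself
      ctx <- setdiff(max(1L, p - window):min(length(doc), p + window), p)
      w <- doc[ctx]
      w <- w[w != target]  # assumption: the target is not its own context
      rows[[n]] <- rep(n, length(w))
      words[[n]] <- w
    }
  }
  if (n == 0L) stop("target word not found in corpus")
  words <- as.character(unlist(words))
  vocab <- unique(words)
  # duplicate (i, j) pairs are summed by sparseMatrix, which yields counts
  sparseMatrix(
    i = as.integer(unlist(rows)),
    j = match(words, vocab),
    x = rep(1, length(words)),
    dims = c(n, length(vocab)),
    dimnames = list(paste0(target, seq_len(n)), vocab)
  )
}

docs <- strsplit(c("here is a short document",
                   "here is a different short document"), " ")
m <- token_tcm(docs, "short", window = 2L)
print(as.matrix(m))
```

Note that in this sketch only words that actually occur in some context window become columns, so a word that never appears near the target gets no all-zero column.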
Minimal example:
library(text2vec)
library(Matrix)
library(magrittr)
# default approach to tcm with text2vec:
corpus = strsplit(c("here is a short document", "here is a different short document"), " ")
it = itoken(corpus)
tcm = create_vocabulary(it) %>% vocab_vectorizer() %>% create_tcm(it, . , skip_grams_window = 2, weights = rep(1,2))
# results in this:
print(as.matrix(forceSymmetric(tcm, "U")))
          different here short document is a
different         0    0     1        1  1 1
here              0    0     0        0  2 2
short             1    0     0        2  1 2
document          1    0     2        0  0 1
is                1    2     1        0  0 2
a                 1    2     2        1  2 0
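Attempt to get a token-based model, for target word "short":
i = 0
corpus = lapply(corpus, function(x) {
  # append a running index to each occurrence so itoken distinguishes them;
  # indexing per occurrence keeps labels unique even if "short" appears
  # more than once in the same document
  hits = which(x == "short")
  x[hits] = paste0("short", i + seq_along(hits))
  i <<- i + length(hits)
  x
})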
it = itoken(corpus)
tcm = create_vocabulary(it) %>% vocab_vectorizer() %>% create_tcm(it, . , skip_grams_window = 2, weights = rep(1,2))
attempt = as.matrix(forceSymmetric(tcm, "U") %>%
.[grep("^short", rownames(.)), -grep("^short", colnames(.))]
) # filters the resulting full tcm
# yields intended result but is hacky/slow:
print(attempt)
       different here document is a
short2         1    0        1  0 1
short1         0    0        1  1 1
What is a better/faster alternative to this approach for deriving a token-based tcm like the one in the last example (possibly using one of the R packages that already build type-based tcms)?