I am trying to compute a term-term co-occurrence matrix (TCM) from a corpus using the text2vec package in R (since it has a nice parallel backend). I followed this tutorial, but while inspecting some toy examples I noticed that the create_tcm function applies some sort of scaling or weighting to the term-term co-occurrence values. I know it uses skip-grams internally, but the documentation does not mention how it scales them; clearly, more distant terms/unigrams are weighted lower.
Here is an example:
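    library(text2vec)

    # Build a term-term co-occurrence matrix from a character vector of sentences
    tcmtest <- function(sentences) {
      tokens <- space_tokenizer(sentences)
      it <- itoken(tokens, progressbar = FALSE)
      # unigram-only vocabulary
      vocab <- create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L))
      # skip-gram window of 5 tokens
      vectorizer <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L)
      return(create_tcm(it, vectorizer))
    }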
    > tcmtest(c("a b", "a b c"))
    3 x 3 sparse Matrix of class "dgTMatrix"
      b c   a
    b . 1 2.0
    c . . 0.5
    a . . .

    > tcmtest(c("a b", "c a b"))
    3 x 3 sparse Matrix of class "dgTMatrix"
      b   c a
    b . 0.5 2
    c . .   1
    a . .   .

    > tcmtest(c("a b", "c a a a b"))
    3 x 3 sparse Matrix of class "dgTMatrix"
      b    c        a
    b . 0.25 2.833333
    c . .    1.833333
    a . .    .
Question: is there any way to disable this behaviour, so that every term/unigram in the skip-gram window is treated equally? I.e., if a term occurs inside the context window of another term twice in a corpus, the corresponding TCM entry should be 2.
Bonus question: how does the default scaling work anyway? If you add more "a"s to the last example, the b-c value seems to decrease linearly, while the b-a value actually increases, even though the additional occurrences of "a" appear further away from "b".
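For what it is worth, the toy outputs above look consistent with each co-occurrence being weighted by 1/distance between the two tokens and summed over all pairs inside the window, but that is only a guess from the numbers, not something confirmed by the documentation. A minimal base-R sketch of that assumption follows; `weighted_cooc` is a hypothetical helper for checking the arithmetic, not a text2vec function, and it only handles pairs of distinct terms:

```r
# Hypothetical helper (NOT part of text2vec): under the ASSUMED 1/distance
# weighting, sum 1/d over every pair of positions of term1 and term2 that
# lie within `window` tokens of each other in the same sentence.
weighted_cooc <- function(sentences, term1, term2, window = 5L) {
  total <- 0
  for (s in sentences) {
    tokens <- strsplit(s, " ", fixed = TRUE)[[1]]
    pos1 <- which(tokens == term1)
    pos2 <- which(tokens == term2)
    if (length(pos1) == 0L || length(pos2) == 0L) next
    # all pairwise token distances between the two terms
    d <- abs(outer(pos1, pos2, "-"))
    d <- d[d > 0 & d <= window]
    total <- total + sum(1 / d)
  }
  total
}

corpus <- c("a b", "c a a a b")
weighted_cooc(corpus, "a", "b")  # 1 + (1 + 1/2 + 1/3) = 2.833333
weighted_cooc(corpus, "c", "a")  # 1 + 1/2 + 1/3       = 1.833333
weighted_cooc(corpus, "b", "c")  # 1/4                 = 0.25
```

These match the three values in the last example. If the guess is right, it would also explain the bonus observation: every extra "a" adds another positive 1/d term to b-a, while b-c is a single 1/d term whose distance keeps growing.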