
I am trying to compute a term-term co-occurrence matrix (or TCM) from a corpus using the text2vec package in R (since it has a nice parallel backend). I followed this tutorial, but while inspecting some toy examples, I noticed that the `create_tcm` function applies some sort of scaling or weighting to the term-term co-occurrence values. I know it uses skip-grams internally, but the documentation does not mention how they are scaled; clearly, more distant terms/unigrams are weighted lower.

Here is an example:

library(text2vec)

tcmtest <- function(sentences) {
  tokens <- space_tokenizer(sentences)
  it <- itoken(tokens, progressbar = FALSE)
  vocab <- create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L))
  vectorizer <- vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5L)
  create_tcm(it, vectorizer)
}

> tcmtest(c("a b", "a b c"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b c   a
b . 1 2.0
c . . 0.5
a . . .  
> tcmtest(c("a b", "c a b"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b   c a
b . 0.5 2
c . .   1
a . .   .
> tcmtest(c("a b", "c a a a b"))
3 x 3 sparse Matrix of class "dgTMatrix"
  b    c        a
b . 0.25 2.833333
c . .    1.833333
a . .    .  

Question: is there any way to disable this behaviour, so that every term/unigram in the skip-gram window is treated equally? I.e., if a term occurs inside the context window of another term twice in a corpus, the corresponding entry in the TCM should be 2.

Bonus question: how does the default scaling work anyway? If you add more "a"s to the last example, the b-c value seems to decrease linearly, while the b-a value actually increases - although more occurrences of "a" appear further away from "b".

user3554004
  • Bonus question: In the last example, "b" is the 4th letter from "c". That's 1/4. For "a" and "c", you have "a" at 1, 2, and 3 letters from "c". That's 1 + 1/2 + 1/3 = 1.833333. For "b" and "a", you have 2 occasions where "a" is right next to "b", then two more instances of "a" at distances of 2 and 3 from "b", respectively. That's 2 + 1/2 + 1/3. Since you have `skip_grams_window = 5`, `c("a b", "c a a a a a a b")` and `c("a b", "c a a a a a b")` will produce the same result from `tcmtest`. – Jota Oct 30 '16 at 16:08
  • @Jota cool, that makes sense. Now how do I bypass this scaling thing? – user3554004 Oct 30 '16 at 16:13
  • @user3554004 in fact you are not using any parallel backend in your example. See how we create `jobs` in this [vignette](https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html#multicore_machines). Also I want to add that using a parallel backend for `tcm` makes sense only for very large corpora. – Dmitriy Selivanov Oct 31 '16 at 06:01
  • @DmitriySelivanov thanks for the comment; I'm just starting out, and this is very helpful. If you don't mind me asking, roughly how large (in tokens) does a corpus need to be to justify parallelizing the TCM job? (From what I understand, GloVe fitting uses low-level parallelism by default?) – user3554004 Nov 01 '16 at 13:39
  • @user3554004 GloVe fitting is done with RcppParallel, so there is no overhead (neither RAM nor CPU). `tcm` creation is different - we compute it for each chunk and then sum the chunks, [see here](https://github.com/dselivanov/text2vec/blob/93f656bc082c2f989da067b55dc1037b6a97d81d/R/tcm.R#L119-L142). However, in this case we face overhead in memory consumption... As long as we have enough RAM, we can try to grow it in parallel. But from my experience, `tcm` building is usually not a bottleneck - we build it once and then reuse it many times. Creating the `tcm` in 1 thread on English Wikipedia takes ~2 hours. – Dmitriy Selivanov Nov 01 '16 at 17:51
  • @user3554004 now (text2vec >= 0.5) you can specify the weighting vector directly in the `create_tcm` function. – Dmitriy Selivanov Feb 01 '17 at 18:50

1 Answer


UPDATE 2017-02-01: Pushed an update to GitHub - now you can specify the weighting vector directly in `create_tcm`.
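A minimal sketch of the updated usage, assuming the argument is called `weights` and that `skip_grams_window` is now passed to `create_tcm` rather than `vocab_vectorizer` (check `?create_tcm` for the exact signature in your installed version):

```r
library(text2vec)

tokens <- space_tokenizer(c("a b", "c a a a b"))
it <- itoken(tokens, progressbar = FALSE)
vectorizer <- vocab_vectorizer(create_vocabulary(it))

# weights[k] is the weight applied to a co-occurrence at distance k;
# a constant vector treats every position in the window equally
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L, weights = rep(1, 5))
```

With equal weights, every co-occurrence inside the window contributes exactly 1, so for this corpus the b-a entry would be 4 (one pair from "a b", three from "c a a a b") instead of 2.833333.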

The weighting function is defined here. If you need equal weight for each term within the window, you need to adjust the weighting function to always return 1 (just clone the repo, change the function definition, and build the package from source with devtools or R CMD build):

inline float weighting_fun(uint32_t offset) {
  // ignore the distance (offset) and weight every co-occurrence equally
  return 1.0f;
}
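To also answer the bonus question: as worked out in the comments, the default weighting gives a co-occurrence at distance d (within the window) a weight of 1/d, and these weights are summed over all occurrences. A quick base-R check against the values from the `c("a b", "c a a a b")` example above:

```r
# b-a: "a b" has "a" at distance 1 from "b";
#      "c a a a b" has "a" at distances 1, 2, 3 from "b"
w_ba <- 1/1 + (1/1 + 1/2 + 1/3)

# b-c: a single co-occurrence at distance 4
w_bc <- 1/4

# a-c: "a" at distances 1, 2, 3 from "c"
w_ac <- 1/1 + 1/2 + 1/3

round(c(w_ba, w_bc, w_ac), 6)
# 2.833333 0.250000 1.833333 - matches the TCM in the question
```

This also explains why adding more "a"s makes the b-c value shrink (the single b-c pair moves one position further apart each time, so its weight drops from 1/2 to 1/3 to 1/4, ...) while the b-a value grows (each new "a" adds another, smaller 1/d term).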

However, several people have already asked for this feature, and I will probably include such an option in the next release.

Dmitriy Selivanov