Maybe I misinterpret how tm::DocumentTermMatrix
works. I have a corpus which after preprocessing looks like this:
head(Description.text, 3)
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"
which I process via:
Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
bounds = list(local = c(3, Inf)),
tokenize = 'scan'
))
when I inspect the first row of the DTM i get this:
inspect(Description.text.features[1,])
<<DocumentTermMatrix (documents: 1, terms: 887)>>
Non-/sparse entries: 0/887
Sparsity : 100%
Maximal term length: 15
Weighting : term frequency (tf)
Sample :
Terms
Docs banc camill mar martin ospedal presid san sanitar torin vittor
1 0 0 0 0 0 0 0 0 0 0
These terms don't correspond to the fist document in the corpus Description.text
(eg. banc
or camill
are not in the first document and there is a zero for eg martin
or presid
which are).
Furthermore if I run:
Description.text.features[1,] %>% as.matrix() %>% sum
I get zero, showing that in the first document there are no terms with frequency > zero!
What's going on here?
Thanks
UPDATE
I created my own 'corpus to dtm' function and indeed it gives very different results. Apart from document terms weights very different from those of tm::DocumentTermMatrix
(mine are what you would expect given the corpus), I get much more terms with my function than with the tm function (~3000 vs 800 of tm).
Here's my function:
corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
library(dplyr)
library(magrittr)
library(tm)
library(parallel)
lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
unlist %>%
table %>%
data.frame %>%
set_colnames(c('term', 'freq')) %>%
mutate(lengths = str_length(term)) %>%
filter(freq >= min.doc.freq & lengths >= minlength) %>%
use_series(term)
dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
do.call(what = 'rbind') %>%
set_colnames(lvls)
as.DocumentTermMatrix(dtm, weighting = weightTfIdf) %>%
as.matrix() %>%
as.data.frame()
}