Maybe I'm misunderstanding how tm::DocumentTermMatrix works. I have a corpus that, after preprocessing, looks like this:

head(Description.text, 3)
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"                    
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"     
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"

which I process via:

Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
    bounds = list(local = c(3, Inf)),
    tokenize = 'scan'
))

When I inspect the first row of the DTM, I get this:

inspect(Description.text.features[1,])
<<DocumentTermMatrix (documents: 1, terms: 887)>>
Non-/sparse entries: 0/887
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs banc camill mar martin ospedal presid san sanitar torin vittor
   1    0      0   0      0       0      0   0       0     0      0

These terms don't correspond to the first document in the corpus Description.text: for example, banc and camill do not occur in the first document, while martin and presid, which do occur in it, are shown with a count of zero.

Furthermore if I run:

Description.text.features[1,] %>% as.matrix() %>% sum

I get zero, showing that in the first document there are no terms with frequency > zero!

What's going on here?

Thanks

UPDATE

I wrote my own 'corpus to DTM' function, and it does give very different results. The term weights differ considerably from those of tm::DocumentTermMatrix (mine are what you would expect given the corpus), and my function also yields many more terms (~3000 vs. ~800 with tm).

Here's my function:

corpus.to.DTM <- function(corpus, min.doc.freq = 3, minlength = 3, weight.fun = weightTfIdf) {
    library(dplyr)
    library(magrittr)
    library(stringr)   # provides str_length()
    library(tm)
    library(parallel)

    # Vocabulary: terms that occur in at least min.doc.freq documents
    # and are at least minlength characters long
    lvls <- mclapply(corpus, function(doc) words(doc) %>% unique, mc.cores = 8) %>%
        unlist %>%
        table %>%
        data.frame %>%
        set_colnames(c('term', 'freq')) %>%
        mutate(lengths = str_length(term)) %>%
        filter(freq >= min.doc.freq & lengths >= minlength) %>%
        use_series(term)

    # One row of raw term counts per document, restricted to the vocabulary
    dtm <- mclapply(corpus, function(doc) factor(words(doc), levels = lvls) %>% table %>% as.vector, mc.cores = 8) %>%
        do.call(what = 'rbind') %>%
        set_colnames(lvls)

    # Apply the weighting function passed in weight.fun
    as.DocumentTermMatrix(dtm, weighting = weight.fun) %>%
        as.matrix() %>%
        as.data.frame()
}
Bakaburg

1 Answer

Here's a workaround using the tm alternative, quanteda. You might even find the relative simplicity of the latter, combined with its speed and features, sufficient to use it for the rest of your analysis too!

description.text <- 
  c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
    "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
    "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")

require(quanteda)
require(magrittr)

qdfm <- dfm(tokens(description.text))  # tokenize first; recent quanteda versions expect tokens here, not raw text
head(qdfm, nfeat = 10)
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
# (showing first 3 documents and first 10 features)
#        features
# docs    azi sanitar local to1 presid osp martin ospedalier tofan torin
#   text1   1       1     1   1      2   1      2          1     1     1
#   text2   0       0     0   0      0   0      2          0     1     2
#   text3   0       0     0   0      0   0      2          0     0     0

qdfm2 <- qdfm %>% dfm_trim(min_termfreq = 3, min_docfreq = 3)  # min_termfreq was called min_count in older quanteda releases
qdfm2
# Document-feature matrix of: 3 documents, 2 features (0% sparse).
# (showing first 3 documents and first 2 features)
#        features
# docs    martin ospedal
#   text1      2       1
#   text2      2       2
#   text3      2       2

To convert back to tm:

convert(qdfm2, to = "tm")
# <<DocumentTermMatrix (documents: 3, terms: 2)>>
# Non-/sparse entries: 6/0
# Sparsity           : 0%
# Maximal term length: 7
# Weighting          : term frequency (tf)

In your example you use tf-idf weighting. That's also easy in quanteda:

dfm_tfidf(qdfm) %>% head  # tf-idf weighting; older quanteda releases used dfm_weight(qdfm, "tfidf")
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
# (showing first 3 documents and first 6 features)
#          features
# docs          azi   sanitar     local       to1    presid       osp
#   text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213
#   text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#   text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
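
As for the tm result itself, a likely explanation (based on my reading of the tm documentation, not verified against the original corpus) is that bounds = list(local = c(3, Inf)) filters on how often a term occurs within each single document, whereas a document-frequency threshold is set with the global tag. With local bounds, a term that occurs only once or twice in a document is dropped from that document's row, which would produce all-zero rows like the one in the question. Below is a minimal sketch of the difference. The three toy documents are made up for illustration, and it uses VCorpus rather than Corpus so that the per-document termFreq() path is the one being exercised.

```r
library(tm)

# Hypothetical toy documents, not the original data
docs <- c("martin martin martin ospedal torin",
          "martin ospedal torin croll",
          "martin ospedal pediatr")
corpus <- VCorpus(VectorSource(docs))

# local bounds: within each document, keep only terms occurring >= 3 times
dtm_local <- DocumentTermMatrix(
  corpus, control = list(bounds = list(local = c(3, Inf))))

# global bounds: keep only terms that occur in >= 3 documents
dtm_global <- DocumentTermMatrix(
  corpus, control = list(bounds = list(global = c(3, Inf))))

# "martin" should be the only surviving term here, with zeros for
# documents 2 and 3 even though both of them contain the word
as.matrix(dtm_local)

# "martin" and "ospedal" should survive, with their actual counts
as.matrix(dtm_global)
```

If a minimum document frequency is what you are after, switching local to global in the original call is probably the change you want.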
Ken Benoit
  • Thanks for the suggestion, I'll take a look at the package! But my question was specifically about what was going wrong with tm. – Bakaburg Aug 14 '17 at 14:19