I have a DocumentTermMatrix and I'd like to replace specific terms in it and then create a frequency table.

The starting point is the original document as follows:

    library(tm)
    library(qdap)

    df1 <- data.frame(word =c("test", "test", "teste", "hey", "heyyy", "hi"))
    tdm <- as.DocumentTermMatrix(as.character(df1$word))

When I create a frequency table of the original document I get the correct results:

    freq0 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
    freq0

So far so good. However, if I replace some terms in the document, the new frequency table gives wrong results:

    tdm$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm$dimnames$Terms), fixed=T, trim=T)
    freq1 <- as.matrix(sort(colSums(as.matrix(tdm)), decreasing=TRUE))
    freq1

It seems that some indexing in the matrix is off, because identical terms are not treated as the same term when they are counted.

This outcome should be the ideal case:

    df2 <- data.frame(word =c("test", "test", "test", "hey", "hey", "hi"))
    tdm2 <- as.DocumentTermMatrix(as.character(df2$word))
    tdm2$dimnames$Terms <- mgsub(c("teste", "heyyy"), c("test", "hey"), as.character(tdm2$dimnames$Terms), fixed=T, trim=T)
    freq2 <- as.matrix(sort(colSums(as.matrix(tdm2)), decreasing=TRUE))
    freq2

Can anyone help me to figure out the problem?

Thx in advance

OAM
  • I get an error with `as.DocumentTermMatrix(as.character(df1$word))`: `Error in .TermDocumentMatrix(t(x), weighting) : argument "weighting" is missing, with no default` – akrun Jun 24 '16 at 11:40
  • `library(qdap)` was in the wrong position in the code. It should be reproducible now – OAM Jun 24 '16 at 11:53
  • I am not sure why `colSums` is used, since `as.matrix(tdm)` has 1 row and 5 columns. You can try `m1 <- as.matrix(tdm); tapply(m1, dimnames(m1)[[2]], FUN = sum)` – akrun Jun 24 '16 at 12:01

1 Answer

We can look at the structure of as.matrix(tdm):

    str(as.matrix(tdm))
    # num [1, 1:5] 1 1 1 2 1
    # - attr(*, "dimnames")=List of 2
    #  ..$ Docs : chr "all"
    #  ..$ Terms: chr [1:5] "hey" "heyyy" "hi" "test" ...

This is a one-row, 5-column matrix, so colSums just returns that single row and does not combine columns that share a name. Instead, we can sum the counts grouped by term:

    xtabs(as.vector(tdm)~tdm$dimnames$Terms)
    #tdm$dimnames$Terms
    #  hey heyyy    hi  test teste 
    #    1     1     1     2     1 

and after replacing the terms with mgsub:

    xtabs(as.vector(tdm)~tdm$dimnames$Terms)
    #tdm$dimnames$Terms
    # hey   hi test 
    #   2    1    3 

xtabs sums the counts within each term. The same can be done with tapply:

    tapply(as.vector(tdm), tdm$dimnames$Terms, FUN = sum)
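
To see the effect without installing tm or qdap, here is a minimal base R sketch. The matrix `m` is a hypothetical hand-built stand-in for `as.matrix(tdm)` after the renaming (it is not produced by tm), and the names `cs` and `merged` are only illustrative:

```r
# Hypothetical stand-in for as.matrix(tdm) after the mgsub() renaming:
# one document ("all") and five term columns, two pairs of which share a name.
m <- matrix(c(1, 1, 1, 2, 1), nrow = 1,
            dimnames = list(Docs = "all",
                            Terms = c("hey", "hey", "hi", "test", "test")))

# colSums() sums down each column separately, so the duplicate
# column names are kept as five separate entries.
cs <- colSums(m)
print(cs)

# Grouping the counts by column name merges the duplicates.
merged <- tapply(cs, names(cs), FUN = sum)
print(merged)
```

Here `cs` still has five entries, while `merged` has one entry per distinct term.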

If the number of rows is greater than 1 (for example, the two-document matrix from the comments), we can take colSums first and then sum by term:

    tapply(colSums(as.matrix(tdm)),  tdm$dimnames$Terms, FUN = sum)
    # hey   hi test 
    #   4    2    6 

NOTE: The above output is after we made the changes with mgsub
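
The multi-document case can be sketched in base R as well. The matrix `m` below is a hypothetical hand-built stand-in for `as.matrix(tdm)` with two documents after the renaming; the document names `doc1` and `doc2` are assumptions. The `rowsum` step is an addition not in the answer above: it keeps per-document counts instead of corpus totals.

```r
# Hypothetical stand-in for a two-document matrix after the renaming:
# doc1 = test, test, teste, hey, heyyy, hi
# doc2 = teste, teste, teste, hey, heyyy, hi
m <- matrix(c(1, 1, 1, 2, 1,
              1, 1, 1, 0, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(Docs = c("doc1", "doc2"),
                            Terms = c("hey", "hey", "hi", "test", "test")))

# Totals over all documents: sum each column, then merge equal names.
totals <- tapply(colSums(m), colnames(m), FUN = sum)
print(totals)

# Per-document counts with merged columns: rowsum() groups rows,
# so transpose first and transpose the result back.
per_doc <- t(rowsum(t(m), group = colnames(m)))
print(per_doc)
```

`totals` gives one count per term for the whole corpus, and `per_doc` is a documents-by-terms matrix with one column per distinct term.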

akrun
  • Thanks, this is a great answer! But what if I have two documents, and therefore two rows and 5 columns? Then the function above does not work: `df1 <- data.frame(word =c("test", "test", "teste", "hey", "heyyy", "hi")); df1.1 <- data.frame(word =c("teste", "teste", "teste", "hey", "heyyy", "hi")); tdm <- as.DocumentTermMatrix(as.character(df1$word)); tdm2 <- as.DocumentTermMatrix(as.character(df1.1$word)); tdm <- c(tdm, tdm2)` – OAM Jun 24 '16 at 12:19
  • @OAM In that case you can use `colSums` and put the result on the lhs of `~`. – akrun Jun 24 '16 at 12:20
  • @OAM i.e. `tapply(colSums(as.matrix(tdm)), tdm$dimnames$Terms, FUN = sum)` – akrun Jun 24 '16 at 12:22
  • great and perfect! – OAM Jun 24 '16 at 12:24