
I'm using a for loop to create a document-term matrix. My actual problem uses an obscure package called RMeCab to tokenize Japanese text, but here is a more standard equivalent using strsplit. My current code:

Documents <- data.frame(Names= c("A","B"),Texts=c("A string of words","A different string"), stringsAsFactors = FALSE)
OUTPUT <- NULL
COMBINED <- NULL
i <- 1
for (i in 1:length(Documents$Texts)){
  OUTPUT <- data.frame(unlist(strsplit(Documents$Texts, " ")))
  OUTPUT$doc <- Documents$Names[i]
  COMBINED <- rbind(COMBINED, OUTPUT)
}
Document_Term_Matrix <- as.data.frame.matrix(table(COMBINED))

It works, but I'd like to use a more efficient apply function. If I run

L_OUTPUT <- lapply(Documents[,2],function(x) strsplit(x, " "))

I get the separate words as elements of a list, but how do I append the document name from Documents$Names?

More specifically, the list structure is:

[[1]]
これ です   は ぺん 
   1    1    1    1 

[[2]]
です   は   人   彼 
   1    1    1    1 

How do I get a data frame with one column containing the words, like this: これ は ぺん です 彼 は 人 です, and a second column showing the document names: One One One One Two Two Two Two?

The words correspond to the list elements [[1]], [[2]], etc.
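For the strsplit version of the problem, one way to attach the document names is to repeat each name once per token with rep() and lengths(), then cross-tabulate; a minimal sketch, using the two-row Documents data frame from the question:

```r
# Tokenize each text; strsplit returns one character vector per document
Documents <- data.frame(Names = c("A", "B"),
                        Texts = c("A string of words", "A different string"),
                        stringsAsFactors = FALSE)
tokens <- strsplit(Documents$Texts, " ")

# Repeat each document name once per token so the two columns line up
long <- data.frame(word = unlist(tokens),
                   doc  = rep(Documents$Names, lengths(tokens)),
                   stringsAsFactors = FALSE)

# Cross-tabulate into a document-term matrix; absent terms get 0
Document_Term_Matrix <- as.data.frame.matrix(table(long$word, long$doc))
```

The same pattern should carry over to any tokenizer that returns a list of word vectors, one per document.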

Sotos
Mark R
  • Why do you have two identical columns in your resulting matrix? Isn't `table(unlist(strsplit(Documents$Texts, ' ')))` what you need? – Sotos May 06 '16 at 17:55
  • It appears to me that your code is producing duplicates, so that the strings are repeated for both A and B. Is this what you want? – lmo May 06 '16 at 18:17
  • Maybe `lapply(strsplit(Documents$Texts, ' '), table)` ... ? – Sotos May 06 '16 at 18:17
  • The list structure of the output of RMeCab is different from strsplit, so my attempt to be helpful backfired. But tweaking your suggestion, `data.frame(sapply(RMeCabDF(Documents,2), table))` gets me there. I hadn't realized that a function could be used in the first part of sapply as the variable, with a second function after the comma. That's what's going on, right? – Mark R May 06 '16 at 18:22
  • Not exactly clear what the output you want is, but I think this might be it... `Documents$strlist <- strsplit(Documents$Texts, " "); Document_Term_Matrix <- sapply(Documents$strlist, table); names(Document_Term_Matrix) <- Documents$Names` – dww May 06 '16 at 18:23
  • No, I wrote too soon . . this code `lapply(RMeCabDF(Documents,2), table)` generates this list: `[[1]] これ です は ぺん 1 1 1 1 [[2]] です は 人 彼 1 1 1 1` How can I beat that into a data frame with zeroes for the terms that don't appear? – Mark R May 06 '16 at 18:34
  • Apologies for using strsplit as an example. I can't use the tm library because my texts are in Japanese and the function is an obscure tokenizer. I've updated the question to show the list structure and the data frame I'd like. – Mark R May 06 '16 at 18:44

2 Answers


It is better to use packages such as tm for these kinds of operations, but here is a solution using base R:

list1 <- strsplit(Documents$Texts, ' ')
v1 <- unique(unlist(list1))

Document_Term_Matrix <- as.data.frame(t(sapply(v1, function(i) lapply(list1, function(j)
                                                                      sum(grepl(i, j))))))
names(Document_Term_Matrix)<- Documents$Names
Document_Term_Matrix
#          A B
#A         1 1
#string    1 1
#of        1 0
#words     1 0
#different 0 1
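One caveat: grepl() does substring matching, so a short term that also occurs inside a longer token (which is easy to hit with Japanese text) would be over-counted. If exact token counts are needed, the same idea works with `==` instead; a sketch, assuming the same Documents data frame:

```r
Documents <- data.frame(Names = c("A", "B"),
                        Texts = c("A string of words", "A different string"),
                        stringsAsFactors = FALSE)
list1 <- strsplit(Documents$Texts, " ")
v1 <- unique(unlist(list1))

# Count exact token matches instead of substring hits;
# the outer sapply builds one column per document
dtm <- sapply(list1, function(doc) sapply(v1, function(w) sum(doc == w)))
colnames(dtm) <- Documents$Names
```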
Sotos
  • How could you adapt this to handle punctuation? – ajrwhite May 06 '16 at 19:33
  • `gsub` with some regex is popular for removing punctuation. Have a look [at this](http://stackoverflow.com/questions/17694438/how-do-i-remove-all-punctuation-from-a-string-except-comma-in-r) for example – Sotos May 06 '16 at 19:37

You can use functions from the tm package, which are suitable for large text datasets:

library(tm)

# create corpora from your documents
corp = VCorpus(DataframeSource(Documents), readerControl = list(reader = readTabular(mapping = list(content = "Texts", id = "Names"))))

# create term document matrix
tdm = TermDocumentMatrix(corp, control = list(tokenize = function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))
                                          , stopwords = FALSE
                                          , tolower = TRUE
                                          , weighting = weightTf))
inspect(tdm)

# get the result as a dense matrix; as.matrix keeps the Terms/Docs dimnames
tdm.m = as.matrix(tdm)

I also think there is a mistake in your question (but I cannot add comments). You're missing [i] in your for loop, so each iteration tokenizes all of the texts and every word ends up counted for every document. It should be something like this:

OUTPUT <- data.frame(unlist(strsplit(Documents$Texts[i], " ")))
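With that [i] in place, the loop from the question produces per-document counts; a runnable sketch for checking:

```r
Documents <- data.frame(Names = c("A", "B"),
                        Texts = c("A string of words", "A different string"),
                        stringsAsFactors = FALSE)
COMBINED <- NULL
for (i in seq_along(Documents$Texts)) {
  # Tokenize only the i-th text, then tag each token with its document name
  OUTPUT <- data.frame(word = unlist(strsplit(Documents$Texts[i], " ")),
                       stringsAsFactors = FALSE)
  OUTPUT$doc <- Documents$Names[i]
  COMBINED <- rbind(COMBINED, OUTPUT)
}
# Cross-tabulating the two columns gives the document-term matrix
Document_Term_Matrix <- as.data.frame.matrix(table(COMBINED))
```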
Lenka Vraná