I'm using a for loop to create a document-term matrix. My actual problem uses an obscure package called RMeCab to tokenize Japanese text, but here is a more standard equivalent using strsplit. My current code:
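# Example corpus: one row per document
Documents <- data.frame(Names = c("A", "B"),
                        Texts = c("A string of words", "A different string"),
                        stringsAsFactors = FALSE)
COMBINED <- NULL
for (i in seq_along(Documents$Texts)) {
  # Tokenize only the i-th document
  OUTPUT <- data.frame(word = unlist(strsplit(Documents$Texts[i], " ")),
                       stringsAsFactors = FALSE)
  OUTPUT$doc <- Documents$Names[i]  # label each token with its document name
  COMBINED <- rbind(COMBINED, OUTPUT)
}
# Cross-tabulate words (rows) against documents (columns)
Document_Term_Matrix <- as.data.frame.matrix(table(COMBINED))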
It works, but I'd like to use a more efficient apply function. If I run
L_OUTPUT <- lapply(Documents[,2],function(x) strsplit(x, " "))
I get the separate words as elements of a list, but how do I append the document name from Documents$Names?
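One possible way to attach the document names, sketched here with base R only: since strsplit is already vectorized over its input, the lapply wrapper may not be needed, and naming the resulting list lets stack() carry the names along. The column names word and doc and the variable names tokens and long are illustrative choices, not anything required by the functions.

```r
Documents <- data.frame(Names = c("A", "B"),
                        Texts = c("A string of words", "A different string"),
                        stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector of tokens per text
tokens <- strsplit(Documents$Texts, " ")

# Name each list element after its document
names(tokens) <- Documents$Names

# stack() flattens the named list into two columns:
# the tokens themselves and the name of the element they came from
long <- stack(tokens)
names(long) <- c("word", "doc")  # illustrative column names

# Cross-tabulate words (rows) against documents (columns)
Document_Term_Matrix <- as.data.frame.matrix(table(long))
```

This avoids growing a data frame with rbind inside a loop, which is usually the slow part of the original approach.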
More specifically, with a list structure like this:
[[1]]
これ です は ぺん
1 1 1 1
[[2]]
です は 人 彼
1 1 1 1
How do I get a data frame with a column like this
これ は ぺん です 彼 は 人 です
and a second column showing the document names
One One One One Two Two Two Two
with the words corresponding to the list elements [[1]], [[2]], etc.?
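For reference, here is a minimal sketch of the reshaping being asked for, assuming each list element is a named count vector like the ones printed above. The list is rebuilt by hand with setNames as a stand-in (it is not actual RMeCab output), and doc_names, long, and the column names are hypothetical.

```r
# Hand-built stand-in for the list shown above: one named count vector per document
L_OUTPUT <- list(
  setNames(rep(1, 4), c("これ", "です", "は", "ぺん")),
  setNames(rep(1, 4), c("です", "は", "人", "彼"))
)
doc_names <- c("One", "Two")  # assumed document names, one per list element

long <- data.frame(
  word  = unlist(lapply(L_OUTPUT, names)),       # the words, taken from the element names
  doc   = rep(doc_names, lengths(L_OUTPUT)),     # repeat each name once per word
  count = unlist(L_OUTPUT, use.names = FALSE),   # keep the frequencies alongside
  stringsAsFactors = FALSE
)
```

The key step is rep(doc_names, lengths(L_OUTPUT)), which repeats each document name as many times as its list element has entries, so the two columns stay aligned.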