I'd like to cast my table into a DTM and maintain the metadata.

[Image: Sample Data]

Each row should be a document. However, cast_dtm() needs a count variable: before I can cast, the data has to be in "Document, Term, Count" format.

How do I convert my data into the "Document, Term, Count" dataframe? From there, it's easy to cast into a DTM, and then do what I need.
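For reference, the "Document, Term, Count" shape that cast_dtm() expects can be sketched with tidytext and dplyr. This is only an illustration under assumptions: the data.frame is the one reconstructed in the quanteda answer below, and the `doc_id` column and the `term_counts` name are made up here so that each row counts as its own document.

```r
library(dplyr)
library(tidytext)

# Assumed sample data, copied from the reconstruction later in this thread
df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)

# Hypothetical id column: one document per row, even when comments repeat
df$doc_id <- paste0("doc", seq_len(nrow(df)))

# One row per word, then one row per (document, term) pair with its count
term_counts <- df %>%
  unnest_tokens(output = term, input = Comments) %>%
  count(doc_id, term, name = "n")

# Cast the long table into a tm DocumentTermMatrix (the tm package must be installed)
dtm <- cast_dtm(term_counts, document = doc_id, term = term, value = n)

# The metadata stays in df and can be lined up with the DTM rows by doc_id
dtm_meta <- df[match(rownames(dtm), df$doc_id), c("doc_id", "Date", "Group", "Reporting")]
```

The DTM itself only stores counts, so in this sketch the metadata is kept in a separate table that shares the document ids.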

Alex
    It would have been better to post code that creates the data.frame rather than an image. For instance, you could have included the code I added in my answer below to create the object. – Ken Benoit Jun 21 '17 at 21:40

2 Answers

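Try this:

library(tm)
# VectorSource() treats each element of its input as one document, so pass
# the text column (assumed here to be df$Comments), not the whole data.frame
myCorpus <- Corpus(VectorSource(df$Comments))
dtm <- DocumentTermMatrix(myCorpus)

I've used this code for a text-mining project. Note that passing the whole data.frame (df) instead of a single column (df$column) would treat each column, rather than each row, as a document.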
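To keep the other columns as metadata within tm, one option is DataframeSource(), which in recent tm versions expects the first two columns to be named `doc_id` and `text` and carries any remaining columns as document-level metadata. A minimal sketch, assuming the sample data reconstructed elsewhere in this thread; `tm_input` and the generated ids are names made up for the example:

```r
library(tm)

# Assumed sample data, matching the reconstruction in the quanteda answer
df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)

# Reshape into the layout DataframeSource() expects:
# doc_id, text, then any metadata columns
tm_input <- data.frame(doc_id = paste0("doc", seq_len(nrow(df))),
                       text = df$Comments,
                       df[, c("Date", "Group", "Reporting")],
                       stringsAsFactors = FALSE)

myCorpus <- VCorpus(DataframeSource(tm_input))

# wordLengths = c(1, Inf) keeps short words such as "is" and "no",
# which the default minimum length of 3 would drop
dtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

# Document-level metadata travels with the corpus
corpus_meta <- meta(myCorpus)
```

Here each row becomes one document, and `corpus_meta` should hold the Date, Group and Reporting columns in the same row order as the DTM.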

sweetmusicality

You can also use the quanteda package.

To recreate your data.frame:

df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)
df
#         Date Group Reporting            Comments
# 1 2015-01-01  Cars         A This car is awesome
# 2 2015-01-01  Cars         A This car is awesome
# 3 2015-01-03  Cars         A This car is awesome
# 4 2015-01-01  Cars         B         No comments

Short way to a document-term matrix:

library(quanteda)
dfm(df$Comments)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
#        features
# docs    this car is awesome no comments
#   text1    1   1  1       1  0        0
#   text2    1   1  1       1  0        0
#   text3    1   1  1       1  0        0
#   text4    0   0  0       0  1        1

Long way to a dfm, keeping the metadata:

Make a corpus from the data.frame, which stores the remaining columns as document variables:

require(quanteda)
myCorpus <- corpus(df, text_field = "Comments")
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
#  Text Types Tokens Sentences       Date Group Reporting
# text1     4      4         1 2015-01-01  Cars         A
# text2     4      4         1 2015-01-01  Cars         A
# text3     4      4         1 2015-01-03  Cars         A
# text4     2      2         1 2015-01-01  Cars         B
# 
# Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Wed Jun 21 23:34:35 2017
# Notes:  

Then create the dfm from the corpus:

dfm(myCorpus)
# Document-feature matrix of: 4 documents, 6 features (41.7% sparse).
# 4 x 6 sparse Matrix of class "dfmSparse"
#        features
# docs    this car is awesome no comments
#   text1    1   1  1       1  0        0
#   text2    1   1  1       1  0        0
#   text3    1   1  1       1  0        0
#   text4    0   0  0       0  1        1
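The output above is from a 2017 release of quanteda. In more recent releases (version 3 and later, as far as I am aware), dfm() expects a tokens object rather than a corpus or character vector. A sketch of the same steps under that assumption, which also shows that the document variables stay attached to the dfm and that the result can be converted to a tm DocumentTermMatrix (this last step needs the tm package installed); `myDfm` is just a name chosen for the example:

```r
library(quanteda)

df <- data.frame(Date = c("2015-01-01", "2015-01-01", "2015-01-03", "2015-01-01"),
                 Group = "Cars",
                 Reporting = c(rep("A", 3), "B"),
                 Comments = c(rep("This car is awesome", 3), "No comments"),
                 stringsAsFactors = FALSE)

# Columns other than the text field become document variables
myCorpus <- corpus(df, text_field = "Comments")

# Tokenise first, then build the document-feature matrix
myDfm <- dfm(tokens(myCorpus))

# The metadata is still attached, one row per document
docvars(myDfm)

# Convert to a tm DocumentTermMatrix if that class is needed downstream
dtm <- convert(myDfm, to = "tm")
```

The document variables live on the dfm rather than on the converted tm object, so it may be worth saving `docvars(myDfm)` before converting.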
Ken Benoit