Implementing N-grams in my corpus, Quanteda Error

Question

I am trying to implement quanteda on my corpus in R, but I am getting:

Error in data.frame(texts = x, row.names = names(x), check.rows = TRUE,  : 
  duplicate row.names: character(0)

I don't have much experience with this. Here is a download of the dataset: https://www.dropbox.com/s/ho5tm8lyv06jgxi/TwitterSelfDriveShrink.csv?dl=0

Here is the code:

tweets = read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors=FALSE)
corpus = Corpus(VectorSource(tweets$Tweet))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument)

quanteda.corpus <- corpus(corpus)

If you provide a reproducible example you will instantly enlarge the pool of possible answerers. In addition, once it gets answered it will benefit only you. By generalizing the problem you are helping future you and others down the line. — Roman Luštrik, Apr 14 '16 at 06:49
@gamelanguage, got the same error by using tm as in your approach, but you don't need tm, just quanteda, and stringsAsFactors = FALSE. — Chris, Apr 14 '16 at 10:41

score 1 · Accepted Answer · answered Apr 14 '16 at 10:30

The processing you are doing with tm prepares an object for tm, and quanteda doesn't know what to do with it. quanteda performs all of these steps itself, as can be seen from the options listed in help("dfm").

If you try the following you can move ahead:

dfm(tweets$Tweet, verbose = TRUE, toLower = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeTwitter = TRUE, language = "english", ignoredFeatures = stopwords("english"), stem = TRUE)

Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 6,943 documents
   ... indexing features: 15,164 feature types
   ... removed 161 features, from 174 supplied (glob) feature types
   ... stemming features (English), trimmed 2175 feature variants
   ... created a 6943 x 12828 sparse dfm
   ... complete.
Elapsed time: 0.756 seconds.

HTH
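The call above uses the argument names from the quanteda release current in 2016. In more recent releases (assumed here: quanteda 3 or later), the same steps are usually written as an explicit tokens pipeline. The following is a minimal sketch under that assumption; `tweets_text` is a hypothetical stand-in for `tweets$Tweet`, not the real data.

```r
library(quanteda)

# Hypothetical stand-in for tweets$Tweet: a plain character vector
tweets_text <- c(
  "Self-driving cars are coming to our streets in 2016!",
  "I would not trust a driverless car on these roads.",
  "Testing the new autonomous driving feature today"
)

# corpus() accepts a character vector directly, so tm is not needed
corp <- corpus(tweets_text)

# Tokenize, then apply each pre-processing step explicitly
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks, language = "english")

# Build the document-feature matrix from the processed tokens
dfmat <- dfm(toks)
print(dfmat)
```

Keeping each step as its own call makes it easy to drop or reorder one (for example, skipping stemming) without touching the rest.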

Ken Benoit · Answer 2 · 2018-01-28T15:41:59.757

No need to start with the tm package, or even to use read.csv() at all - this is what the quanteda companion package readtext is for.

So to read in the data, you can send the object created by readtext::readtext() straight to the corpus constructor:

myCorpus <- corpus(readtext("~/Downloads/TwitterSelfDriveShrink.csv", text_field = "Tweet"))
summary(myCorpus, 5)
## Corpus consisting of 6943 documents, showing 5 documents.
## 
##   Text Types Tokens Sentences Sentiment Sentiment_Confidence
##  text1    19     21         1         2               0.7579
##  text2    18     20         2         2               0.8775
##  text3    23     24         1        -1               0.6805
##  text5    17     19         2         0               1.0000
##  text4    18     19         1        -1               0.8820
## 
## Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
## Created: Thu Apr 14 09:22:11 2016
## Notes:
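If the data has already been loaded with read.csv(), the resulting data frame can also be passed to corpus() directly by naming the text column. This is a sketch, not the answerer's method; the three rows below are made up and only mimic the column names (Tweet, Sentiment, Sentiment_Confidence) visible in the summary above.

```r
library(quanteda)

# Hypothetical rows; the real file would come from
# read.csv("TwitterSelfDriveShrink.csv", stringsAsFactors = FALSE)
tweets <- data.frame(
  Tweet = c("Driverless cars look promising",
            "Not sure I trust a self-driving car",
            "Autonomous vehicles were tested today"),
  Sentiment = c(2, -1, 0),
  Sentiment_Confidence = c(0.76, 0.68, 1.00),
  stringsAsFactors = FALSE
)

# text_field names the column holding the document text;
# the remaining columns are kept as document variables
corp <- corpus(tweets, text_field = "Tweet")

summary(corp)
docvars(corp)
```

The extra columns travel with the corpus as docvars, which is why Sentiment and Sentiment_Confidence show up in the summary.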

From there, you can perform all of the pre-processing steps directly in the dfm() call, including the choice of ngrams:

# just unigrams
dfm1 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"))
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 15,577 feature types
## ... removed 161 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 2174 feature variants
## ... created a 6943 x 13242 sparse dfm
## ... complete. 
## Elapsed time: 0.662 seconds.

# just bigrams
dfm2 <- dfm(myCorpus, stem = TRUE, remove = stopwords("english"), ngrams = 2)
## Creating a dfm from a corpus ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 6,943 documents
## ... indexing features: 52,433 feature types
## ... removed 24,002 features, from 174 supplied (glob) feature types
## ... stemming features (English), trimmed 572 feature variants
## ... created a 6943 x 27859 sparse dfm
## ... complete. 
## Elapsed time: 1.419 seconds.
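The `ngrams` argument to dfm() shown above belongs to the quanteda release the answer was written against. In more recent releases (assumed here: quanteda 3 or later), n-grams are typically formed on the tokens object with tokens_ngrams() before dfm() is called. A minimal sketch with two made-up documents:

```r
library(quanteda)

# Hypothetical example documents
txt <- c(d1 = "self driving cars need better sensors",
         d2 = "better sensors make driving safer")

toks <- tokens(txt)
toks <- tokens_remove(toks, stopwords("english"))

# n = 2 gives bigrams only; n = 1:2 would keep unigrams and bigrams together
toks_bi <- tokens_ngrams(toks, n = 2)

dfm_bi <- dfm(toks_bi)
featnames(dfm_bi)
```

Adjacent tokens are joined with an underscore by default, so each bigram appears as a single feature name such as better_sensors.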
