
I have a curious error that only happens in my colleagues' RStudio when they run the code. The code works with a text corpus, and this is what I do:

ap.corpus <- corpus(raw.data$text)
ap.corpus
#Corpus consisting of 214,226 documents and 0 docvars.
ap.corpus <- Corpus(VectorSource(ap.corpus))
ap.corpus <- tm_map(ap.corpus, tolower)
ap.corpus <- corpus(ap.corpus)

The last step is just reformatting before I get to the model. I run this code smoothly with no issues. My two colleagues, on the other hand, run exactly the same code on exactly the same data and get the following error after ap.corpus <- corpus(ap.corpus): nrow(docvars)==length(x) is not TRUE

We tried restarting RStudio and running the code on a smaller corpus (only 500 documents), but got the same error. I'm hoping someone else has experienced a similar error. It doesn't appear to be a problem with the code, as I have never seen this error when running this or similar code in my RStudio. Note: my colleague also ran the code in plain R, without RStudio, and hit the same issue.

Len Greski
Nat
  • Have you run `sessionInfo()` on each machine to see whether there are any differences in the package versions between you and your colleague? Also, can you reproduce the error on your colleague's machine with only 5 documents? If so, would you please use `dput()` and post the data for 5 documents so your question is reproducible? – Len Greski Jan 05 '18 at 16:50
  • Thank you, Len, for the suggestions. I will do so. Unfortunately I won't be able to try it today, as the other computer is in India, but I'll test it first thing once we connect again. – Nat Jan 05 '18 at 16:57
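
The version check suggested in the comment above can be sketched in base R as follows. This is only a minimal sketch: the package list is an assumption based on the functions in the question (`corpus()` from quanteda; `Corpus()`, `VectorSource()` and `tm_map()` from tm), and `pkgs` and `installed` are illustrative names. Running it on each machine and comparing the output is a quicker first pass than reading through the full `sessionInfo()` listing.

```r
# Packages assumed to be involved in the failing step (inferred from the
# function names in the question, not confirmed by the asker)
pkgs <- c("quanteda", "tm")

# Installed version of each package as a string, or NA if it is not installed
installed <- vapply(pkgs, function(p) {
    if (requireNamespace(p, quietly = TRUE)) {
        as.character(utils::packageVersion(p))
    } else {
        NA_character_
    }
}, character(1))

# Print the R version and the package versions for side-by-side comparison
print(R.version.string)
print(installed)
```

If the versions differ between machines, updating the older installation before retrying the failing `corpus()` call is a reasonable next step.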

1 Answer


This is impossible to verify without a reproducible example, but I have created one here since this might have been a bug. Based on my attempt to reproduce the reported error, however, I don't think that it is.

This sort of question would be better filed as an issue at the quanteda GitHub issues site rather than as an SO question. But it is good to address it here, since I will also show you a way to avoid the use of tm (even though your example does not specify that, it is clear you are using some of its functions).

library("quanteda")
## quanteda version 0.99.22
## Using 7 of 8 threads for parallel computing

ap.corpus <- corpus(LETTERS[1:10])
ap.corpus
## Corpus consisting of 10 documents and 0 docvars.
texts(ap.corpus)
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "A"    "B"    "C"    "D"    "E"    "F"    "G"    "H"    "I"    "J" 

ap.corpus <- tm::Corpus(tm::VectorSource(ap.corpus))
## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 10
ap.corpus <- tm::tm_map(ap.corpus, tolower)

corpus(ap.corpus)
## Corpus consisting of 10 documents and 0 docvars.
corpus(ap.corpus) %>% texts()
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j" 

So that all appears to work just fine.

However, there is no need to use tm for this. You could have done the following in quanteda:

ap.corpus2 <- corpus(LETTERS[1:10])
texts(ap.corpus2) <- char_tolower(texts(ap.corpus2))
texts(ap.corpus2)
## text1  text2  text3  text4  text5  text6  text7  text8  text9 text10 
##   "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j" 

However, we discourage you from modifying your corpus directly, since this is a destructive change that will mean that you cannot recover the cased version of your texts, should you wish to use these for other purposes.

Much better to use a workflow such as:

corpus(c("A B C", "C D E")) %>%
    tokens() %>%
    tokens_tolower()

## tokens from 2 documents.
## text1 :
## [1] "a" "b" "c"
## 
## text2 :
## [1] "c" "d" "e"
Ken Benoit