
I am using Windows 7, a 32-bit operating system, with 4 GB of RAM, of which only about 3 GB is accessible due to 32-bit limitations. I shut everything else down and can see that I have about 1 GB cached and 1 GB available before starting; the "free" memory varies but is sometimes 0.

Using quanteda, I am reading a twitter.txt file with the textfile() command, which successfully creates a 157 MB corpusSource object. When I take the next step and convert it to a "corpus" using the corpus() command, R blasts through it and creates a very small, empty object with four elements, all containing 0s. Code and output follow:

twitterfile <- "./final/en_US/en_US.twitter.txt" 

precorp <- textfile(twitterfile)
corp <- corpus(twitterprecorp)
summary(corp)

Corpus consisting of 1 document.

              Text Types Tokens Sentences
 en_US.twitter.txt     0      0         0

Source:  C:/R_Data/Capstone/* on x86 by xxxxx
Created: Thu Aug 18 06:32:01 2016
Notes:   

Warning message:
In nsentence.character(object, ...) :
  nsentence() does not correctly count sentences in all lower-cased text

Any insights on why this may be happening?

Cyrus Mohammadian
  • you use `twitterprecorp` instead of `precorp` – HubertL Aug 18 '16 at 20:44
  • @HubertL points to an obvious issue that you need to check. Also is having a single document what you intended? Or does `en_US.twitter.txt` contain multiple "documents" in the form of multiple Tweets? – Ken Benoit Aug 19 '16 at 08:45
  • Thanks both of you. I updated the code as follows to make it simple and ended up with the same result: – B. McCracken Aug 19 '16 at 15:59
  • `twitterfile <- "./final/en_US/en_US.twitter.txt"; precorp <- textfile(twitterfile); corp <- corpus(precorp) # this is generating a corpus with 4 empty items; summary(corp)` — output: `Corpus consisting of 1 document. Text Types Tokens Sentences en_US.twitter.txt 0 0 0 Source: C:/R_Data/Capstone/* on x86 by WM7132 Created: Fri Aug 19 08:56:44 2016` Warning message: `In nsentence.character(object, ...) : nsentence() does not correctly count sentences in all lower-cased text` – B. McCracken Aug 19 '16 at 15:59
  • Ken, the en_US.twitter.txt file is one file of 238M tweets. I was creating a corpus with only one "document" so that I did not have to address which document in the corpus I was testing with. My goal is to use the corpus to try out all the commands in the good tutorials that are provided before I move forward with other aspects of the project. – B. McCracken Aug 19 '16 at 16:05
  • Try `nchar(texts(precorp))` -- what does it return? Also what is `packageVersion("quanteda")`? – Ken Benoit Aug 24 '16 at 14:24
  • Give the [**readtext** package](https://github.com/kbenoit/readtext) a try, it works very well with the new version of **quanteda** (v0.9.9), and is coming soon to CRAN. – Ken Benoit Jan 11 '17 at 17:50

1 Answer

textfile()

is giving you a character vector with a single element for the entire file. You probably want to use

readLines()

(note the capital "L" — R is case-sensitive) as in:

precorp <- readLines(twitterfile)

This will give you a character vector with one element for each line in the file. corpus() will then treat each element of the vector as a separate document when creating your corpus.
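A minimal end-to-end sketch of this approach (assuming the quanteda package is installed and using the file path from the question; the `skipNul` and `encoding` arguments are suggestions, not something the original answer specified — scraped Twitter data often contains embedded NUL bytes and non-ASCII characters):

```
library(quanteda)

twitterfile <- "./final/en_US/en_US.twitter.txt"

# Read the file line by line: each tweet (line) becomes one
# element of a character vector.
tweets <- readLines(twitterfile, skipNul = TRUE, encoding = "UTF-8")

# Build the corpus: corpus() treats each element of the
# character vector as a separate document.
corp <- corpus(tweets)

# Inspect the first few documents rather than summarizing the
# whole corpus, to keep memory use down on a 32-bit machine.
summary(corp, n = 5)
```

If memory is tight, you can also read the file in chunks by passing `n` to `readLines()` (e.g. `readLines(con, n = 10000)` on an open connection) and build the corpus from a sample before scaling up.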