I want to build a text corpus of 100 million tweets using tm.plugin.dc, the distributed-corpus plugin for R's tm package. The tweets are stored in a large MySQL table on my laptop. The laptop is old, so I am doing the heavy lifting on a Hadoop cluster that I set up on Amazon EC2.
The tm.plugin.dc documentation on CRAN says that only DirSource is currently supported, and it seems to suggest that DirSource allows only one document per file. I need the corpus to treat each tweet as a separate document. With 100 million tweets, does that mean I have to create 100 million files on my old laptop? That seems excessive. Is there a better way? (My reading of the intended usage is sketched below.)
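For reference, here is roughly what I understand the intended usage to be, based on my reading of the CRAN manual; the exact constructor arguments are my guess, so they may be off:

```r
## One document per file under tweets/ -- the layout the docs seem to require.
## Argument names are my best guess from the CRAN manual and may be off.
library(tm)
library(tm.plugin.dc)

dc <- DistributedCorpus(DirSource("tweets/", encoding = "UTF-8"),
                        readerControl = list(language = "en"))
## There is also (I think) a storage argument for keeping the corpus chunks
## on HDFS, but that doesn't remove the one-file-per-document requirement.
```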
What I have tried so far:
Attempt 1: Dump the MySQL table to a single (massive) .sql file. Upload the file to S3, transfer it from S3 to the cluster, and load the data into Hive using Sqoop. Now what? I can't figure out how to make DirSource work with Hive; a sketch of what I was hoping to write is below.
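To make the dead end concrete, this is the kind of thing I was hoping to write. Note that HiveSource() is made up; as far as I can tell, neither tm nor tm.plugin.dc provides anything like it, which is exactly where I am stuck:

```r
## What I *wish* I could write. HiveSource() is hypothetical -- neither tm
## nor tm.plugin.dc provides it, as far as I can tell.
library(tm)
library(tm.plugin.dc)

dc <- DistributedCorpus(HiveSource("tweets"),  # hypothetical, does not exist
                        readerControl = list(language = "en"))
```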
Attempt 2: Turn each tweet into its own XML file on my laptop. But how? My computer is old and can't do this well (my chunked-export attempt is sketched below). ... If I could get past that, then I would upload all 100 million XML files to a folder on Amazon S3, copy the S3 folder to the Hadoop cluster, and point DirSource at that folder.
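Here is the gist of the chunked export I tried, fetching rows in batches so the laptop never holds the whole table in memory. The table and column names ("tweets", "id", "text") are placeholders for my actual schema:

```r
## Page through the MySQL table in chunks and write one XML file per tweet.
## Table/column names ("tweets", "id", "text") are placeholders.
library(DBI)
library(RMySQL)

con <- dbConnect(RMySQL::MySQL(), dbname = "twitter",
                 user = "me", password = "secret")
dir.create("tweets_xml", showWarnings = FALSE)

res <- dbSendQuery(con, "SELECT id, text FROM tweets")
repeat {
  chunk <- dbFetch(res, n = 10000)   # 10k rows at a time
  if (nrow(chunk) == 0) break
  for (i in seq_len(nrow(chunk))) {
    ## CDATA keeps markup in tweet text from breaking the XML (naively --
    ## a tweet containing "]]>" would still break it).
    xml <- sprintf("<tweet id=\"%s\"><text><![CDATA[%s]]></text></tweet>",
                   chunk$id[i], chunk$text[i])
    writeLines(xml, file.path("tweets_xml", paste0(chunk$id[i], ".xml")))
  }
}
dbClearResult(res)
dbDisconnect(con)
```

Even if this runs to completion, writing 100 million tiny files is exactly the step that seems excessive, which is why I am asking whether the one-file-per-document route is really necessary.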