How to make R tm corpus of 100 million tweets?

Question

I want to make a text corpus of 100 million tweets using R’s distributed computing tm package (called tm.plugin.dc). The tweets are stored in a large MySQL table on my laptop. My laptop is old, so I am using a Hadoop cluster that I set up on Amazon EC2.

The tm.plugin.dc documentation from CRAN says that only DirSource is currently supported. The documentation seems to suggest that DirSource allows only one document per file. I need the corpus to treat each tweet as a document. I have 100 million tweets -- does this mean I need to make 100 million files on my old laptop? That seems excessive. Is there a better way?

What I have tried so far:

Make a file dump of the MySQL table as a single (massive) .sql file. Upload the file to S3. Transfer the file from S3 to the cluster. Import the file into Hive using Cloudera’s Sqoop tool. Now what? I can’t figure out how to make DirSource work with Hive.
Make each tweet an XML file on my laptop. But how? My computer is old and can’t do this well. ... If I could get past that, then I would: Upload all 100 million XML files to a folder in Amazon’s S3. Copy the S3 folder to the Hadoop cluster. Point DirSource to the folder.

How are the tweets stored in your database? As XML? and how big is the file? I assume you are interested in the content of each tweet and build the corpus out of it. You can read the corpus from a single file, you do not need to create millions of files. — Dr VComas, May 15 '13 at 21:08
@Dr VComas: I'm not sure what you mean by 'how are the tweets stored' - I've got different varchar columns in my table, if that's what you mean. For example, one columns is called textOfTweet, and another is called statusID. MySQL lets me export the contents as an XML file, but the tweets are not stored as XML files. I can export in a variety of formats: csv, tab, sql, XML. Yup, I am interested in the content of each tweet. Each tweet also has unique meta data associated with it that I want to keep. — user554481, May 15 '13 at 23:14
@Dr VComas: Is there a way to create multiple documents from a single file in a distributed way? For example on Amazon EC2? My table is slightly over 10 GB (it grows each day). — user554481, May 15 '13 at 23:17
You can build your corpus from a csv file for example with the text in one the columns. I have done that for smaller files, in your case the size is the problem, do not know if it helps but you can expand your corpus easily, in the case you have several csv files with the tweets. It should be an easier way for sure. — Dr VComas, May 16 '13 at 14:32
@Dr VComas: Is it possible to extract multiple 'documents' from a single file? Would it be possible to do the extraction in a parallel or distributed way? Would you need to write a custom Reader function? — user554481, May 17 '13 at 00:23
looks like you have to write a custom reader function, but you should contact the maintainer http://statmath.wu.ac.at/~theussl/ — Patrick McCann, Jun 12 '13 at 21:57

score 4 · Answer 1 · answered May 19 '14 at 13:44

wouldn't be easier and more reasonable to make huge HDFS file with 100 million tweets and then process them by standard R' tm package?

This approach seems to me more natural since HDFS is developed for big files and distributed environment while R is great analytical tool but without parallelism (or limited). Your approach looks like using tools for something they were not developed for...

score 2 · Answer 2 · answered Jun 25 '13 at 23:50

2

I would strongly recommend to check this url http://www.quora.com/How-can-R-and-Hadoop-be-used-together. This will give you necessary insights to your problem.

answered Jun 25 '13 at 23:50

Siva Karthikeyan

544
8
25

score 2 · Answer 3 · answered Jun 17 '14 at 07:41

TM package basically works on term and document model. It creates a term document matrix or document term matrix. This matrix contains features like term (word) and its frequency in the document. Since you want to perform analysis on twitter data you should have each tweet as document and then you can created TDM or DTM. And can perform various analysis like finding associations, finding frequencies or clustering or calculating TDF-IDF measure etc.

You need to build a corpus of directory source. So you need to have base directory which contains individual documents which is your tweet.

Depending on the OS you are using, What I would have done if windows will create .bat file or a simple javascript or java code to read the MySQL rows for the tweet file and FTP it a directory present on your local file system of Hadoop Box.

Once the files were FTP's we can copy the directory to HDFS by using Hadoop Copy From Local Command.

How to make R tm corpus of 100 million tweets?

3 Answers3