I have a total of 54892 documents. After retrieving them from the database, how do I convert them to a corpus that can be used for topic modelling with LDA?

This is the code I have tried:

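library(RMySQL)
library(RTextTools)
library(topicmodels)
library(tm)

# Connect to the local MySQL database
con <- dbConnect(MySQL(), user = "root", password = "root", dbname = "dbtemp", host = "localhost")

# Fetch the text of all 54892 documents into a data frame
rs <- dbSendQuery(con, "select text_body from all_text;")
data <- fetch(rs, n = 54892)

# Check that every row was fetched, then release the result set and the connection
huh <- dbHasCompleted(rs)
dbClearResult(rs)
dbDisconnect(con)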

I referred to this page and noticed that data, as produced by the line data <- NYTimes[sample(1:3100,size=1000,replace=FALSE),], is a two-column table along with another table containing something called TopicCode. This data is then converted to a term-document frequency matrix. How do I get that TopicCode from the two columns that I retrieved from the database?
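For reference, a minimal sketch of the text-to-matrix step described above, using tm directly rather than RTextTools. The small data frame is a hypothetical stand-in for the result of fetch(), and the cleaning steps are illustrative choices rather than anything taken from the referenced page.

```r
library(tm)

# Hypothetical stand-in for the data frame returned by fetch();
# the real one would hold all 54892 rows of text_body.
data <- data.frame(
  text_body = c("The cat sat on the mat.",
                "Dogs and cats are common pets.",
                "Topic models group related words."),
  stringsAsFactors = FALSE
)

# One document per row of the text_body column
corpus <- VCorpus(VectorSource(data$text_body))

# Basic cleaning (assumed preprocessing, adjust as needed)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Documents in rows, terms in columns, raw term counts as entries
dtm <- DocumentTermMatrix(corpus)

# Drop documents that have no terms left after cleaning
dtm <- dtm[slam::row_sums(dtm) > 0, ]
```

A matrix built this way can be passed to LDA() from topicmodels together with a chosen number of topics k. As far as I can tell, the topic code column in the NYTimes example is a hand-assigned label, so an unsupervised LDA fit should not need an equivalent column from the database.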

I have tried a similar problem in Python, where I converted the data to Matrix Market format. I thought of using this file for further computations in R. I read the file using b <- readMM(file="PRC.mm"), and when I printed b I got a 336331 x 88 matrix that looked like:

. . 2 . . . . . . 1 1 . 1 . . 1 . 2 . . . . . . . . . . . . . ......
. 1 . . . . . . 1 1 . . . . . . . . . . . . . . . . . . . . . ......
. . . . . . . . . 1 1 1 . . . 2 . . . . . . . 1 . . 1 . . . . ......
. . 1 . . . 2 . . . . 1 1 . . . . . . . 1 . . . . . . . . . . ......

where . means 0. This looks like a term-document matrix, but I still want to build this kind of matrix in R. What should I do?
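A minimal sketch of how a sparse matrix like b might be turned into a tm document-term matrix, assuming its rows are documents and its columns are terms (if it is the other way round, it would need t(b) first). The tiny matrix below is a hypothetical stand-in for the result of readMM(), and the doc/term names are made up for illustration.

```r
library(Matrix)
library(slam)
library(tm)

# Hypothetical stand-in for b <- readMM(file = "PRC.mm")
b <- sparseMatrix(i = c(1, 1, 2, 3, 3),
                  j = c(1, 3, 2, 1, 3),
                  x = c(2, 1, 1, 1, 2),
                  dims = c(3, 3))

# summary() on a sparse Matrix lists the non-zero entries
# as 1-based (i, j, x) triplets
s <- summary(b)

# Rebuild the same counts as a slam simple_triplet_matrix,
# the sparse format that tm uses internally
stm <- simple_triplet_matrix(
  i = s$i, j = s$j, v = s$x,
  nrow = nrow(b), ncol = ncol(b),
  dimnames = list(Docs  = paste0("doc", seq_len(nrow(b))),
                  Terms = paste0("term", seq_len(ncol(b))))
)

# Mark it as a document-term matrix with plain term-frequency weighting
dtm <- as.DocumentTermMatrix(stm, weighting = weightTf)
```

The point of going through the triplet form is that the matrix stays sparse throughout, which matters at 336331 x 88 and beyond.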

Animesh Pandey
  • Can you be more clear regarding your question? What have you tried towards making a corpus? – Roman Luštrik Feb 02 '14 at 16:31
  • I am using R for LDA for the first time. I have used Python for this purpose before - [here](http://stackoverflow.com/questions/16254207/can-we-use-a-self-made-corpus-for-training-for-lda-using-gensim). I have a corpus in Matrix Market format. – Animesh Pandey Feb 02 '14 at 16:42
  • So, now that you have done something (that we are unable to reproduce but for which you show no errors) please provide what `str(data)` and `str(huh)` return. Otherwise we must make unsupported guesses. – IRTFM Feb 02 '14 at 17:09
  • @RomanLuštrik I have edited the question. I think it might explain better. – Animesh Pandey Feb 02 '14 at 17:27
  • @IShouldBuyABoat I have edited the question. I think it might explain better. – Animesh Pandey Feb 02 '14 at 17:27

0 Answers