
I have a system uploading data from Twitter's Streaming API to a MySQL database in two steps:

1) Data come in via R into two separate files a day, an AM file and a PM file, using code adapted from streamR.

2) Twice a day, I have the data parsed and organized in R, and then uploaded to a MySQL database using RMySQL.

There is a 2-3 day lag before the data show up in MySQL, and this lag seems to keep increasing. RMySQL returns "DONE" after uploading, but the data are not queryable until a few days later. My collaborators pride themselves on real-time analysis of tweet data, so this is a nuisance.

Parsing/uploading code:

    library(streamR)
    tweets <- parseTweets("parse/after12.json")

    # clean data so that MySQL doesn't freak
    cleanText <- function(text_vector){
      text_vector <- sapply(text_vector, function(x) gsub("\'", "\"", x))
      text_vector <- sapply(text_vector, function(x) gsub("[\b\t\n]", " ", x))
      return(text_vector)
    }

    tweets[, 1:length(tweets)] <- cleanText(tweets[, 1:length(tweets)])

    # change date data to a format MySQL can read
    tweets$created_at <- as.POSIXlt(tweets$created_at, format = "%a %b %d %X %z %Y", tz = "UTC")

    library(chron)
    library(RMySQL)

    # establish connection to the database
    mgr <- dbDriver("MySQL")
    mycon <- dbConnect(mgr, user="user", host="name", dbname="dbname", password="pwd")

    # build a table name of the form data_YYYY_MM from the current date
    currentTime <- Sys.time()
    currentDate <- unlist(lapply(strsplit(as.character(currentTime), " "), function(x) x[1]))
    currentDate <- unlist(lapply(strsplit(as.character(currentDate), "-"), function(x) x[1:3]))

    tableDate <- paste(currentDate[1], currentDate[2], sep="_") # table to load into

    dbWriteTable(mycon, paste("data_", tableDate, sep=""), tweets, row.names=FALSE, append=TRUE)

    # upload complete

Also note that there is no indication of any process running on the machine that holds the MySQL database while this lag happens.

Any suggestions would be appreciated!

R-Enthusiast
  • This seems very strange to me. I would check from R by querying the data just uploaded right after the `dbWriteTable` command. If you see it that way, then the problem may be more likely in the db or your colleague's connection than in R. – joran Jul 06 '13 at 22:59
  • BTW: Do you close your DB connection? – cryo111 Jul 07 '13 at 02:04
  • @cryo111, should I close the connection to make sure they upload? – R-Enthusiast Jul 07 '13 at 02:48
  • I would close the connection after your script has finished updating the DB. Could be that your MySQL DB storage engine is InnoDB and that no commit is executed unless you properly close your connection. Since it's only one additional line of code, it's worth a try. – cryo111 Jul 07 '13 at 15:31
  • @cryo111 the data have now simply stopped uploading, even after returning "TRUE" from the command line in R. Any suggestions? – R-Enthusiast Jul 10 '13 at 01:32
  • Do you mean that the data are not fully loaded into the database even though R states success (i.e. TRUE)? I have only used RODBC to connect to my MySQL database so far, so I have no experience with RMySQL. How large is the chunk of data you are loading into the table? If it is a large chunk, you could consider writing the data to a CSV first and using the MySQL command `LOAD DATA INFILE` to read from the CSV. This way you could also determine whether it's the R->MySQL connection that causes the trouble. If the problem persists: are there any triggers etc. in your SQL table that could cause problems? – cryo111 Jul 10 '13 at 02:41
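
Pulling the comments' suggestions together, here is a minimal sketch of how the end of the upload script could look, assuming the `mycon`, `tweets`, and `tableDate` objects defined in the question. It verifies the write from the same session (joran's suggestion), sends an explicit COMMIT in case the table's storage engine is InnoDB, and closes the connection (cryo111's suggestion):

    tableName <- paste("data_", tableDate, sep="")
    dbWriteTable(mycon, tableName, tweets, row.names=FALSE, append=TRUE)

    # check the row count from the same connection; if it grows here but the data
    # are still invisible to collaborators, the issue is on the MySQL side, not in R
    print(dbGetQuery(mycon, paste("SELECT COUNT(*) AS n FROM", tableName)))

    # if the table uses InnoDB, make sure the rows are committed; sending a raw
    # COMMIT avoids relying on dbCommit() support in older RMySQL versions
    res <- dbSendQuery(mycon, "COMMIT")
    dbClearResult(res)

    # close the connection once the script is done
    dbDisconnect(mycon)

And a sketch of the CSV/`LOAD DATA INFILE` route mentioned in the last comment, useful for ruling out the R->MySQL connection; the file and table names here are only examples, and `LOAD DATA LOCAL INFILE` has to be enabled on both server and client:

    # dump the parsed tweets to a CSV and bulk-load it with MySQL's loader
    write.csv(tweets, "tweets_batch.csv", row.names=FALSE)
    res <- dbSendQuery(mycon, "LOAD DATA LOCAL INFILE 'tweets_batch.csv'
                               INTO TABLE data_2013_07
                               FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
                               IGNORE 1 LINES")
    dbClearResult(res)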

0 Answers