I have a system uploading data from Twitter's Streaming API to a MySQL database in two steps:
1) Data stream in via R (code adapted from streamR) into two separate files a day, an AM file and a PM file.
2) Twice a day, I have the data parsed and organized in R, and then uploaded to a MySQL database using RMySQL.
The data show up in MySQL with a lag of 2-3 days, and the lag seems to keep increasing. RMySQL returns "DONE" after the upload, but the rows are not queryable until a few days later. My collaborators pride themselves on real-time analysis of tweet data, so this is a nuisance.
Parsing/uploading code:
library(streamR)
tweets <- parseTweets("parse/after12.json")
# clean the text so that MySQL doesn't choke on quotes and control characters
cleanText <- function(text_vector){
  text_vector <- sapply(text_vector, function(x) gsub("\'", "\"", x))      # single -> double quotes
  text_vector <- sapply(text_vector, function(x) gsub("[\b\t\n]", " ", x)) # control chars -> spaces
  return(text_vector)
}
tweets[, 1:length(tweets)] <- cleanText(tweets[, 1:length(tweets)])
# convert created_at to a format MySQL can read; %H:%M:%S is spelled out because
# %X is locale-dependent, and as.POSIXct keeps the column atomic for dbWriteTable
tweets$created_at <- as.POSIXct(tweets$created_at, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
library(chron)
library(RMySQL)
# Establish connection to database
mgr <- dbDriver("MySQL")
mycon <- dbConnect(mgr, user="user", host="name", dbname="dbname", password="pwd")
# build the monthly table name data_YYYY_MM from the current date
tableDate <- format(Sys.time(), "%Y_%m")
dbWriteTable(mycon, paste("data_", tableDate, sep = ""), tweets, row.names = FALSE, append = TRUE)
#upload complete
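As a sanity check on the script above, the created_at format string and the monthly table-name pattern can be verified on a fixed input; the sample timestamp below is illustrative, not from my data, and note %H:%M:%S rather than the locale-dependent %X:

```r
# Force the C locale so %a/%b match English day/month abbreviations
Sys.setlocale("LC_TIME", "C")

# Twitter's created_at looks like "Wed Aug 27 13:08:45 +0000 2014"
ts <- as.POSIXct("Wed Aug 27 13:08:45 +0000 2014",
                 format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
format(ts, "%Y-%m-%d %H:%M:%S")  # "2014-08-27 13:08:45"

# Monthly table name in the data_YYYY_MM pattern used above
format(ts, "data_%Y_%m")  # "data_2014_08"
```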
Also note that there is no indication that any process is running on the machine hosting the MySQL database while this lag occurs.
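One way I can make the lag visible is to count the rows that are actually queryable right after dbWriteTable() returns. This is a sketch assuming the connection (mycon) and table naming above; buildCountQuery() is a hypothetical helper of mine, not an RMySQL function:

```r
# Hypothetical helper: build a COUNT(*) query for one monthly table
buildCountQuery <- function(tableDate) {
  sprintf("SELECT COUNT(*) AS n FROM data_%s", tableDate)
}

# Against a live connection (commented out here; dbGetQuery is from RMySQL):
# n_visible <- dbGetQuery(mycon, buildCountQuery("2014_08"))$n
# if the write were visible immediately, n_visible would grow by nrow(tweets)
# right after each upload, rather than days later
```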
Any suggestions would be appreciated!