
I am trying to fetch data in parallel from a MySQL database using R. The following code fetches the data one table at a time and works fine, but I want to speed up the process by sending multiple queries at once and saving the results into different variables. Later I will merge the time series held in those variables.

library(RMySQL)
con <- dbConnect(MySQL(), user='external', password='xxxxxxx', dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')

sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ad where tradetime between '2014-01-01' and '2015-10-20'")
data1 <- dbFetch(sqlData, n=-1)
dbClearResult(sqlData)
sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ty where tradetime between '2014-01-01' and '2015-10-20'")
data2 <- dbFetch(sqlData, n=-1)
dbClearResult(sqlData)
sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ax where tradetime between '2014-01-01' and '2015-10-20'")
data3 <- dbFetch(sqlData, n=-1)
dbClearResult(sqlData)

connections <- dbListConnections(MySQL())
for(i in connections) {dbDisconnect(i)}
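For the later merging step, here is a minimal sketch of joining the fetched data frames on TradeTime with base R's merge (toy data frames stand in for data1/data2/data3; for true time-series handling, xts or zoo would be the usual alternative):

```r
# Toy stand-ins for the data frames fetched above
data1 <- data.frame(TradeTime = c("2014-01-01", "2014-01-02"), Close = c(1.1, 1.2))
data2 <- data.frame(TradeTime = c("2014-01-01", "2014-01-02"), Close = c(2.1, 2.2))
data3 <- data.frame(TradeTime = c("2014-01-02", "2014-01-03"), Close = c(3.2, 3.3))

# Inner-join all series on TradeTime; only timestamps present in every
# series survive, and merge() suffixes keep the Close columns apart
merged <- Reduce(function(x, y) merge(x, y, by = "TradeTime"),
                 list(data1, data2, data3))
```

Only "2014-01-02" appears in all three toy series, so merged has one row and four columns (TradeTime plus one Close per instrument).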

I have tried to fetch the data in parallel using the following code:

library(foreach)
library(doParallel)
library(RMySQL)

fetchData <- function(nInst, inst1, inst2, inst3, inst4, inst5, startDate, endDate, con1){

  # pick the nInst-th instrument name
  inst <- switch(nInst, inst1, inst2, inst3, inst4, inst5)

  sqlData <- dbSendQuery(con1, paste0('select TradeTime, Open, High, Low, Close from ', inst,
                                      ' where tradetime between \'', startDate, '\' and \'', endDate, '\''))
  data1 <- dbFetch(sqlData, n=-1)
  dbClearResult(sqlData)
  print(head(data1))

  data1
}

cluster = makeCluster(5, type = "SOCK")
registerDoParallel(cluster)
mydb <- NULL
clusterEvalQ(cluster, {

  mydb <- dbConnect(MySQL(), user='external', password='xxxxxx', dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')
  NULL
})


allDataList <- foreach(n = 1:2, .verbose=TRUE, .packages='RMySQL') %dopar% {
  fetchData(n, inst1, inst2, inst3, inst4, inst5, startDate, endDate, mydb)

}
stopCluster(cluster)
on.exit(dbDisconnect(mydb))

Sometimes the code fetches data only for the first instrument and not for the rest.

Please assist if someone knows the solution.

Thanks,

1 Answer


I think the problem is that foreach is auto-exporting the mydb variable to the workers, defeating the purpose of initializing it with clusterEvalQ. Database connections can't be serialized and sent to other machines properly, which is why it's useful to initialize them manually with clusterEvalQ. The foreach .verbose=TRUE option lets you verify that mydb is not auto-exported; if it says that it is, you need to prevent it.

In your example, you can prevent mydb from being auto-exported by simply removing the mydb <- NULL statement, but I suggest that you use the foreach .noexport='mydb' option to be certain that it's never auto-exported. Here's a stripped-down example that does that:

library(doParallel)

fetchData <- function(ignore) {
  mydb
}

cluster <- makeCluster(5, type = "SOCK")
registerDoParallel(cluster)

clusterEvalQ(cluster, {
  mydb <- sample(100, 1) # different value for each worker
  NULL
})

r <- foreach(n=1:2, .noexport='mydb', .verbose=TRUE) %dopar% { 
  fetchData(n)
}

In this case, foreach analyzes the fetchData function and notices that it's using a variable named mydb. Thus, if mydb is defined on the master, it will auto-export it unless you tell it not to. That's why I suggest using .noexport='mydb' even if it's not defined in the local environment. It makes doubly sure that your function doesn't use a corrupt database connection.
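Applied to the question's code, the loop would look something like the sketch below (assuming fetchData, inst1 through inst5, startDate, and endDate are defined on the master; the mydb reference inside the loop body then resolves to each worker's own connection rather than anything serialized from the master):

```r
library(doParallel)

cluster <- makeCluster(5, type = "SOCK")
registerDoParallel(cluster)

# One connection per worker; mydb is never created on the master
clusterEvalQ(cluster, {
  library(RMySQL)
  mydb <- dbConnect(MySQL(), user='external', password='xxxxxx',
                    dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')
  NULL
})

allDataList <- foreach(n = 1:5, .noexport = 'mydb',
                       .packages = 'RMySQL') %dopar% {
  # mydb is found in the worker's global environment
  fetchData(n, inst1, inst2, inst3, inst4, inst5, startDate, endDate, mydb)
}

# Close the worker-side connections before shutting the cluster down
clusterEvalQ(cluster, dbDisconnect(mydb))
stopCluster(cluster)
```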

Steve Weston