2

I am well practiced at simple implementations in R, however I am new to communicating to SQL through R and also parallel programming (no prior experience of these two things before today). I have written the below code with prompts from blogs, forums etc.

library(doParallel) 

library(RMySQL) 

library(DBI)

library(foreach)

cl <- makeCluster(12)

registerDoParallel(cl)

Postcodecsv <- read.csv("C:/Users/Henry Crosby/Desktop/PostcodeLatLong.csv")

mydb = dbConnect(MySQL(), user='****', password="******* ****", 

dbname='population_distance', host='****.**.*.*')

dbListFields(mydb,'Postcodes')

foreach (a = 1:120000, .combine="rbind") %dopar% {

  Done <- dbGetQuery(mydb, paste("select FID, Postcode2, (6371 * acos( cos( 
radians( ",Postcodecsv[a,6],"))*cos(radians(latitude))*cos(radians(Longitude)-radians(",Postcodecsv[a,5],"))+sin(radians(",Postcodecsv[a,6],") )* sin( radians( latitude ) ) ) ) AS distance from Postcodes having distance < 2 ORDER BY distance",sep=" "))

write.table(Done,file="C:/Users/Henry Crosby/Desktop/2km.csv",append=TRUE, col.names=FALSE, sep=",")

  } 

This calculation works in a for loop but takes forever (I have to apply this to a LARGE dataset!). When I run the above code I get the error below! Can someone tell me why the error is coming and how I can work around it!

Error in { : task 1 failed - "could not find function "dbGetQuery""

ham-sandwich
  • 3,975
  • 10
  • 34
  • 46
  • I have found in https://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf that I must add .packages to the foreach when done in parallel. IE foreach (a = 1:120, .combine="rbind", .packages=c("DBI","RMySQL")). However I have a new error Error in { : task 1 failed - "Corrupt MySQL handle" – Henry Crosby Jan 06 '16 at 12:33

2 Answers2

1

The problem is that foreach is auto-exporting the mydb object, but it doesn't work correctly because objects that contain socket connections can't be serialized and copied between processes.

I suggest that you initialize the cluster workers using the clusterEvalQ function to create mydb on each of the workers. You could try something like:

clusterEvalQ(cl, {
  library(RMySQL)
  mydb <- dbConnect(MySQL(), user='****', password="******* ****",
                    dbname='population_distance', host='****.**.*.*')
  NULL
})

However, you still need to prevent foreach from auto-exporting mydb by using the foreach .noexport="mydb" option. You can also use the foreach .verbose=TRUE option to verify what objects are being auto-exported to the workers.

Steve Weston
  • 19,197
  • 4
  • 59
  • 75
0

Why use R to handle the sql to csv migration? Consider a full SQL solution and leave processing on the database engine. Specifically, join the Postcodes table and PostcodeLatLong csv data (of course, import into MySQL database). Then use MySQL's OUTFILE facility. Altogether, you only have to run dbGetQuery() once without parallel looping:

expotCSV <- dbGetQuery("SELECT FID, Postcode2, (6371 * acos( cos( 
                     radians(Postcodecsv.col6))*cos(radians(latitude))*cos(radians(Longitude)-
                     radians(Postcodecsv.col5))+sin(radians(Postcodecsv.col6))* sin(radians(latitude)))) 
                     AS distance     

              FROM Postcodes 
              INNER JOIN PostcodeLatLong ON csv.joinfield = PostcodeLatLong.joinfield
              HAVING distance < 2 
              ORDER BY distance

              INTO OUTFILE 'C:/Users/Henry Crosby/Desktop/2km.csv' 
              FIELDS ENCLOSED BY '\"' 
              TERMINATED BY ';' 
              ESCAPED BY '\"' 
              LINES TERMINATED BY '\\r\\n';")

The challenge is finding the joinfield or relationship between the two tables. However, SQL also allows cartesian product or the cross join (for seemingly unrelated tables) where a query returns the total combination set between listed tables. You can use a where clause by some filtering clause:

FROM Postcodes, PostcodeLatLong 
WHERE Postcodes.somefactor = "..." AND PostcodeLatLong.somefactor = "..."

Alternatively, MySQL's CROSS JOIN (JOIN with no ON clause)

FROM Postcodes CROSS JOIN PostcodeLatLong 
WHERE Postcodes.somefactor = "..." AND PostcodeLatLong.somefactor = "..."
Parfait
  • 104,375
  • 17
  • 94
  • 125