I have to cycle through about 1,500 separate CSV files. Each file holds one day's worth of activity, and the amount of data per file ranges from essentially nothing to very large.
My process is: load the data into R, preprocess and clean it, then insert it into a local MySQL table. This works fine running as a single sequential loop.
To speed up the load, I moved the loop into a parallel process, with several workers each preprocessing separate files and then loading them into the database.
I run into an error when some of the files are large and the dbWriteTable call takes a while to complete. This appears to put a lock on the table, so the other workers cannot finish their writes and they hit timeout errors.
I am fairly new to MySQL, and I am vaguely aware that I could raise the MySQL timeout at the database level (in the conf files, I believe). Is that the best/only solution? I would prefer to keep all of the control at the R level, so I can run this code on different machines more easily. Is there another approach to keeping the workers queued up and writing their loads in order as the table becomes available?
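For example, would something along these lines be a sane way to raise the timeout per connection from R, or is editing the conf files really the proper route? (This is just a sketch of what I mean by keeping it at the R level; the variable name is my assumption that the errors are InnoDB lock wait timeouts, which I have not verified against my error messages.)

# Hypothetical: raise the lock wait timeout for this session only, right after
# the connection is opened. Assumes the errors are InnoDB
# "Lock wait timeout exceeded" errors; 600 seconds is an arbitrary choice.
source("/Shared_Functions/MYSQL_LOGIN.r")
dbSendQuery(con, "SET SESSION innodb_lock_wait_timeout = 600")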
Update:
I thought this was a simple question about handling timeouts, but it sounds like more information is wanted. Here is one example; it's not complicated.
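The parallel backend is registered before the loop, roughly as below (doParallel shown just as an example backend; the worker count varies by machine, and I don't think the backend choice matters for the question):

library(foreach)
library(doParallel)
library(RMySQL)        # each worker opens and closes its own connection inside the loop

cl <- makeCluster(4)   # worker count varies by machine
registerDoParallel(cl)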
Bluetooth.loadloop <- foreach(i = jump:LoopLength, .inorder = TRUE) %dopar% {
  Start.Time <- Sys.time()
  x <- filenames[i]

  if (!length(readLines(x))) {
    ### Empty file: log a zero-length entry in the stats table and move on
    Stats <- data.frame(
      i = i,
      time = difftime(Sys.time(), Start.Time, units = 'secs'),
      DataList = 0
    )
    source("/Shared_Functions/MYSQL_LOGIN.r")
    invisible(dbWriteTable(con, name = "RawData_Loadloop_Stats_Bluetooth", value = Stats, append = TRUE, row.names = FALSE))
    dbDisconnect(con)
    print("Zero length")
    return("Zero length")
  } else {
    ### Loads data from the CSV
    print(paste(i, "Start"))
    datalist <- readLines(x, encoding = "UTF-8")
    filelocation <- (paste(i, x))
    datalist.length <- length(datalist)
    print(datalist.length)

    ### Processes the individual fields
    deviceID   <- v1 <- gsub("([0-9]+),.*", "\\1", datalist)
    Time       <- v2 <- gsub("[0-9]+,([0-9]+),.*", "\\1", datalist)
    Label      <- v3 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\1", datalist)
    MacAddress <- v4 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\2", datalist)
    Strength   <- v5 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\3", datalist)

    Label <- BlueToothFilterRules(Label)
    Encoding(Label) <- 'UTF-8'

    BlueToothTable <- data.frame(i = i, DeviceID = deviceID, Time = Time, Label = Label, Mac = MacAddress, Strength = Strength, stringsAsFactors = FALSE)
    Stats <- data.frame(
      i = i,
      time = difftime(Sys.time(), Start.Time, units = 'secs'),
      DataList = datalist.length
    )

    ### Writes to the database
    source("/Shared_Functions/MYSQL_LOGIN.r")
    dbSendQuery(con, 'set character set "utf8"')
    invisible(dbWriteTable(con, name = "RawData_Events_Bluetooth", value = BlueToothTable, append = TRUE, row.names = FALSE))
    invisible(dbWriteTable(con, name = "RawData_Loadloop_Stats_Bluetooth", value = Stats, append = TRUE, row.names = FALSE))
    dbDisconnect(con)
    print(paste(i, "END"))
    return("Finished")
  }
}
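To make the parsing step easier to follow, here is a made-up sample line (not real data) and what each regex pulls out of it. The layout is inferred from the regexes: deviceID, timestamp, label, MAC address, signal strength, where the label can itself contain commas.

# Made-up example line, just for illustration:
line <- "17,1365123456,Some Device Name,AA:BB:CC:DD:EE:FF,-68"
gsub("([0-9]+),.*", "\\1", line)                                   # "17"                (deviceID)
gsub("[0-9]+,([0-9]+),.*", "\\1", line)                            # "1365123456"        (Time)
gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\1", line)  # "Some Device Name"  (Label)
gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\2", line)  # "AA:BB:CC:DD:EE:FF" (Mac)
gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\3", line)  # "-68"               (Strength)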
So when this is run in parallel, as I said, it bottlenecks at the write, leading to timeouts.
I am not sure how to act on what user3666197 said, other than that it's good to be prepared for these faults and to handle them in a well-engineered way.
Separating the read-in from the write is not a realistic option either, because the resulting object would be too large to handle well.
I understand that running this in parallel will provide only limited improvement because of the bottleneck, but it will still decrease the total processing time by letting the other workers read in and process their raw data while one worker is writing.
So my original question stands: how do I best handle the MySQL write queue so that running this in parallel doesn't create timeout errors?
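To make the question concrete: would something like this crude retry-and-back-off wrapper around the writes be a reasonable direction, or is there a cleaner way to queue the writes? (The function below is made up for illustration; it is not code I currently run.)

# Hypothetical helper, just to show the kind of R-level handling I mean:
# retry the write a few times with a pause, on the assumption that the failure
# is another worker holding the table lock.
write_with_retry <- function(con, table, df, max.tries = 10, wait.secs = 30) {
  for (attempt in seq_len(max.tries)) {
    ok <- tryCatch({
      dbWriteTable(con, name = table, value = df, append = TRUE, row.names = FALSE)
      TRUE
    }, error = function(e) {
      message(sprintf("Write to %s failed on attempt %d: %s", table, attempt, conditionMessage(e)))
      FALSE
    })
    if (ok) return(invisible(TRUE))
    Sys.sleep(wait.secs)   # back off and let whichever worker holds the lock finish
  }
  stop(sprintf("Could not write to %s after %d attempts", table, max.tries))
}

# e.g. write_with_retry(con, "RawData_Events_Bluetooth", BlueToothTable)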
Thank you