
I have to cycle through several sets of data stored separately in individual csv files - about 1500 files. Each file represents the amount of activity for a given day, and the amount of data in each file varies from basically nothing to huge.

My process is: I load the data into R, preprocess and clean it, and then insert it into a local MySQL table. This works fine running as a single loop.

To increase the speed of the load, I attempted to move the loop into a parallel process, where several workers preprocess separate files and then each loads them into the database.

I run into an error when some of the files are large and the dbWriteTable call takes some time to process. This appears to put a lock on the table, preventing the other processes from completing their writes and leading to timeout errors.

I am fairly new to MySQL, and I am vaguely aware I could change the MySQL timeout at the database level (via the conf files, I believe). Is this the best/only solution? I would prefer keeping all the control at the R level, so I could run this code on different machines more easily. I was wondering if there is another approach to keeping the processes in a queue and writing the loads in order as the table becomes available.
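For what it's worth, the lock-wait timeout can also be raised per session from the R side, without touching the server's conf files. A minimal sketch using DBI (the connection details below are placeholders, standing in for whatever the login script does):

```r
library(DBI)

# Hypothetical connection, standing in for MYSQL_LOGIN.r:
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb",
                 user = "user", password = "password")

# Raise this session's InnoDB lock-wait timeout (in seconds) so a worker
# waits longer for the locked table instead of erroring out:
dbExecute(con, "SET SESSION innodb_lock_wait_timeout = 600")
```

This keeps the control at the R level, though it only postpones the contention rather than removing it.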

Update:

I thought this was a simple question on how to handle timeouts, but it sounds like more information is desired. Here is one example - it's not complicated.

  Bluetooth.loadloop <- foreach(i = jump:LoopLength, .inorder = TRUE) %dopar% {

    Start.Time <- Sys.time()
    x <- filenames[i]
    datalist <- readLines(x, encoding = "UTF-8")  # read once, reused below

    if (!length(datalist)) {
      Stats <- data.frame(
        i = i,
        time = difftime(Sys.time(), Start.Time, units = 'secs'),
        DataList = 0)

      source("/Shared_Functions/MYSQL_LOGIN.r")
      invisible(dbWriteTable(con, name = "RawData_Loadloop_Stats_Bluetooth", value = Stats, append = TRUE, row.names = FALSE))
      dbDisconnect(con)
      print("Zero length")

      return("Zero length")
    } else {

      ### Loads data from the CSV
      print(paste(i, "Start"))
      filelocation <- paste(i, x)
      datalist.length <- length(datalist)
      print(datalist.length)

      ### Processes the individual fields
      deviceID <- v1 <- gsub("([0-9]+),.*", "\\1", datalist)
      Time <- v2 <- gsub("[0-9]+,([0-9]+),.*", "\\1", datalist)
      Label <- v3 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\1", datalist)
      MacAddress <- v4 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\2", datalist)
      Strength <- v5 <- gsub("^[0-9]+,[0-9]+,(.*),([a-zA-Z0-9:]+),([^,]+)$", "\\3", datalist)

      Label <- BlueToothFilterRules(Label)
      Encoding(Label) <- 'UTF-8'
      BlueToothTable <- data.frame(i = i, DeviceID = deviceID, Time = Time, Label = Label, Mac = MacAddress, Strength = Strength, stringsAsFactors = FALSE)

      Stats <- data.frame(
        i = i,
        time = difftime(Sys.time(), Start.Time, units = 'secs'),
        DataList = datalist.length
      )

      ### Writes to the database
      source("/Shared_Functions/MYSQL_LOGIN.r")
      dbSendQuery(con, 'set character set "utf8"')
      invisible(dbWriteTable(con, name = "RawData_Events_Bluetooth", value = BlueToothTable, append = TRUE, row.names = FALSE))
      invisible(dbWriteTable(con, name = "RawData_Loadloop_Stats_Bluetooth", value = Stats, append = TRUE, row.names = FALSE))
      dbDisconnect(con)
      print(paste(i, "END"))

      return("Finished")
    }
  }

So when this is run in parallel, like I said, it bottlenecks at the write, leading to timeouts.

I am not sure how to process what user3666197 said, other than that it is good to be prepared for these faults and to handle them with well-engineered code.

Separating the read-in from the write is not a realistic option either, because the resulting object would be too large to handle well.
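That said, if memory is the only obstacle, the read-in can be bounded by streaming each file in fixed-size chunks. A sketch (`handle_chunk` is a hypothetical callback standing in for the parse-and-write step):

```r
# Process a file in chunks so the whole parsed object never sits in memory.
process_in_chunks <- function(path, chunk_size = 100000, handle_chunk) {
  fcon <- file(path, open = "r", encoding = "UTF-8")
  on.exit(close(fcon))
  repeat {
    lines <- readLines(fcon, n = chunk_size)
    if (length(lines) == 0) break
    handle_chunk(lines)  # e.g. parse the lines, then dbWriteTable(..., append = TRUE)
  }
}
```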

I understand that running this in parallel will provide limited improvements because of the bottleneck, but it will decrease the total processing time by allowing the other workers to load and process the raw data while one is writing everything.

So my original question still stands - how do I best handle the MySQL write queue so that running this in parallel doesn't create timeout errors?
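One pattern that keeps everything at the R level is to wrap each write in a retry loop with exponential backoff, so a worker that finds the table locked waits and tries again instead of timing out. A sketch (`write_fn` is any zero-argument function wrapping the actual `dbWriteTable` call):

```r
# Retry a write with exponential backoff instead of failing on a lock timeout.
write_with_retry <- function(write_fn, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(list(ok = TRUE, value = write_fn()),
                       error = function(e) list(ok = FALSE, err = e))
    if (result$ok) return(result$value)
    if (attempt == max_tries) stop(result$err)
    Sys.sleep(wait * 2^(attempt - 1))  # back off: 1s, 2s, 4s, ...
  }
}

# Usage inside the loop:
# write_with_retry(function()
#   dbWriteTable(con, name = "RawData_Events_Bluetooth",
#                value = BlueToothTable, append = TRUE, row.names = FALSE))
```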

Thank you

Manny Wilson

1 Answer


With no code in an MCVE-formulated question, let's go ahead without any code in the answer either.

While the due mathematical formulations belong in the tail section of this answer, a human-readable story might be a better approach to getting a solid grasp of the root cause of the principal TimeOut problem:

Using a parallel syntax-constructor?
Syntactically possible in many places and many fashions, but ...

The point here ( always ) is whether it makes any sense. To answer that, one has to know the sum of all the costs of using it.

Since our core motivation is a processing speedup, the costs are the overheads in time we must pay before we can enjoy the "accelerated" results.
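The O/P's "limited improvements" intuition can be made quantitative with Amdahl's law, which these cost formulations generalise. A minimal sketch of the classic formula (my assumption about the formulations alluded to above), with p the parallelisable fraction of the work and n the number of workers:

```r
# Classic Amdahl's law: the maximum speedup of a workload whose
# parallelisable fraction is p, when run on n workers.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)

amdahl(0.5, 4)  # if the serialized write is half the work: at most 1.6x on 4 workers
```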


Let's sketch up an example - a Cape Cod airport Story:

A Lord Mayor of a city decided to build a new highway, so as to allow citizens to ride faster to an airport at the tip of Cape Cod.

This sounds logical and great. So they started investing: they paid hundreds of landlords for all the newly acquired plots ( bought the property on which to raise the new road ), stone miners to excavate all the needed rock, millers to grind it into right-sized stones, sand, cement and asphalt, everything needed for the highway; bulldozer operators to lay the road's rock base; concrete vendors to mix and deliver the cement in a Just-In-Time fashion as the highway construction moved forwards, etc., etc.

After some months or years ( depending on the local economy, the contractors' performance, the city budget and its distance from the airport ... sure ), the highway was complete and the opening ceremony took place.

Two things were sure at this moment -- The Cost -- how much money it took to build such a highway, plus -- The Latency -- how long it took before the first car could ride from the city all the way to the airport.

The last thing is -- The net Speedup achieved -- so how much faster is this new, 100-parallel-lane-wide highway, compared to the old countryside road, lovely and used for ages?

The Result?

A known performance baseline first:

The cars A, B, C, D, ... X, Y, Z all started their way to the airport some months ( years ) ago, drove comfortably at a speed of ~40 mph and all got to the airport in about one hour, one after another, in a purely [SERIAL] manner. Nothing sexy, but a solid, repeatable result: next time, they will enjoy almost the same result again.

And the new parallel-riders, to compare with:

The cars A1, A2, A3, ... A100, B101, ... B200, ... Z2600 all have the same goal, to arrive at the airport as fast as possible, yet they had to wait a few months ( years ) for the new highway to be built ( an obvious processing-setup latency ); next, it took only a few minutes to enjoy the super-fast ride on this new highway, with a Germany-like absence of any speed limit, so 200 mph? 300 mph? No problem.

Until the thrilling 3-minute ultra-fast experience of driving cars at 300 mph on a 100-parallel-lane-wide New Highway reached the coastline, and each and every car had to enter a queue and wait for the next Ferry-boat arrival ( a boat with a capacity of only about 30 cars, also serving all the traditional local traffic ).

Given the Ferry line has a service RTT ( Round Trip Time: loading the first 30 cars, going to Cape Cod, unloading them and loading those who go back, then returning from Cape Cod to the continental port ) of about 40 minutes, the game is almost over.

Results:
- the whole group of A:Z has made the task in T0 +01:00:00 time
- the first subset of <=30-parallel-riders has made the same in T0 +8760:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8760:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:03:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:23:00
- the next subset of <=30-parallel-riders has made the same in T0 +8761:43:00
- the next subset of <=30-parallel-riders has made the same in T0 +8762:03:00
...
- the last subset of <=30-parallel-riders has made the same in T0 +8788:43:00

all this under the conditions given:
- no local traffic ever appeared during these roughly 29 hours
- the first Ferry was both empty and ready to start loading the first set of ~30 cars right upon their arrival in port
- no Ferry-boat service interruptions or timetable irregularities ever appeared


The End of a Story:

No "better" result is possible in the real world; no better result is possible in a resource-constrained parallel-computing-graph execution on real hardware.

Epilogue:

Having a single "just"-[CONCURRENT] element in an otherwise true-[PARALLEL] setup can be this expensive. Here, fortunately, all the costs of ~ x.000.000.000 USD and ~ 8760+ hours were magically "free of charge",
but
real computing is never so forgiving,
and the code will always pay them all ( design-wise, code-execution-wise ), so due engineering effort and proper design care are always in order, so as not to get any << 1 speedup surprises.

user3666197