
I am trying to merge a large data.frame with a small one, and to parallelise the computation. The code below works perfectly, maximising all the cores of my machine:

library(doParallel)  # also loads foreach and parallel (detectCores, makeCluster)

len <- 2000000
set.seed(666)
# create a vector of 3-character strings: letter, digit, letter
dat <- paste(sample(letters, len, rep = TRUE), sample(0:9, len, rep = TRUE),
             sample(letters, len, rep = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = T)
bigDF <-  data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))
startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
    tmpDF = bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
    merge(tmpDF, smallDF, by = 'num', all.x = T)
}
stopCluster(cl)

Once I change the vector dat to contain longer strings (6 characters in the code below), parallelism breaks down, and although there is no error or warning, only one core contributes to the computation:

len <- 2000000
set.seed(666)
# create a vector of 6-character strings: letter, digit, then four letters
dat <- paste(sample(letters, len, rep = TRUE), sample(0:9, len, rep = TRUE),
             sample(letters, len, rep = TRUE), sample(letters, len, rep = TRUE),
             sample(letters, len, rep = TRUE), sample(letters, len, rep = TRUE),
             sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = T)
bigDF <-  data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))
startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
    tmpDF = bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
    merge(tmpDF, smallDF, by = 'num', all.x = T)
}
stopCluster(cl)

Why this inconsistency, and how can one work around it? In this particular example, the code works if one indexes dat to integers (see the sketch below for what I mean), but indexing is not the answer in all cases. Why would the length of the strings matter at all to the number of cores utilised?
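To be concrete, by "indexing dat to integers" I mean something along these lines (only a sketch; datLevels and datIdx are names used here for illustration):

datLevels <- unique(dat)            # lookup table of the distinct strings
datIdx    <- match(dat, datLevels)  # small integer code per row
bigDF     <- data.frame(datIdx = datIdx, num = num)
# ... run the same foreach/%dopar% merge on this bigDF ...
# the original strings can be recovered later with datLevels[datIdx]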

Audrey
  • Are the child R processes being spawned for the non-working cases? How is your free memory? – Jesse Anderson Sep 26 '14 at 16:58
  • It appears I can reproduce this problem on Win7-64bit, R3.1.1 . Tons of free RAM; the Rscript children never start up. More news later :-) – Carl Witthoft Sep 26 '14 at 17:21
  • @blindJesse I have gigabytes of free RAM, so that's not the issue. @Carl: My system specs are identical to yours. Note that, bizarrely, if `dat` consists of 4-long strings there is partial contribution to the computation by a second core. For 5-long strings and higher, only a single core is working. – Audrey Sep 29 '14 at 09:01

2 Answers


I believe the difference is that in the first case, the first column of "bigDF" is a factor with 6,760 levels, while in the second case it has 1,983,234 levels. Having a huge number of levels can cause a number of performance problems. When I created "bigDF" with stringsAsFactors=FALSE, the performance was much better.

bigDF <- data.frame(dat=dat, num=num, stringsAsFactors=FALSE)

I also used the "isplitRows" function from the itertools package to avoid sending all of "bigDF" to each of the workers:

library(itertools)
mergedList <- foreach(splitDF=isplitRows(bigDF, chunkSize=chunk)) %dopar% {
    merge(splitDF, smallDF, by = 'num', all.x = T)
}

On my 6 core Linux machine running R 3.1.1, your second example ran in about 332 seconds. When I used stringsAsFactors=FALSE, it ran in about 50 seconds. When I also used isplitRows, the time went down to 5.5 seconds, or about 60 times faster than your second example.
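Putting the two changes together, the version I timed looked roughly like this (a sketch only; it assumes the objects from the question are already defined):

library(doParallel)
library(itertools)

bigDF <- data.frame(dat = dat, num = num, stringsAsFactors = FALSE)

cl <- makeCluster(detectCores())
registerDoParallel(cl)
mergedList <- foreach(splitDF = isplitRows(bigDF, chunkSize = chunk)) %dopar% {
    merge(splitDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)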

Steve Weston
  • That may be part of it, since running either case NOT in parallel (using `%do%` instead of `%dopar%`) took almost no time to complete on my i7 machine. Perhaps the thing that's taking all the time is allocating those factor levels to the slave cores. Guess we should do a profile, and try again with the column converted to "character". – Carl Witthoft Sep 27 '14 at 12:09
  • Thanks Steve. isplitRows is definitely worth a look. However, what I am specifically interested in is maximising computational output from all cores, rather than only reducing the system time. @CarlWitthoft: the same applies to %do%, which I had already tried and which is quicker. Characters are indeed quicker to compute on than factors, but still only 1 core is deployed. – Audrey Sep 29 '14 at 08:55
  • Confirmed - `isplitRows()` is a neat function but has no impact on the number of cores involved in the computation. – Audrey Sep 29 '14 at 10:12
  • Odd: when I try the original setup but convert `bigDF$dat` and `smallDF$caps` to `character` class, at most I get two cores going active. When I get a chance I'm going to try `mclapply` on these. – Carl Witthoft Sep 29 '14 at 12:10
  • @SteveWeston `stringsAsFactors = F` seems to work for me, with all cores engaging!? Limited only by the cost of character strings (as opposed to factors) on resources, I suppose. – Audrey Sep 29 '14 at 21:35

Not an answer yet, but: if I run your code using %do%, so as not to parallelise, I get identical (successful) results for the two cases, except of course for the dat strings. The same holds if I run the short names with %dopar% and the long names with %do% (see the sketch below).
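For reference, the sequential check is just your loop with %do% in place of %dopar% (a minimal sketch; mergedList_seq is only an illustrative name):

library(foreach)
# same loop as in the question, but run sequentially on the master process
mergedList_seq <- foreach(i = 0:(len/chunk - 1)) %do% {
    tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
    merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}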

This is beginning to look like a subtle bug in one of the supporting packages, so you might want to ping the developers on this one.

Update 29 Sept: I ran what I believe is the same setup, but using clusterMap():

dffunc <- function(i, bigDF, smallDF, startP, chunk) {
    tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
    merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}


clusmerge <- clusterMap(cl, dffunc, 0:(len/chunk - 1),
                        MoreArgs = list(bigDF = bigDF, smallDF = smallDF,
                                        startP = startP, chunk = chunk))

And in this case I get all the nodes up and running regardless of the length of the dat strings. I'm back to suspecting there's some bug in %dopar% or elsewhere in the foreach package.

As a side note, may I recommend against doing

nodes <- detectCores()
cl <- makeCluster(nodes)

as that can hang your entire machine. Better: cl <- makeCluster(nodes - 1) :-)
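A minimal sketch of that setup, leaving one core free for the master R process and the rest of the machine:

library(doParallel)

nodes <- max(1, detectCores() - 1)  # keep one core free
cl <- makeCluster(nodes)
registerDoParallel(cl)
# ... %dopar% work goes here ...
stopCluster(cl)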

Carl Witthoft
  • +1 for the sensible `makeCluster(nodes-1)` :-). clusterMap() gives me an `Error in checkForRemoteErrors(val)`. I just ran the code with `bigDF <- data.frame(dat = dat, num = num, stringsAsFactors = F)` and all cores seemed to engage, as @SteveWeston suggests - see comment below. – Audrey Sep 29 '14 at 21:40
  • I've never heard of `makeCluster(detectCores())` hanging a Linux or Mac. Since the master is not performing any computations, it can make a lot of sense to start one worker per core, which is what "mclapply" did by default in the multicore package. Are you saying that it can hang the call to "makeCluster" or to the subsequent parallel operation? And have you seen a hang on anything other than Windows? – Steve Weston Sep 29 '14 at 23:43
  • @SteveWeston I exaggerated *slightly* : since the Rscript cluster is hogging 99.99% of the CPU available, pretty much everything else is "on hold" waiting for a chance to get a few cycles. Yes, the machine returns to normal when the cluster is done, but in the meantime most processes have to wait, and wait, and wait... (cue Rick's Cafe) – Carl Witthoft Sep 30 '14 at 11:30
  • I see what you mean. Using 99.99% of the cores on a cluster or dedicated workstation is considered a good thing, but it's rather annoying on your personal laptop. – Steve Weston Sep 30 '14 at 12:10