I am trying to merge a large data.frame
with a small one, and to parallelise the computation. The code below works perfectly, fully utilising all the cores of my machine:
library(doParallel)  # also attaches foreach and parallel

len <- 2000000
set.seed(666)
# create a vector of 3-character strings, e.g. "a5k"
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
# merge bigDF with smallDF in chunks of 10000 rows, one chunk per task
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
stopCluster(cl)
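(For reference, foreach returns the per-chunk results as a list; I stitch them back together afterwards with something along these lines, and that step behaves the same in both cases:)

# recombine the per-chunk merges into a single data.frame
merged <- do.call(rbind, mergedList)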
Once I change the vector dat to contain strings that are 6 characters long, parallelism breaks down, and although there is no error or warning, only one core contributes to the computation:
len <- 2000000
set.seed(666)
# create a vector of 6-character strings, e.g. "a5kqxz"
dat <- paste(sample(letters, len, replace = TRUE),
             sample(0:9, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE),
             sample(letters, len, replace = TRUE), sep = '')
head(dat)
set.seed(777)
num <- sample(0:9, len, replace = TRUE)
bigDF <- data.frame(dat = dat, num = num)
smallDF <- data.frame(num = 0:9, caps = toupper(letters[1:10]))

startP <- 1
chunk <- 10000
nodes <- detectCores()
cl <- makeCluster(nodes)
registerDoParallel(cl)
# same chunked merge as before
mergedList <- foreach(i = 0:(len/chunk - 1)) %dopar% {
  tmpDF <- bigDF[(startP + i * chunk):(startP - 1 + (i + 1) * chunk), ]
  merge(tmpDF, smallDF, by = 'num', all.x = TRUE)
}
stopCluster(cl)
Why this inconsistency, and how can one work around it? In this particular example, the code works if one indexes dat to integers (a sketch of that workaround follows below). But indexing is not the answer in all cases. Why would the length of the strings matter at all to the number of cores utilised?
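For concreteness, here is a minimal sketch of the integer-indexing workaround, assuming "indexing" means mapping each string to its position in a lookup table via match() (the datLevels name is mine):

# Hypothetical sketch of the workaround: swap the string column for
# integer codes before chunking, and translate back after the merge.
datLevels <- unique(dat)                  # lookup table: code -> string
bigDF$dat <- match(bigDF$dat, datLevels)  # integer codes instead of strings

# ... run the same foreach/%dopar% loop as above ...

# afterwards, restore the strings in each merged chunk, e.g.:
# mergedList <- lapply(mergedList, function(d) { d$dat <- datLevels[d$dat]; d })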