I am running caret's train function on a cluster via doRedis. For the most part it works, but every so often I get errors like the following at the very end:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

and

Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
  attempt to set an attribute on NULL

When I run traceback() I get:

5: nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method, 
       ppOpts = preProcess, ctrl = trControl, lev = classLevels, 
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)
1: caret::train(couple ~ ., training.balanced, method = "nnet", 
       preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)

These errors are not easily reproducible (i.e. they happen sometimes, but not consistently) and only occur at the end of the run. The stdout on the cluster shows all tasks running and completed, so I am a bit flummoxed.

Has anyone encountered these errors? If so, do you understand the cause, and better yet, is there a fix?
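
For reference, my setup is roughly along these lines. This is a simplified sketch rather than my exact script: the queue name and Redis host are placeholders, the workers are started separately on the cluster machines with redisWorker(), and the tuning grid values here are illustrative only.

library(doRedis)
library(caret)

## Register the doRedis parallel backend; "jobs" and "redis-host" are
## placeholders for the actual queue name and Redis server. Workers are
## started separately on the cluster nodes with redisWorker("jobs", ...).
registerDoRedis("jobs", host = "redis-host")

## training.balanced is my (class-balanced) training data frame with the
## outcome `couple`; the grid values below are illustrative only.
nnetGrid <- expand.grid(size = c(5, 10, 15), decay = c(0, 0.1, 0.5))

fit <- caret::train(couple ~ ., data = training.balanced, method = "nnet",
                    preProcess = "range", tuneGrid = nnetGrid, MaxNWts = 2200)

removeQueue("jobs")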

Richie Cotton
Tommy Levi

1 Answer

I imagine you've already solved this problem, but I ran into the same issue on my cluster, which consists of Linux and Windows machines. I was running the Redis server on Ubuntu 14.04 and had noticed the warning, when starting the redis-server service, about transparent huge pages being enabled in the Linux kernel. I ignored that message and began running training jobs with most of the machines maxed out with workers. I received the same error at the end of the run:

error calling combine function:
<simpleError: obj$state$numResults <= obj$state$numValues is not TRUE>

After a lot of head scratching and useless tinkering, I decided to address that warning by following the instructions here: http://ubuntuforums.org/showthread.php?t=2255151

Essentially, I installed hugeadm using:

sudo apt-get install hugeadm

Then I disabled transparent huge pages using:

hugeadm --thp-never

Note that this change will be undone on restart of the computer.
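
If it's useful, you can check from an R session that the change took effect; /sys/kernel/mm/transparent_hugepage/enabled is the kernel file this setting lives in, and the active value is shown in brackets:

## Read the current transparent-huge-pages setting from sysfs; after
## `hugeadm --thp-never` the bracketed value should be "never",
## e.g. "always madvise [never]".
thp <- tryCatch(readLines("/sys/kernel/mm/transparent_hugepage/enabled"),
                error = function(e) NA_character_)
print(thp)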

When I re-ran my training process, it completed without any errors.

Hope that helps.

Cheers, Eric

Eric C.
  • Links may disappear in the future. Please edit your answer to reflect the solution the link provides. You risk the deletion of the answer for just being a link-only answer if you don't. – Michael Haidl Jan 28 '15 at 06:48
  • Unfortunately, I still get the error message even after the fix I suggested above, although much less frequently now. I have noticed that this error seems to occur when the workers are maxing out my LAN/Wi-Fi home network. I've been able to reduce the occurrence of the problem even further by using fewer workers. Also, jobs that fail will run successfully if I change things so that all of the workers are local to the rsession that initiated the job (see the sketch below). I haven't tried it yet, but I also believe it would work fine if the workers were run on the redis-server machine. – Eric C. Feb 20 '15 at 02:57
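
A minimal sketch of the local-workers variant described in that comment, assuming doRedis's startLocalWorkers helper; the queue name, worker count, and Redis host are placeholders:

library(doRedis)

## Start the workers on this machine only (the one running the R session)
## and register the same queue as the parallel backend; "jobs", 3 and
## "redis-host" are placeholders.
startLocalWorkers(n = 3, queue = "jobs", host = "redis-host")
registerDoRedis("jobs", host = "redis-host")

## ... run caret::train() as before, then clean up:
removeQueue("jobs")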