25

I'm fitting an LDA model with a lot of data using scikit-learn. The relevant piece of code looks like this:

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_topics = n_topics, 
                                max_iter = iters,
                                learning_method = 'online',
                                learning_offset = offset,
                                random_state = 0,
                                evaluate_every = 5,
                                n_jobs = 3,
                                verbose = 0)
lda.fit(X)

(I guess the only possibly relevant detail here is that I'm using multiple jobs.)

After some time I get a "No space left on device" error, even though there is plenty of space on the disk and plenty of free memory. I tried the same code several times, on two different computers (my local machine and a remote server), first with Python 3, then with Python 2, and each time I ended up with the same error.

If I run the same code on a smaller sample of data everything works fine.

The entire stack trace:

Failed to save <type 'numpy.ndarray'> to .npy file:
Traceback (most recent call last):
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 271, in save
    obj, filename = self._write_array(obj, filename)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 231, in _write_array
    self.np.save(filename, array)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/npyio.py", line 491, in save
    pickle_kwargs=pickle_kwargs)
  File "/home/ubuntu/anaconda2/lib/python2.7/site-packages/numpy/lib/format.py", line 584, in write_array
    array.tofile(fp)
IOError: 275500 requested and 210934 written


IOErrorTraceback (most recent call last)
<ipython-input-7-6af7e7c9845f> in <module>()
      7                                 n_jobs = 3,
      8                                 verbose = 0)
----> 9 lda.fit(X)

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in fit(self, X, y)
    509                     for idx_slice in gen_batches(n_samples, batch_size):
    510                         self._em_step(X[idx_slice, :], total_samples=n_samples,
--> 511                                       batch_update=False, parallel=parallel)
    512                 else:
    513                     # batch update

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _em_step(self, X, total_samples, batch_update, parallel)
    403         # E-step
    404         _, suff_stats = self._e_step(X, cal_sstats=True, random_init=True,
--> 405                                      parallel=parallel)
    406 
    407         # M-step

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/online_lda.pyc in _e_step(self, X, cal_sstats, random_init, parallel)
    356                                               self.mean_change_tol, cal_sstats,
    357                                               random_state)
--> 358             for idx_slice in gen_even_slices(X.shape[0], n_jobs))
    359 
    360         # merge result

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    808                 # consumption.
    809                 self._iterating = False
--> 810             self.retrieve()
    811             # Make sure that we get a last message telling us we are done
    812             elapsed_time = time.time() - self._start_time

/home/ubuntu/anaconda2/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    725                 job = self._jobs.pop(0)
    726             try:
--> 727                 self._output.extend(job.get())
    728             except tuple(self.exceptions) as exception:
    729                 # Stop dispatching any new job in the async callback thread

/home/ubuntu/anaconda2/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    565             return self._value
    566         else:
--> 567             raise self._value
    568 
    569     def _set(self, i, obj):

IOError: [Errno 28] No space left on device
machaerus
  • It will probably work without multiprocessing (`n_jobs=1`). I'm not sure which path scikit-learn uses for its temp data. How big is your tmp partition? – sascha Oct 18 '16 at 19:38
  • Thanks @sascha, I'll try with one process only. If `tmpfs` is the tmp partition (I think it is?), then it's 1.6 GB. Can this be the problem? If so, is there any workaround? – machaerus Oct 18 '16 at 20:20

7 Answers

36

I had the same problem with LatentDirichletAllocation. It seems that you are running out of shared memory (/dev/shm when you run df -h). Try setting the JOBLIB_TEMP_FOLDER environment variable to something different, e.g. /tmp. In my case this solved the problem.

Or just increase the size of the shared memory, if you have the appropriate rights on the machine you are training the LDA on.
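
As a quick sanity check of this diagnosis, you can inspect the shared-memory mount from Python (a minimal sketch, assuming Python 3.3+ and a Linux box with /dev/shm mounted, which is where joblib prefers to memmap large arrays when it has room):

import shutil

# report how much space is left on the shared-memory mount
total, used, free = shutil.disk_usage('/dev/shm')
print('/dev/shm: {:.0f} MiB free of {:.0f} MiB'.format(free / 2**20, total / 2**20))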

silentser
  • 15
    This worked for me. Using iPython in a docker container, trying to validate a model like: `best_knn_clf = KNeighborsClassifier(weights='distance', n_neighbors=4, n_jobs=-1) scores = cross_val_score(best_knn_clf, X_train_expanded, y_train_expanded, cv=3, n_jobs=-1, verbose=3)`. Added `%env JOBLIB_TEMP_FOLDER=/tmp` in notebook did the trick. – Kallin Nagelberg Aug 25 '17 at 16:46
10

This problem occurs when shared memory is exhausted and no further I/O is possible. It is a frustrating problem that hits many Kaggle users while fitting machine learning models.

I overcame it by setting the JOBLIB_TEMP_FOLDER environment variable with the following code:

%env JOBLIB_TEMP_FOLDER=/tmp
abhinav
3

@silentser's solution solved the problem for me.

If you want to set the environment variable in code, do this:

import os
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'
Minions
1

Since joblib 1.3 you can use parallel_config to set up the temp folder:

from joblib.parallel import parallel_config

# joblib Parallel calls made inside this block pick up these settings
with parallel_config(backend='threading', temp_folder='/tmp'):
    lda.fit(X)  # e.g. the fit from the question
0

This happens because you have set n_jobs=3. You could set it to 1; then shared memory will not be used, even though learning will take longer. You can also choose a joblib temp dir as in the answers above, but bear in mind that this cache can quickly fill up your disk as well, depending on the dataset, and the extra disk traffic can slow your job down.
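
For completeness, a minimal sketch of the single-process variant (reusing the constructor arguments from the question; n_topics, iters, offset, and X are assumed to be defined as there):

from sklearn.decomposition import LatentDirichletAllocation

# a single worker keeps everything in-process, so joblib never memmaps to /dev/shm,
# at the cost of slower training
lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=iters,
                                learning_method='online',
                                learning_offset=offset,
                                random_state=0,
                                evaluate_every=5,
                                n_jobs=1,
                                verbose=0)
lda.fit(X)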

Artem Trunov
0

I know it's kind of late, but I got over this problem by setting learning_method = 'batch'.

This could present other issues, such as longer training times, but it alleviated the problem of running out of space in shared memory.

Alternatively, a smaller batch_size could be set, although I have not tested this myself.
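
A minimal sketch of that batch variant, reusing the hypothetical n_topics, iters, and X from the question (learning_offset only applies to online learning, so it is dropped here):

from sklearn.decomposition import LatentDirichletAllocation

# batch mode runs EM over the full dataset each iteration instead of online mini-batches
lda = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=iters,
                                learning_method='batch',
                                random_state=0,
                                evaluate_every=5,
                                verbose=0)
lda.fit(X)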

Billy.G
0

I had the same problem when running inside Docker. I spent hours trying to solve it; it turned out I was missing permission to the server's NAS.

Here are a few things you could try:

  • decrease the batch size
  • add more shared memory
  • check that you actually have access to that memory/storage

By the way, if you are running in Docker, the default shared-memory (/dev/shm) size is 64 MB, so you need to specify a larger size when starting the container.

Add this to your docker run command: --shm-size=64g

e.g.: sudo docker run --gpus all --cpuset-cpus 0-63 --shm-size=64g -it YOUR_IMAGE

PIPI