
I have a large DataFrame whose index is movie_id and whose column headers are tag_id. Each row represents one movie's relevance to each tag:

                     639755209030196  691838465332800  \
46126718359              0.042             0.245
46130382440              0.403             0.3
46151724544              0.032             0.04

Then I do the following:

    from sklearn.metrics import pairwise_distances

    data = df.values
    similarity_matrix = 1 - pairwise_distances(data, data, metric='cosine', n_jobs=-2)
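
Before that call I free everything I can; roughly like this sketch, where the deleted names are hypothetical stand-ins for my actual intermediates:

    import gc

    # raw_ratings / tag_counts are hypothetical names for intermediates
    # left over from building df
    del raw_ratings, tag_counts
    gc.collect()  # prompt CPython to release the freed memory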

There are close to 8,000 unique tags, so data has shape 42588 x 8000. As sketched above, I delete every unnecessary object before that call to free up memory, yet on a machine with 40 GB of RAM I still get this error:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "~/anaconda/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "~/anaconda/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
    pool._maintain_pool()
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
    self._repopulate_pool()
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
    w.start()
  File "~/anaconda/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "~/anaconda/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

What can be the reason? Is the matrix too large? What are my options for avoiding this memory problem?

I am currently using:

python 2.7
scikit-learn              0.15.2               np19py27_0
Red Hat Linux (x86_64) with 4 x 4 cores
add-semi-colons

1 Answer


What version of scikit-learn are you using? And does it run with n_jobs=1? The result should fit in memory: it is 8 * 42588 ** 2 / 1024 ** 3 ≈ 13.5 GB. But the input data is about 2.5 GB (8 * 42588 * 8000 bytes as float64) and will be replicated to each worker process, so with 16 cores you will run into trouble.
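For example, a minimal sketch of the single-process call, assuming data is the float64 array from the question:

    from sklearn.metrics import pairwise_distances

    # n_jobs=1 computes the distances in a single process, so the ~2.5 GB
    # input array is not copied into forked workers and os.fork() is avoided
    similarity_matrix = 1 - pairwise_distances(data, metric='cosine', n_jobs=1)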

Andreas Mueller