I have a large data frame whose index is movie_id and whose column headers are tag_id. Each row represents movie-to-tag relevance:
             639755209030196  691838465332800
46126718359            0.042            0.245
46130382440            0.403            0.300
46151724544            0.032            0.040
Then I do the following:
from sklearn.metrics import pairwise_distances

data = df.values
similarity_matrix = 1 - pairwise_distances(data, data, metric='cosine', n_jobs=-2)
There are close to 8,000 unique tags, so the shape of the data is 42588 × 8000. Prior to the line above I delete all unnecessary data objects to free up memory, yet I still get the error below on a machine with 40 GB of memory.
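For scale, here is a quick back-of-the-envelope check of the memory involved (assuming float64 throughout, which is what df.values gives for a float frame):

import numpy as np

n_movies, n_tags = 42588, 8000
itemsize = np.dtype(np.float64).itemsize  # 8 bytes

input_bytes = n_movies * n_tags * itemsize     # the 42588 x 8000 input
output_bytes = n_movies * n_movies * itemsize  # the 42588 x 42588 result

print('input:  %.1f GiB' % (input_bytes / 2.0 ** 30))   # ~2.5 GiB
print('output: %.1f GiB' % (output_bytes / 2.0 ** 30))  # ~13.5 GiB

So the input and output together should fit comfortably in 40 GB, and yet the failure happens inside os.fork, which pairwise_distances triggers through multiprocessing when n_jobs is not 1.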
Exception in thread Thread-4:
Traceback (most recent call last):
  File "~/anaconda/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "~/anaconda/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 326, in _handle_workers
    pool._maintain_pool()
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 230, in _maintain_pool
    self._repopulate_pool()
  File "~/anaconda/lib/python2.7/multiprocessing/pool.py", line 223, in _repopulate_pool
    w.start()
  File "~/anaconda/lib/python2.7/multiprocessing/process.py", line 130, in start
    self._popen = Popen(self)
  File "~/anaconda/lib/python2.7/multiprocessing/forking.py", line 121, in __init__
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
What could be the reason? Is the matrix too large? What are my options for avoiding this memory problem?
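One workaround I am considering is to leave n_jobs at its default of 1 (so no fork ever happens) and compute the result in row blocks, writing into a disk-backed array so the full 42588 × 42588 result never has to sit in RAM at once. This is only a minimal sketch; block_size and the output path are arbitrary choices of mine:

import numpy as np
from sklearn.metrics import pairwise_distances

def blocked_cosine_similarity(data, out_path='similarity.dat', block_size=1000):
    """Fill a disk-backed (n, n) array with 1 - cosine distance, block by block."""
    n = data.shape[0]
    # memory-mapped output: pages can be flushed to disk instead of held in RAM
    sim = np.memmap(out_path, dtype=np.float64, mode='w+', shape=(n, n))
    for start in xrange(0, n, block_size):
        stop = min(start + block_size, n)
        # default n_jobs=1: runs in-process, no multiprocessing fork
        sim[start:stop] = 1 - pairwise_distances(data[start:stop], data,
                                                 metric='cosine')
    sim.flush()
    return sim

similarity_matrix = blocked_cosine_similarity(data)

Each iteration then only materializes a block_size × 42588 temporary (about 0.3 GiB for block_size=1000). Would something like this be a sensible direction?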
I am currently using:
Python 2.7
scikit-learn 0.15.2 (np19py27_0)
Red Hat Linux, x86_64, with 4 × 4 cores