
I am new to Python and am not sure why memory usage spikes so dramatically when I use NumPy's hstack to join two pandas data frames. pandas.concat performed even worse, when it finished at all, so I am using NumPy instead.

The two data frames are relatively large, but I have 20 GB of free RAM (11 GB is in use, including the two data frames I want to join).

The data frames a and b have the following shapes:

a.shape (66377, 30)
b.shape (66377, 11100)

When I call np.hstack((a, b)), the 20 GB of free RAM is completely used up.

B_Miner
  • What is the `dtype` of your data? For float64, `b` should be about 5.5 GB, so the result of `np.hstack` should only add about 5.5 GB as well. – JoshAdel May 23 '14 at 00:39
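To put numbers on that comment, a rough back-of-the-envelope sketch (assuming float64, i.e. 8 bytes per value):

b_bytes = 66377 * 11100 * 8   # b's shape times 8 bytes per float64 value
print(b_bytes / 1024**3)      # ~5.5 GiB for b alone; hstack's output needs about the same again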

2 Answers


np.hstack returns a new array containing a copy of the underlying data, so you've doubled your memory usage when you do this.
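A quick way to see that a copy is made (a minimal sketch; np.shares_memory needs a reasonably recent NumPy):

import numpy as np
a = np.ones((3, 2))
b = np.ones((3, 4))
c = np.hstack((a, b))          # allocates a brand-new (3, 6) array
c[0, 0] = 99.0
print(a[0, 0])                 # still 1.0: modifying c does not touch a
print(np.shares_memory(a, c))  # False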

You can check the memory usage of each array using a.nbytes, b.nbytes, etc. (for a pandas DataFrame, a.values.nbytes).
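For example (an illustrative sketch, using the smaller of the two shapes from the question):

import numpy as np
a = np.zeros((66377, 30))   # float64 by default: 8 bytes per value
print(a.nbytes)             # 15930480 bytes, roughly 15 MB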

JoshAdel

As shown in this thread, it is not possible to append to a NumPy array in place, and it would not be efficient anyway, since there is no guarantee that the extended array could be kept contiguous in memory.

Python should free the memory once you drop the old references to a and b after concatenating the arrays (rebinding a releases the original array, and del b releases b):

import numpy as np
a = np.append(a, b, axis=1)  # np.append copies; rebinding a drops the original array
del b                        # remove the last reference to b so it can be freed

If the memory is not freed, you can force a collection:

import gc
gc.collect()  # force a full garbage-collection pass
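As an aside, np.append with an axis argument is a thin wrapper around np.concatenate, so the concatenation step above can equivalently be written as (a self-contained sketch with toy shapes):

import numpy as np
a = np.ones((4, 2))
b = np.ones((4, 3))
a = np.concatenate((a, b), axis=1)  # new (4, 5) array; the old a is now unreferenced
del b                               # drop the last reference so b can be freed too
print(a.shape)                      # (4, 5)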
Saullo G. P. Castro
  • Calling gc will only do something for circular references, and you shouldn't have those with NumPy anyway. – Davidmh May 23 '14 at 11:40
  • @Davidmh you are right, but I got "addicted" to `gc.collect()` after I solved a memory leak in [this application](https://github.com/compmech/compmech/blob/master/compmech/conecyl/conecyl.py) with it. The leak was due to some issue with Cython + scipy.sparse.csr_matrix + NumPy that I still haven't tracked down... – Saullo G. P. Castro May 23 '14 at 11:49