
I have a very simple Python script, but for some reason I get the following error when running it on a large amount of data:

*** glibc detected *** python: double free or corruption (out): 0x00002af5a00cc010 ***

I am used to these errors coming up in C or C++, when one tries to free memory that has already been freed. However, by my understanding of Python (and especially the way I've written the code), I really don't understand why this should happen.

Here is the code:

#!/usr/bin/python -tt

import sys, commands, string
import numpy as np
import scipy.io as io
from time import clock

# Load the sparse similarity matrix W from a MATLAB .mat file
W = io.loadmat(sys.argv[1])['W']
size = W.shape[0]
numlabels = int(sys.argv[2])

# Dense per-frame label distributions: one row per frame, one column per label
Q = np.zeros((size, numlabels), dtype=np.double)
P = np.zeros((size, numlabels), dtype=np.double)
Q += 1.0 / Q.shape[1]  # start Q as a uniform distribution over labels

nu = 0.001
mu = 0.01

# The sparse*dense product below is where the crash occurs
start = clock()
mat = -nu + mu*(W*(np.log(Q)-1))
end = clock()
print >> sys.stderr, "Time taken to compute matrix: %.2f seconds" % (end-start)

One may ask: why declare both a P and a Q numpy array? I do that simply to reflect the actual conditions, as this code is a segment of what I actually run, where I need a P matrix and declare it beforehand.

I have access to a 192GB machine, so I tested this on a very large SciPy sparse matrix (2.2 million by 2.2 million, but very sparse; sparsity is not the issue). The main memory is taken up by the Q, P, and mat matrices, as they are all dense 2.2 million by 2000 matrices (size = 2.2 million, numlabels = 2000). Peak memory goes up to 131GB, which fits comfortably in RAM. The glibc error appears while the mat matrix is being computed, and my process then goes into the sleep (S) state without deallocating the 131GB it has taken up.
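
As a back-of-the-envelope check (my arithmetic, not part of the original report), the 131GB peak is consistent with roughly four such dense arrays being live at once, e.g. Q, P, mat, and a temporary like np.log(Q)-1:

per_array_gib = 2.2e6 * 2000 * 8 / 2**30  # one size x numlabels float64 array: ~32.8 GiB
print(4 * per_array_gib)                   # ~131 GiB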

Given the bizarre (for Python) error (I am not explicitly deallocating anything), and the fact that the code works fine for smaller matrix sizes (around 1.5 million by 2000), I am really not sure where to start debugging.

As a starting point, I set "ulimit -s unlimited" before running, but to no avail.

Any help or insight into numpy's behavior with really large amounts of data would be welcome.

Note that this is NOT an out-of-memory error: I have 192GB, and my process reaches around 131GB and stays there for some time before giving the error above.

Update: February 16, 2013 (1:10 PM PST):

As per the suggestions, I ran Python under GDB. Interestingly, on one GDB run I forgot to set the stack size limit to "unlimited" and got the following output:

*** glibc detected *** /usr/bin/python: munmap_chunk(): invalid pointer: 0x00007fe7508a9010 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x733b6)[0x7ffff6ec23b6]
/usr/lib64/python2.7/site-packages/numpy/core/multiarray.so(+0x4a496)[0x7ffff69fc496]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4e67)[0x7ffff7af48c7]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x309)[0x7ffff7af6c49]
/usr/lib64/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7ffff7b25592]
/usr/lib64/libpython2.7.so.1.0(+0xfcc61)[0x7ffff7b33c61]
/usr/lib64/libpython2.7.so.1.0(PyRun_FileExFlags+0x84)[0x7ffff7b34074]
/usr/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x189)[0x7ffff7b347c9]
/usr/lib64/libpython2.7.so.1.0(Py_Main+0x36c)[0x7ffff7b3e1bc]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7ffff6e6dbfd]
/usr/bin/python[0x4006e9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 09:01 50336181                           /usr/bin/python2.7
00600000-00601000 r--p 00000000 09:01 50336181                           /usr/bin/python2.7
00601000-00602000 rw-p 00001000 09:01 50336181                           /usr/bin/python2.7
00602000-00e5f000 rw-p 00000000 00:00 0                                  [heap]
7fdf2584c000-7ffff0a66000 rw-p 00000000 00:00 0 
7ffff0a66000-7ffff0a6b000 r-xp 00000000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0a6b000-7ffff0c6a000 ---p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6a000-7ffff0c6b000 r--p 00004000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6b000-7ffff0c6c000 rw-p 00005000 09:01 50333916                   /usr/lib64/python2.7/lib-dynload/mmap.so
7ffff0c6c000-7ffff0c77000 r-xp 00000000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0c77000-7ffff0e76000 ---p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e76000-7ffff0e77000 r--p 0000a000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e77000-7ffff0e78000 rw-p 0000b000 00:12 54138483                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/streams.so
7ffff0e78000-7ffff0e79000 rw-p 00000000 00:00 0 
7ffff0e79000-7ffff0e9b000 r-xp 00000000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff0e9b000-7ffff109a000 ---p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109a000-7ffff109b000 r--p 00021000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109b000-7ffff109f000 rw-p 00022000 00:12 54138481                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio5_utils.so
7ffff109f000-7ffff10a0000 rw-p 00000000 00:00 0 
7ffff10a0000-7ffff10a5000 r-xp 00000000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff10a5000-7ffff12a4000 ---p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a4000-7ffff12a5000 r--p 00004000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a5000-7ffff12a7000 rw-p 00005000 09:01 50333895                   /usr/lib64/python2.7/lib-dynload/zlib.so
7ffff12a7000-7ffff12ad000 r-xp 00000000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff12ad000-7ffff14ac000 ---p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ac000-7ffff14ad000 r--p 00005000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ad000-7ffff14ae000 rw-p 00006000 00:12 54138491                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/io/matlab/mio_utils.so
7ffff14ae000-7ffff14b5000 r-xp 00000000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff14b5000-7ffff16b4000 ---p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b4000-7ffff16b5000 r--p 00006000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b5000-7ffff16b6000 rw-p 00007000 00:12 54138562                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_csgraph.so
7ffff16b6000-7ffff17c2000 r-xp 00000000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff17c2000-7ffff19c2000 ---p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c2000-7ffff19c3000 r--p 0010c000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c3000-7ffff19c6000 rw-p 0010d000 00:12 54138558                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_bsr.so
7ffff19c6000-7ffff19d5000 r-xp 00000000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff19d5000-7ffff1bd4000 ---p 0000f000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
7ffff1bd4000-7ffff1bd5000 r--p 0000e000 00:12 54138561                   /home/avneesh/.local/lib/python2.7/site-packages/scipy/sparse/sparsetools/_dia.so
Program received signal SIGABRT, Aborted.
0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff6e81ab5 in raise () from /lib64/libc.so.6
#1  0x00007ffff6e82fb6 in abort () from /lib64/libc.so.6
#2  0x00007ffff6ebcdd3 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ec23b6 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69fc496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#5  0x00007ffff7af48c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#6  0x00007ffff7af6c49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#7  0x00007ffff7b25592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#8  0x00007ffff7b33c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#9  0x00007ffff7b34074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#10 0x00007ffff7b347c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007ffff7b3e1bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007ffff6e6dbfd in __libc_start_main () from /lib64/libc.so.6
#13 0x00000000004006e9 in _start ()

When I set the stack size limit to "unlimited", I get the following:

*** glibc detected *** /usr/bin/python: double free or corruption (out): 0x00002abb2732c010 ***
^X^C
Program received signal SIGINT, Interrupt.
0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
(gdb) bt
#0  0x00002aaaab9d08fe in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x00002aaaab969f2e in _L_lock_9927 () from /lib64/libc.so.6
#2  0x00002aaaab9682d1 in free () from /lib64/libc.so.6
#3  0x00002aaaaaabbfe2 in _dl_scope_free () from /lib64/ld-linux-x86-64.so.2
#4  0x00002aaaaaab70a4 in _dl_map_object_deps () from /lib64/ld-linux-x86-64.so.2
#5  0x00002aaaaaabcaa0 in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#6  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#7  0x00002aaaaaabc5da in _dl_open () from /lib64/ld-linux-x86-64.so.2
#8  0x00002aaaab9fb530 in do_dlopen () from /lib64/libc.so.6
#9  0x00002aaaaaab85f6 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#10 0x00002aaaab9fb5cf in dlerror_run () from /lib64/libc.so.6
#11 0x00002aaaab9fb637 in __libc_dlopen_mode () from /lib64/libc.so.6
#12 0x00002aaaab9d60c5 in init () from /lib64/libc.so.6
#13 0x00002aaaab080933 in pthread_once () from /lib64/libpthread.so.0
#14 0x00002aaaab9d61bc in backtrace () from /lib64/libc.so.6
#15 0x00002aaaab95dde7 in __libc_message () from /lib64/libc.so.6
#16 0x00002aaaab9633b6 in malloc_printerr () from /lib64/libc.so.6
#17 0x00002aaaab9682dc in free () from /lib64/libc.so.6
#18 0x00002aaaabef1496 in ?? () from /usr/lib64/python2.7/site-packages/numpy/core/multiarray.so
#19 0x00002aaaaad888c7 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#20 0x00002aaaaad8ac49 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#21 0x00002aaaaadb9592 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#22 0x00002aaaaadc7c61 in ?? () from /usr/lib64/libpython2.7.so.1.0
#23 0x00002aaaaadc8074 in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#24 0x00002aaaaadc87c9 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#25 0x00002aaaaadd21bc in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#26 0x00002aaaab90ebfd in __libc_start_main () from /lib64/libc.so.6
#27 0x00000000004006e9 in _start ()

This makes me believe the basic issue is in the numpy multiarray core module (frame #4 in the first backtrace and frame #18 in the second). I will file bug reports with both numpy and scipy just in case.

Has anyone seen this before?

Update: February 17, 2013 (4:45 PM PST)

I found a machine I could run the code on that had more recent versions of SciPy (0.11) and NumPy (1.7.0). Running the code directly (without GDB) resulted in a segmentation fault with no output to stdout or stderr. Running it again through GDB, I get the following:

Program received signal SIGSEGV, Segmentation fault.
0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00002aaaabead970 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaac5fcd04 in PyDataMem_FREE (ptr=<optimized out>, $K8=<optimized out>) at numpy/core/src/multiarray/multiarraymodule.c:3510
#2  array_dealloc (self=0xc00ab7edbfc228fe) at numpy/core/src/multiarray/arrayobject.c:416
#3  0x0000000000498eac in PyEval_EvalFrameEx ()
#4  0x000000000049f1c0 in PyEval_EvalCodeEx ()
#5  0x00000000004a9081 in PyRun_FileExFlags ()
#6  0x00000000004a9311 in PyRun_SimpleFileExFlags ()
#7  0x00000000004aa8bd in Py_Main ()
#8  0x00002aaaabe4f76d in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x000000000041b9b1 in _start ()

I understand this is not as useful as a backtrace from a NumPy compiled with debugging symbols; I will try doing that and post the output later.

Avneesh
    Interesting -- Although, you might be better off filing a bug report with `scipy`. Perhaps compiling with `-g` and running under `gdb` would be helpful as well -- At least then you could see where the code was when the error happened... – mgilson Feb 16 '13 at 04:33
  • what OS are you using? (and what is this data?) – tacaswell Feb 16 '13 at 04:33
  • What Python/numpy version are you using? I see some bug reports about this, but they are old and suggest the problem is now fixed. – BrenBarn Feb 16 '13 at 04:36
  • BrenBarn and tacaswell: Numpy version is 1.5.1, scipy version is 0.9.0, and Python version is 2.7. The OS is Linux openSUSE 11.4, x86-64. The data (W) is a similarity matrix between audio frames for a speech recognition task. The P and Q matrices represent, for each frame (row), the distribution over all possible labels (columns). This will then go downstream into a speech recognition decoder. I will rerun under gdb and edit the post with the output. I'll also post on the scipy list. If you have any insight into these errors on the versions I provided, do let me know. – Avneesh Feb 16 '13 at 20:25
  • For those interested in tracking this bug, I have filed tickets with both SciPy (http://projects.scipy.org/scipy/ticket/1846) and NumPy (https://github.com/numpy/numpy/issues/2995) – Avneesh Feb 16 '13 at 21:59
  • Try building your own NumPy with debugging symbols enabled, as mgilson suggests. That will make the gdb output more useful. – nneonneo Feb 17 '13 at 00:58
  • Also, if I'm not mistaken, Numpy 1.5.1 is quite far out of date (more than two years old). Have you tried upgrading Numpy? – nneonneo Feb 17 '13 at 00:59
  • That's a bug (probably in a C extension). However, the place where it shows up is usually not the place where it broke, but much later (the backtrace shows where the error was finally detected, not where it was made). Try `valgrind --tool=memcheck python xxx.py` to find out where exactly it is. – Danny Milosavljevic Feb 17 '13 at 16:11
  • Running valgrind did not really produce any interpretable output, at least to me. I basically get a bunch of "Invalid read of size 4" and a few other errors. Is there anything I should be looking for? I will try a more recent Numpy version and get back to you. Recompiling Numpy can also be done, but I'll try the easier check first. – Avneesh Feb 18 '13 at 00:29
  • Have a look at my latest update - something is wrong in #2, with array_dealloc (there is also a parallel discussion on the issue tracker in numpy's github page). I'm having some trouble building and installing my own numpy, but hopefully the results on the most recent version may give some insight. – Avneesh Feb 18 '13 at 03:18
  • I seem to remember a glibc problem with more than 131072 MB (0x2000000000 bytes) of heap. The same error surfaced in Xen heap size calculations, ages ago. Given that your peak usage is "131 GB", it could conceivably be related. What happens with slightly *larger* matrices? They will crash, but does this still happen at 131 GB, or later? If there is a barrier at 128 GiB, it might not be a bug in numpy at all. – LSerni Feb 18 '13 at 11:55

2 Answers


After discussion of this same issue on the Numpy GitHub page (https://github.com/numpy/numpy/issues/2995), it was brought to my attention that Numpy/Scipy does not support such a large number of non-zeros in the resulting sparse matrix.

Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a sparse matrix by a dense one, the resulting product is also represented in sparse matrix form (which makes a lot of sense). However, since there are no zero rows in my W matrix, the resulting product W*(np.log(Q)-1) has nnz > 2^31 (2.2 million multiplied by 2000 is 4.4 billion), and this exceeds the maximum number of elements in a sparse matrix in current versions of Scipy.
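
As a quick sanity check (my own illustration, not from the original discussion), the overflow is easy to verify from the numbers above:

# Every row of W is non-empty, so the sparse product stores one entry
# per cell of the size x numlabels result.
size, numlabels = 2200000, 2000
nnz = size * numlabels
print(nnz > 2**31)  # True: 4.4 billion entries exceeds the 32-bit index limit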

At this stage, I'm not sure how else to get this to work, barring a re-implementation in another language. Perhaps it can still be done in Python, but it might be better to just write a C++ implementation using Eigen.

A special thanks to pv. for helping to pinpoint the exact issue, and thanks to everyone else for the brainstorming!

Avneesh
  • It seems to me that if it's reasonable for you to have a dense matrix as large as `W*(np.log(Q)-1)`, it may well _also_ be reasonable for you to have a dense matrix as large as `W`. In which case, there's an easy answer: `W.todense()*(np.log(Q)-1)`. – abarnert Feb 18 '13 at 19:35
  • No, I wouldn't say that is true: `W*(np.log(Q)-1)` is 2.2 million x 2000, whereas `W.todense()` is 2.2 million x 2.2 million. – Avneesh Feb 19 '13 at 17:32

> Basically, W is a sparse matrix, and Q (or np.log(Q)-1) is a dense matrix. When multiplying a dense matrix with a sparse one, the resulting product will also be represented in sparse matrix form (which makes a lot of sense).

I'm probably missing something really obvious here, and will end up looking like an idiot, but…

If Q is a dense matrix, and you're hoping to store the result as a dense matrix, you probably have enough memory to hold W as a dense matrix as well. Which means:

W.todense()*(np.log(Q)-1)

Looking at the details, as you calculated in the comments, this would require 35.8GB of temporary memory. Given that you've got 131GB of data and this "fits comfortably into memory", it seems at least plausible that temporarily using another 35.8GB would be reasonable.

If it's not reasonable, you can always decompose the matrix multiplication yourself. Obviously doing it row by row or column by column would make your entire process much slower (maybe not as slow as pushing the process over the edge into swapping, but still perhaps far too slow to be acceptable). But doing it in, e.g., chunks of 1GB worth of rows at a time shouldn't be too bad: that would mean temporary storage on the order of a few GB and probably only a small slowdown. Of course the code gets more complicated and ugly, but not unmanageably so; see the sketch below.
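
Here's a minimal sketch of that chunked idea (my own illustration, not tested against the original data; the function name and chunk budget are arbitrary, and it assumes W is a scipy.sparse matrix that supports row slicing, such as CSR):

import numpy as np

def chunked_product(W, logQ, bytes_per_chunk=2**30):
    # Compute the equivalent of W.todense()*logQ without ever densifying
    # all of W: densify ~1GB worth of W's rows at a time, so temporary
    # storage stays near-constant instead of scaling with N.
    rows = max(1, bytes_per_chunk // (W.shape[1] * 8))  # dense float64 rows per chunk
    out = np.empty((W.shape[0], logQ.shape[1]), dtype=np.double)
    for start in range(0, W.shape[0], rows):
        stop = min(start + rows, W.shape[0])
        slab = W[start:stop].toarray()    # small dense slab of W
        out[start:stop] = slab.dot(logQ)  # ordinary dense matrix product
    return out

With the question's variables, usage would be something like mat = -nu + mu*chunked_product(W, np.log(Q)-1).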

abarnert
  • I don't think what you said works out. `W` is an `N*N` matrix, where `N` is 2.2 million, not an `N*M` matrix, and `Q` is an `N*M` matrix, where `M = 2000`. The sparse matrix currently has 42,666,354 non-zeros, for a total memory of (42666354*8)/(2^20) ≈ 326MB; converting it to a dense matrix would mean (2.2 million * 2.2 million * 8)/(2^40) ≈ 35 TB. This increases memory significantly. – Avneesh Feb 19 '13 at 17:19
  • In addition, the original plan was to scale this up to `N = 7.2 million`, which would make the `W.todense()` solution even less feasible. – Avneesh Feb 19 '13 at 17:27
  • @Avneesh: I realize it means more memory, but my assumption was that if 131GB "fits comfortably into memory", then another 35.8GB of temporary storage (for the duration of this multiplication) may be acceptable. Of course it may _not_ be acceptable, but it seemed at least worth trying. – abarnert Feb 19 '13 at 19:14
  • @Avneesh: Meanwhile, you can break a matrix multiplication up into pieces, and densify pieces of the `W` matrix a bit at a time. Obviously iterating row by row will take ridiculously long, but iterating 1GB worth at a time should only be a little slower than doing it all at once, and only require a few GB of temporary storage instead of 36GB (and more importantly, stay near-constant instead of scaling with N). – abarnert Feb 19 '13 at 19:17
  • Hi abarnert, thanks for your suggestions. I realize it is definitely doable in Python, but I feel switching implementations would be easier overall. In fact, the example I gave above is one of the smaller problem sizes I will be dealing with. – Avneesh Feb 24 '13 at 05:26
  • @Avneesh: If you're considering reimplementing your entire project in C++ as a workaround, have you considered that it might be just as easy to add sparse*dense=dense multiplication to `numpy`? – abarnert Feb 25 '13 at 19:02