
[Note: although there are already some posts about dealing with large matrices in numpy, they do not address my specific concerns.]

I am trying to load a 30820x12801 matrix stored in a 1.02G .txt file with numpy.loadtxt(), and I get a MemoryError.

This wouldn't be so surprising except that:

  1. I am using 64-bit Python.
  2. I am running the job on a supercomputer with 50G of virtual memory allocated to it.

From what I know, a 1G matrix shouldn't be a problem for 64-bit Python, and certainly not with 50G of RAM.
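
Even assuming numpy's default float64 dtype (8 bytes per value), the final array should only need around 3 GB:

    # Rough in-memory size of the final float64 array for the shape above
    rows, cols = 30820, 12801
    print(rows * cols * 8 / 1e9)  # ~3.16 GB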

(This is the first time I am dealing with large datasets, so I may be missing something basic.)


Extra information:

  • When using open() the file loads into Python without any problems.
  • Output of `ulimit -a | grep "max memory size"`: `(kbytes, -m) unlimited`
  • Full error message:

    Traceback (most recent call last):
      File "jPCA/jPCA_pipeline.py", line 87, in <module>
        MATRIX = get_matrix(new_file_prefix, N)
      File "jPCA/jPCA_pipeline.py", line 70, in get_matrix
        MATRIX = np.loadtxt('{}_N={}.txt'.format(new_file_prefix, N))
      File "/home/hers_en/fsimoes/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1159, in loadtxt
        for x in read_data(_loadtxt_chunksize):
      File "/home/hers_en/fsimoes/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1087, in read_data
        items = [conv(val) for (conv, val) in zip(converters, vals)]
      File "/home/hers_en/fsimoes/miniconda3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 1087, in <listcomp>
        items = [conv(val) for (conv, val) in zip(converters, vals)]
    MemoryError

Soap
  • Are you able to load the file into memory when not using Numpy? – Adomas Baliuka May 26 '20 at 11:50
  • What format is the text file? What dtype are you loading it as? Can you share the actual error message? – Seb May 26 '20 at 11:52
  • Could you also paste the output of `ulimit -a | grep "max memory size"` ? – Balaji Ambresh May 26 '20 at 11:57
  • @AdomasBaliuka Please see the extra information I added. – Soap May 26 '20 at 12:14
  • Also @AdomasBaliuka – Soap May 26 '20 at 12:15
  • And also @Seb (could not add all your names in one comment). – Soap May 26 '20 at 12:15
  • @Soap So `with open('/path/to/file.txt', 'rt') as f: contents = f.read()` does not cause a memory error? And again, what is the format of the file? CSV? – Seb May 26 '20 at 12:27
  • @Seb It's a tab delimited text file `filename.txt` and yes, I did what you wrote and it works, except I did not use the `'rt'` option (just used the default 'r'). – Soap May 26 '20 at 12:46
  • The error occurs when converting one line of the file (presumably to the default `float` dtype). I'd suggest testing the load with a `max_rows` to verify that it can load just a portion of the file. Let's make sure there isn't a problem with the delimiter and conversion format. – hpaulj May 26 '20 at 16:05
  • @hpaulj I've done that before with no problems (using only 1000 rows). – Soap May 26 '20 at 16:20
  • `loadtxt` collects the data as a list of lists before converting it to a numpy array. That means it's hard to measure or estimate how much memory it's using as it reads lines. As a wild shot, you might try `genfromtxt` instead, or `pandas` (see the sketch after these comments). – hpaulj May 26 '20 at 16:36
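
Following the suggestions in the comments, here is a minimal sketch of both the `max_rows` check and the `pandas` alternative. It assumes a tab-delimited text file of floats; the file name `data.txt` is a placeholder for the actual path.

    import numpy as np
    import pandas as pd

    # Diagnostic: load only the first 1000 rows to rule out delimiter/format problems.
    sample = np.loadtxt('data.txt', max_rows=1000)
    print(sample.shape, sample.dtype)

    # Alternative from the comments: pandas' C parser is generally far more
    # memory-efficient than np.loadtxt for large text files.
    matrix = pd.read_csv('data.txt', sep='\t', header=None, dtype=np.float64).to_numpy()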

1 Answer


It turns out that 50G of virtual memory was sufficient: my job had only been getting 10G before (not 200G after all; I had some problems with the job submission). Once I fixed that, 50G was enough, although the import still took approximately 6 hours to run.

This still surprises me, because I would have thought that 10G would be more than enough to import a 1G matrix, but as the comments explain, it comes down to the way loadtxt works: it collects the values as a list of Python lists before building the array, and those intermediate Python objects take far more memory than the 8 bytes per element of the final float64 array.
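
A more memory-friendly pattern is to preallocate the array and fill it line by line, which avoids those intermediate lists entirely. This is only a sketch, assuming a tab-delimited file of floats whose shape is known in advance (`data.txt` is a placeholder for the actual path):

    import numpy as np

    rows, cols = 30820, 12801                          # shape known in advance
    matrix = np.empty((rows, cols), dtype=np.float64)  # ~3.2 GB, allocated once

    with open('data.txt') as f:
        for i, line in enumerate(f):
            matrix[i] = np.array(line.split(), dtype=np.float64)

Saving the result once with np.save() and reloading it with np.load() afterwards also makes any later loads much faster than re-parsing the text file.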

Soap