
I have some rather large datasets that I'm working with. Essentially, I'm running some of the tools from scikit-learn on memory-mapped numpy arrays, since this seems to let me work with datasets larger than my computer's memory would otherwise allow.
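For context, this is roughly what my current workflow looks like; the estimator, file path, and sizes below are just small stand-ins for the real thing:

import numpy as np
from sklearn.decomposition import IncrementalPCA  # example estimator

# a small stand-in for the real on-disk array, opened as a memory-map
X = np.memmap('/tmp/X.mm', dtype=np.double, mode='w+', shape=(10_000, 200))
X[:] = np.random.rand(10_000, 200)

# fit in chunks so only one batch sits in RAM at a time
ipca = IncrementalPCA(n_components=10)
for i in range(0, X.shape[0], 1_000):
    ipca.partial_fit(X[i:i + 1_000])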

I'd rather let joblib do the memory mapping, because then you only have to specify the file.
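As a minimal sketch of what I mean (the path is just an example): an array saved once with joblib.dump can later be reopened as a memory-map simply by passing the file name and mmap_mode to joblib.load.

import numpy as np
from joblib import dump, load

data = np.random.rand(1_000, 200)       # small example array
dump(data, '/tmp/data.joblib')          # example path

# reopen as a memory-map just by naming the file
data_mm = load('/tmp/data.joblib', mmap_mode='r')
print(type(data_mm))                    # <class 'numpy.memmap'>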

But I can't figure out how to allocate a new, empty numpy array (say, 100 million by 200) using only joblib, without loading it all into memory.

Thanks!


1 Answer


I think you can do this by allocating a temporary array using np.memmap, then saving it using joblib.dump:

import numpy as np
from joblib import dump, load
import os

# allocate a temporary memory-mapped array (shape must be a tuple of ints)
init_pth = '/tmp/empty.mm'
mm = np.memmap(init_pth, dtype=np.double, mode='w+', shape=(100_000_000, 200))

# write some values to the first row
mm[0, :5] = np.arange(5)

# dump to joblib format
mmap_pth = '/tmp/test.mmap'
dump(mm, mmap_pth, compress=0)

# we can now delete the temporary array
os.remove(init_pth)

# load the memmap using joblib
mm2 = load(mmap_pth, mmap_mode='r+')

# print the first 5 values
print(mm2[0, :5])
# [ 0.  1.  2.  3.  4.]

This is rather inefficient, though, since it involves allocating a huge temporary array on disk and then copying it.
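If the temporary copy is the main concern, one possible workaround (just a sketch, and it steps outside joblib's own file format) is to allocate the on-disk array directly in .npy format with numpy.lib.format.open_memmap, which avoids the intermediate copy. You would then reopen it with np.load instead of joblib.load:

import numpy as np

# allocate the array directly on disk in .npy format (example path);
# nothing is copied and nothing is read into memory up front
arr = np.lib.format.open_memmap('/tmp/big.npy', mode='w+',
                                dtype=np.double, shape=(100_000_000, 200))
arr[0, :5] = np.arange(5)
arr.flush()

# reopen later as a memory-map
arr2 = np.load('/tmp/big.npy', mmap_mode='r+')
print(arr2[0, :5])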

ali_m
  • Right. However, you still have to run np.memmap first, so you allocate two arrays instead of one. Is there a way around this? – none Jan 14 '15 at 18:18