
I have trouble understanding how numpy.memmap works. The background is that I need to shrink a large numpy array saved on disk by deleting entries. Reading the array in and building a new one by copying the desired parts doesn't work: it simply doesn't fit into memory. So the idea is to use numpy.memmap, i.e. to work on disk. Here is my code (with a small file):

import numpy

in_file = './in.npy'
in_len = 10
out_file = './out.npy'
out_len = 5

# Set up input dummy-file
dummy_in = numpy.zeros(shape=(in_len,1),dtype=numpy.dtype('uint32'))
for i in range(in_len):
    dummy_in[i] = i + i
numpy.save(in_file, dummy_in)

# get dtype and shape from the in_file
in_npy = numpy.load(in_file)

in_dtype = in_npy.dtype
in_shape = (in_npy.shape[0],1)
del(in_npy)

# generate an 'empty' out_file with the desired dtype and shape
out_shape = (out_len,1)
out_npy = numpy.zeros(shape=out_shape, dtype=in_dtype)
numpy.save(out_file, out_npy)
del(out_npy)

# memmap both files
in_memmap = numpy.memmap( in_file,  mode='r',  shape=in_shape, dtype=in_dtype)
out_memmap = numpy.memmap(out_file, mode='r+', shape=out_shape, dtype=in_dtype)
print("in_memmap")
print(in_memmap, "\n")
print("out_memmap before in_memmap copy")
print(out_memmap, "\n")

# copy some parts
for i in range(out_len):
    out_memmap[i] = in_memmap[i]

print("out_memmap after in_memmap copy")
print(out_memmap, "\n")
out_memmap.flush()

# test
in_data = numpy.load(in_file)
print("in.npy")
print(in_data)
print(in_data.dtype, "\n")

out_data = numpy.load(out_file)
print("out.npy")
print(out_data)
print(out_data.dtype, "\n")

Running this code I get:

in_memmap
[[1297436307]
 [     88400]
 [ 662372422]
 [1668506980]
 [ 540682098]
 [ 880098343]
 [ 656419879]
 [1953656678]
 [1601069426]
 [1701081711]]

out_memmap before in_memmap copy
[[1297436307]
 [     88400]
 [ 662372422]
 [1668506980]
 [ 540682098]]

out_memmap after in_memmap copy
[[1297436307]
 [     88400]
 [ 662372422]
 [1668506980]
 [ 540682098]]

in.npy
[[ 0]
 [ 2]
 [ 4]
 [ 6]
 [ 8]
 [10]
 [12]
 [14]
 [16]
 [18]]
uint32

out.npy
[[0]
 [0]
 [0]
 [0]
 [0]]
uint32

From the output it is clear that I'm doing something wrong:

1) The memmaps don't contain the values set in the arrays, and in_memmap and out_memmap contain the same values.

2) It is not clear whether the copy loop copies anything from in_memmap to out_memmap (the values are identical). Checking in_memmap[i] and out_memmap[i] in debug mode, I get the same result for both: memmap([1297436307], dtype=uint32). So can I assign them as in the code, or do I have to use out_memmap[i][0] = in_memmap[i][0]?

3) out.npy isn't updated to the out_memmap values by the flush() operation.

Can anyone please help me understand what I'm doing wrong here?

Thanks a lot

Daniel F
fdiehl
  • Your problem seems to be `np.save` and `np.memmap` have slightly different formats. Check [this](https://stackoverflow.com/questions/23062674/numpy-memmap-map-to-save-file) answer out – Daniel F Aug 08 '17 at 12:12
  • Also, if you're regularly using arrays bigger than your RAM can handle, check out [dask](https://dask.pydata.org/en/latest/) – Daniel F Aug 08 '17 at 12:17

1 Answer

Replace every instance of np.memmap with np.lib.format.open_memmap and you get:

in_memmap 
[[ 0]
 [ 2]
 [ 4]
 [ 6]
 [ 8]
 [10]
 [12]
 [14]
 [16]
 [18]] 

out_memmap before in_memmap copy 
[[0]
 [0]
 [0]
 [0]
 [0]] 

out_memmap after in_memmap copy 
[[0]
 [2]
 [4]
 [6]
 [8]] 

in.npy 
[[ 0]
 [ 2]
 [ 4]
 [ 6]
 [ 8]
 [10]
 [12]
 [14]
 [16]
 [18]] 
 uint32 

out.npy 
[[0]
 [2]
 [4]
 [6]
 [8]] 
 uint32 
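
For reference, here is a minimal sketch of that substitution, written for Python 3. The temporary directory, the variable names, and the choice to keep the first five rows are assumptions carried over from the question's dummy setup, not a general recipe.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

# Hypothetical scratch location; the question used ./in.npy and ./out.npy.
workdir = tempfile.mkdtemp()
in_file = os.path.join(workdir, "in.npy")
out_file = os.path.join(workdir, "out.npy")
in_len, out_len = 10, 5

# Dummy input with the same values as in the question: 0, 2, 4, ..., 18.
dummy_in = np.arange(0, 2 * in_len, 2, dtype=np.uint32).reshape(in_len, 1)
np.save(in_file, dummy_in)

# open_memmap parses the .npy header, so dtype and shape come from the file.
in_mm = open_memmap(in_file, mode="r")

# mode="w+" creates a new .npy file with a valid header and the given shape.
out_mm = open_memmap(out_file, mode="w+", dtype=in_mm.dtype, shape=(out_len, 1))

# Copy the rows to keep; the row-by-row loop from the question works as well.
out_mm[:] = in_mm[:out_len]
out_mm.flush()

# Release the maps before reading the file back.
del in_mm, out_mm

out_data = np.load(out_file)
print(out_data.ravel())  # [0 2 4 6 8]
```

Because the output file is created with mode="w+", there is no need to write a zero-filled array with np.save first. Assigning whole rows (out_memmap[i] = in_memmap[i]) is enough; indexing down to out_memmap[i][0] is not required.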

np.save writes a header in front of the data, and np.memmap, which treats the file as raw bytes, was reading that header as if it were array data. That is why both maps showed the same values: the first bytes of the two headers are identical. It is also why copying from one to the other had no visible effect: the loop only copied header bytes onto identical header bytes and never reached the data. np.lib.format.open_memmap parses the header and skips it automatically, so you work on the actual data.
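
The header is easy to see in the first bytes of a .npy file. The sketch below is only an illustration under assumed names (demo.npy in a temporary directory): it reads the magic string, reinterprets its first four bytes as a little-endian uint32, and then maps the file with plain np.memmap, once without and once with an offset that skips the header. It assumes the file was written in format version 1.0, which is what np.save produces for a simple array like this one.

```python
import os
import tempfile

import numpy as np
from numpy.lib import format as npy_format

# Hypothetical demo file, same values as the question's input.
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.arange(0, 20, 2, dtype=np.uint32).reshape(10, 1))

# The file starts with a magic string, not with array data.
with open(path, "rb") as fh:
    first = fh.read(6)
print(first)  # b'\x93NUMPY'

# The first four header bytes, read as a little-endian uint32.
as_u32 = int.from_bytes(first[:4], "little")
print(as_u32)  # 1297436307

# Plain np.memmap starts at byte 0, so it returns that same number.
raw = np.memmap(path, mode="r", dtype="<u4", shape=(1,))
first_raw = int(raw[0])
del raw

# Parse the header to find where the data begins, then pass that as offset.
with open(path, "rb") as fh:
    version = npy_format.read_magic(fh)
    assert version == (1, 0)  # assumption: version 1.0 header
    shape, fortran_order, dtype = npy_format.read_array_header_1_0(fh)
    offset = fh.tell()

data = np.memmap(path, mode="r", dtype=dtype, shape=shape, offset=offset)
print(data.ravel())
```

The value 1297436307 is the first entry printed for both in_memmap and out_memmap in the question. open_memmap does the header parsing and offset handling in one call, which is why it is the simpler choice here.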

Daniel F