I have around 6,000 json.gz files totalling 24 GB, which I need to run various calculations on.
Because I have no way of knowing in advance how many lines I will pick up from each JSON file (some lines with invalid data get rejected), I estimated a maximum of 2,000 lines per file.
I created a memory-mapped NumPy array of shape (6000*2000, 10) and parsed the data from the json.gz files into it (total size: 2.5 GB).
Because of the overestimation, it turned out that the last 10-15% of the rows are all zeros. Given the nature of my computation, I need to remove these invalid rows from the memmapped array. The priority is time, and after that memory.
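For context, the loading step looks roughly like this (a simplified sketch of what I described above; the dtype, the file name `data.npy`, and the `"values"` field are stand-ins for my actual parsing logic):

```python
import glob
import gzip
import json

import numpy as np

N_FILES = 6000
MAX_LINES_PER_FILE = 2000
N_COLS = 10

# Pre-allocated, disk-backed array (dtype is just an assumption here)
data = np.memmap("data.npy", dtype=np.float64, mode="w+",
                 shape=(N_FILES * MAX_LINES_PER_FILE, N_COLS))

row = 0
for path in glob.glob("*.json.gz"):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            try:
                values = json.loads(line).get("values")  # hypothetical field name
            except json.JSONDecodeError:
                continue
            if values is None or len(values) != N_COLS:
                continue                                 # reject invalid lines
            data[row] = values
            row += 1

data.flush()
# Everything from index 'row' onwards is still zero from the initial allocation.
```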
What would be the best method to do this? I know programmatically the exact indices of the rows to be removed.
- Create another memmapped array with the correct shape and size, and slice the original array into it (sketched below).
- Use the `numpy.delete()` function.
- Use masking.
- Something else?
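To make the options concrete, this is roughly what I have in mind (a sketch only; the file name, dtype, and index bookkeeping are placeholders, since in reality I get the valid indices from my parsing step):

```python
import numpy as np

N_TOTAL = 6000 * 2000
N_COLS = 10

# Placeholder: in practice I already know the exact indices to keep
n_valid = int(N_TOTAL * 0.87)
keep_idx = np.arange(n_valid)

old = np.memmap("data.npy", dtype=np.float64, mode="r",
                shape=(N_TOTAL, N_COLS))

# Option 1: new memmap with the correct shape, copy the valid rows across
new = np.memmap("data_clean.npy", dtype=old.dtype, mode="w+",
                shape=(len(keep_idx), N_COLS))
new[:] = old[keep_idx]        # fancy indexing copies the selected rows through memory
new.flush()

# Option 2: numpy.delete() -- returns a plain in-memory ndarray, not a memmap
drop_idx = np.arange(n_valid, N_TOTAL)
trimmed = np.delete(old, drop_idx, axis=0)

# Option 3: boolean masking -- also materialises an in-memory copy
mask = np.zeros(N_TOTAL, dtype=bool)
mask[keep_idx] = True
trimmed = old[mask]
```

My concern with options 2 and 3 is that they produce a regular in-memory array rather than staying memory-mapped, which is why I'm unsure which approach is best given the time/memory priorities above.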