I have around 6,000 json.gz files totalling 24 GB, which I need to run various calculations on.
Because I have no way of knowing in advance how many lines I will pick up from each JSON file (some lines with invalid data get rejected), I estimated a maximum of 2,000 lines per file.
I created a memory-mapped NumPy array of shape (6000*2000, 10) and parsed the data from the json.gz files into it (total size: 2.5 GB).
Because of the overestimation, it turned out that the last 10-15% of the rows are all zeros. Given the nature of my computation, I need to remove these invalid rows from the memmapped array. The priority is time, and after that memory.
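For context, the loading step looks roughly like this (a simplified sketch of what I described above; the dtype, the file name `data.npy`, and the `"values"` field are stand-ins for my actual parsing logic):

```python
import glob
import gzip
import json

import numpy as np

N_FILES = 6000
MAX_LINES_PER_FILE = 2000
N_COLS = 10

# Pre-allocated, disk-backed array (dtype is just an assumption here)
data = np.memmap("data.npy", dtype=np.float64, mode="w+",
                 shape=(N_FILES * MAX_LINES_PER_FILE, N_COLS))

row = 0
for path in glob.glob("*.json.gz"):
    with gzip.open(path, "rt") as fh:
        for line in fh:
            try:
                values = json.loads(line).get("values")  # hypothetical field name
            except json.JSONDecodeError:
                continue
            if values is None or len(values) != N_COLS:
                continue                                 # reject invalid lines
            data[row] = values
            row += 1

data.flush()
# Everything from index 'row' onwards is still zero from the initial allocation.
```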
What would be the best method to do this? I know programmatically the exact indices of the rows to be removed.
- Create another memmapped array with the correct shape and size, and slice the original array into it (sketched below).
- Use the `numpy.delete()` function.
- Use masking.
- Something else?
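To make the options concrete, this is roughly what I have in mind (a sketch only; the file name, dtype, and index bookkeeping are placeholders, since in reality I get the valid indices from my parsing step):

```python
import numpy as np

N_TOTAL = 6000 * 2000
N_COLS = 10

# Placeholder: in practice I already know the exact indices to keep
n_valid = int(N_TOTAL * 0.87)
keep_idx = np.arange(n_valid)

old = np.memmap("data.npy", dtype=np.float64, mode="r",
                shape=(N_TOTAL, N_COLS))

# Option 1: new memmap with the correct shape, copy the valid rows across
new = np.memmap("data_clean.npy", dtype=old.dtype, mode="w+",
                shape=(len(keep_idx), N_COLS))
new[:] = old[keep_idx]        # fancy indexing copies the selected rows through memory
new.flush()

# Option 2: numpy.delete() -- returns a plain in-memory ndarray, not a memmap
drop_idx = np.arange(n_valid, N_TOTAL)
trimmed = np.delete(old, drop_idx, axis=0)

# Option 3: boolean masking -- also materialises an in-memory copy
mask = np.zeros(N_TOTAL, dtype=bool)
mask[keep_idx] = True
trimmed = old[mask]
```

My concern with options 2 and 3 is that they produce a regular in-memory array rather than staying memory-mapped, which is why I'm unsure which approach is best given the time/memory priorities above.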