
I have an array of shape (20000000, 247), around 30 GB on disk as a .npy file, and 32 GB of available memory. I need to shuffle the data along the rows. I have opened the file in mmap_mode, but anything other than in-place modification, for example np.random.permutation, or building a randomly sampled array of indices p and returning array[p], raises MemoryError. I have also tried shuffling the array in chunks and then stacking the chunks to rebuild the full array, but that also raises MemoryError. The only solution I have found so far is loading the file with mmap_mode='r+' and then calling np.random.shuffle, but it takes forever (it has been running for 5 hours and it is still shuffling).

Current code:

import numpy as np
# Memory-map the file so the full 30 GB array is never loaded into RAM
array = np.load('data.npy', mmap_mode='r+')
np.random.seed(1)
# In-place row shuffle; every swap touches the file on disk, which is why it is so slow
np.random.shuffle(array)

Is there any faster method to do this without breaking the memory constraint?

Sayandip Dutta
  • Why don't you shard your large file into multiple smaller ones, of 1 GB each for example? That way you can shuffle the data inside each shard and also shuffle the order of the shards at the end. If required, you can then merge the shuffled shards into one large file (a sketch of this idea follows the comments). – carobnodrvo Sep 25 '19 at 11:43
  • @carobnodrvo I know; actually this is for training a neural network. I have working code that reads the data in chunks, shuffles each chunk and feeds it to the network. However, now I have the particular constraint that I need all of the shuffled data in the same file. Shuffling the chunks individually and then appending them does not work. – Sayandip Dutta Sep 25 '19 at 11:48
  • And the merging step is causing `MemoryError`; I am not sure if that is expected. – Sayandip Dutta Sep 25 '19 at 11:49
  • Oh I see, that is happening because you don't have enough memory to fit the whole array plus one chunk. – carobnodrvo Sep 25 '19 at 11:49
  • MemoryError: suppose I have merged 25 or so of the 30 1 GB files into one. When I then load that file in mmap_mode and try to append another 1 GB file, it runs out of memory. – Sayandip Dutta Sep 25 '19 at 11:52
  • But what if you used smaller chunks? For example, if your chunk is 0.1 GB it should work, and if not, just decrease the chunk size further. – carobnodrvo Sep 25 '19 at 11:54
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/199957/discussion-between-sayandip-dutta-and-carobnodrvo). – Sayandip Dutta Sep 25 '19 at 11:55
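A minimal sketch of carobnodrvo's sharding idea, assuming there is enough free disk space for a second copy of the data; the shard size, the shard_*.npy file names, and the shuffled.npy output name are illustrative choices, not anything from the post. Note that shuffling within shards and then shuffling the shard order only approximates a uniform shuffle, since rows never leave their original shard; a fully uniform permutation would first have to assign rows to shards at random.

import numpy as np

rng = np.random.default_rng(1)

src = np.load('data.npy', mmap_mode='r')      # read-only memory map; nothing is loaded yet
rows = src.shape[0]
shard_rows = 500_000                          # ~1 GB per shard for float64; tune so one shard fits in RAM

# Pass 1: copy one shard at a time into RAM, shuffle it, and write it out.
shard_files = []
for start in range(0, rows, shard_rows):
    chunk = np.array(src[start:start + shard_rows])   # materialise just this shard
    rng.shuffle(chunk)
    name = f'shard_{start}.npy'
    np.save(name, chunk)
    shard_files.append(name)

# Pass 2: shuffle the shard order and merge into one output file,
# never holding more than one shard in memory.
rng.shuffle(shard_files)
out = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)
pos = 0
for name in shard_files:
    chunk = np.load(name)
    out[pos:pos + len(chunk)] = chunk
    pos += len(chunk)
out.flush()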

1 Answer


Maybe not the best solution, but this is what I rely on: get an array of indices, shuffle it, and use it to fetch rows of the memory-mapped NumPy array in shuffled order. I assume that is better than waiting for 5 hours ;)

import numpy as np

# Read-only memory map: the data stays on disk
array = np.load('data.npy', mmap_mode='r')
rows = array.shape[0]

# Shuffle an index array instead of the data itself
indices = np.arange(rows)
np.random.seed(1)
np.random.shuffle(indices)

# Fetch rows in shuffled order, one at a time
for i in range(rows):
    print(array[indices[i]])
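If, as the question requires, all of the shuffled data has to end up in a single file, the same shuffled index array can drive a chunked copy into a new .npy file, so that only one chunk of rows is ever held in memory. This is only a sketch building on the answer above, not part of it; the chunk size and the shuffled.npy name are arbitrary choices, and np.lib.format.open_memmap is used to preallocate the output file on disk.

import numpy as np

array = np.load('data.npy', mmap_mode='r')     # read-only memory map of the source
rows = array.shape[0]

np.random.seed(1)
indices = np.random.permutation(rows)          # ~160 MB for 20M int64 indices

# Preallocate the output .npy on disk and fill it chunk by chunk.
out = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=array.dtype, shape=array.shape)
chunk = 100_000                                # rows per chunk; tune to available RAM
for start in range(0, rows, chunk):
    idx = indices[start:start + chunk]
    out[start:start + chunk] = array[idx]      # fancy indexing copies only these rows
out.flush()

The reads from the source file are still random-access, so this is I/O-bound and benefits from an SSD, but it never needs more than one chunk of rows plus the index array in memory.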
Praveen Kulkarni