I'd like to store 1 TB of random data in an on-disk Zarr array. Currently, I am doing something like the following:
import numpy as np
import zarr
from numcodecs import Blosc
compressor = Blosc(cname='lz4', clevel=5, shuffle=Blosc.BITSHUFFLE)
store = zarr.DirectoryStore('TB1.zarr')
root = zarr.group(store)
TB1 = root.zeros('data',
                 shape=(1_000_000, 1_000_000),
                 chunks=(20_000, 5_000),
                 compressor=compressor,
                 dtype='|i2')
for i in range(1_000_000):
    TB1[i, :1_000_000] = np.random.randint(0, 3, size=1_000_000, dtype='|i2')
This is going to take some time. I know it would probably be faster if I generated one array of 1_000_000 random numbers and reused it for every row instead of generating a new one each time, but I'd like more randomness for now. Is there a better way to go about building this random dataset?
Update 1
Using bigger numpy blocks speeds things up a bit:
for i in range(0, 1_000_000, 100_000):
    TB1[i:i+100_000, :1_000_000] = np.random.randint(0, 3, size=(100_000, 1_000_000), dtype='|i2')
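One direction I'm considering, as a rough sketch only: write blocks whose shape is a multiple of the chunk shape (20_000, 5_000), so that each write covers whole chunks and, as far as I understand, no chunk has to be read back and recompressed. The helper name fill_random_blocks and its parameters are my own, not part of zarr. The demo runs on a small NumPy array as a stand-in, on the assumption that the zarr array accepts the same 2-D slice assignment used above.

```python
import numpy as np

def fill_random_blocks(arr, block_rows, block_cols, low=0, high=3, seed=None):
    """Fill a 2-D array with random integers in [low, high), one block at a time.

    `arr` can be anything with .shape, .dtype and 2-D slice assignment
    (a NumPy array here; assumed to work the same way for a zarr array).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = arr.shape
    for r in range(0, n_rows, block_rows):
        r_end = min(r + block_rows, n_rows)      # clip the last row block
        for c in range(0, n_cols, block_cols):
            c_end = min(c + block_cols, n_cols)  # clip the last column block
            arr[r:r_end, c:c_end] = rng.integers(
                low, high, size=(r_end - r, c_end - c), dtype=arr.dtype
            )

# Small stand-in for the real array, just to exercise the loop.
demo = np.zeros((45, 25), dtype='i2')
fill_random_blocks(demo, block_rows=20, block_cols=10, seed=0)
```

For the real array this would be called as something like fill_random_blocks(TB1, 20_000, 50_000), where each block holds 20_000 * 50_000 int16 values, about 2 GB in memory. I have not timed it against the loops above.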