
I am trying to import a 1.25 GB dataset into Python using dask.array.

The file is a 1312x2500x196 array of uint16 values. I need to convert it to a float32 array for later processing.

I have managed to stitch together this dask array in uint16; however, when I try to convert it to float32, I get a memory error.

No matter what I do to the chunk size, I always get a memory error.

I create the array by concatenating the file in slices of 100 lines (breaking the 2500 dimension up into pieces of 100 lines each). Since dask can't natively read .RAW imaging files, I use numpy.memmap() to read the file and then build the dask array from the memmapped slices. Below is an as-short-as-possible code snippet.
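Roughly, each slice is mapped like this (a simplified sketch; the filename and the offset arithmetic are placeholders rather than my exact code):

import numpy as np

# each 1312x100x196 slice of uint16 (2 bytes per value) occupies this many bytes
bytes_per_slice = 1312 * 100 * 196 * 2

# map the i-th block of 100 lines without reading it into RAM
Memmap = np.memmap("data.raw", dtype=np.uint16, mode="r",
                   shape=(1312, 100, 196), offset=i * bytes_per_slice)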

I have tried two methods:

1) Create the full uint16 array and then try to convert to float32:

(Note: each memmap slice is a 1312x100x196 array, and lines is 25, so i ranges from 0 to 24.)

OldArray = Memmap                      # the first 1312x100x196 memmap slice
for i in range(1, lines):
    # Memmap stands for the i-th memmap slice (creation omitted for brevity)
    NewArray = da.concatenate([OldArray, Memmap], axis=1)  # grow the 2500-line axis
    OldArray = NewArray
FinalArray = NewArray

and then I use:

Float32Array = FinalArray.map_blocks(lambda block: block.astype(np.float32), dtype=np.float32)

2) Convert each memmap slice to float32 before concatenating:

OldArray = np.float32(Memmap)          # eagerly casts the slice to an in-memory float32 array
for i in range(1, lines):
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=1)
    OldArray = NewArray
FinalArray = NewArray

Both methods result in a memory error.

Is there any reason for this?

I have read that dask.array is capable of handling calculations on datasets of up to 100 GB.

I have tried chunk sizes ranging from as small as 10x10x10 up to a single line, and none of them makes a difference.

1 Answer

You can create a dask.array from a numpy memmap array directly with the da.from_array function:

x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
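For a file like the one described in the question, that could look roughly like this (a sketch; the filename and chunk choice are illustrative):

import numpy as np
import dask.array as da

# map the whole 1312x2500x196 uint16 file lazily; nothing is read yet
x = np.memmap("data.raw", dtype=np.uint16, mode="r", shape=(1312, 2500, 196))

# one chunk per 100 lines along the second axis, matching the original slicing
d = da.from_array(x, chunks=(1312, 100, 196))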

You can change the dtype with the astype method:

d = d.astype(np.float32)
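The cast is lazy, so no data is read until you compute or store a result. For example (the reduction here is just illustrative):

result = d.mean().compute()  # streams through the file chunk by chunk

Materializing the full float32 array at once would need roughly 2.5 GB of RAM, so reducing or storing the result chunk by chunk (for example with da.to_hdf5) avoids holding everything in memory.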
MRocklin