Reading numpy array from file and parsing very slow

Question

I have a binary file and I am parsing it to a numpy array in Python like the following:

bytestream = np.fromfile(path, dtype=np.int16)

for a in range(sizeA):
    for x in range(0, sizeX):
        for y in range(0, sizeY):
            for z in range(0, sizeZ):
                parsed[a, x, y, z] = bytestream[z + (sizeZ * x) + (sizeZ * sizeX * y) + (sizeZ * sizeX * sizeY * a)]

However, this is very very slow. Can anyone tell me why and how to speed it up?

Zaw Lin · Accepted Answer · 2017-11-21T10:23:17.307

You seem to have made a mistake in your code: I believe x and y should be swapped in (sizeZ * x) + (sizeZ * sizeX * y), assuming row-major ordering. In any case, the code below verifies that reshape does what you want. The reason your version is slow is the nested for loops.

In Python, a for loop is a complicated construct with significant per-iteration overhead. In most cases you should therefore avoid explicit for loops and use library functions instead (these also loop, but in compiled C/C++ code). "Removing the for loop" is a common question about numpy, because most people first implement an algorithm they know in the most straightforward way (convolution or max pooling, for example), then realize it is very slow and look for alternatives based on the numpy API, where the bulk of the computation happens in compiled code rather than in Python.

import numpy as np

# generate some test data
arr = (np.random.random((4, 4, 4, 4)) * 10).astype(np.int16)
arr.tofile('test.bin')

# original code, with the index corrected for row-major (C) order
bytestream = np.fromfile('test.bin', dtype=np.int16)
parsed = np.zeros(arr.shape, dtype=np.int16)
sizeA, sizeX, sizeY, sizeZ = arr.shape
for a in range(sizeA):
    for x in range(sizeX):
        for y in range(sizeY):
            for z in range(sizeZ):
                # the x stride is sizeZ * sizeY (equal to sizeZ * sizeX here, since the array is cubic)
                parsed[a, x, y, z] = bytestream[z + (sizeZ * y) + (sizeZ * sizeY * x) + (sizeZ * sizeX * sizeY * a)]

# both checks print True: the loop and reshape produce the same array
print(np.allclose(arr, parsed))
print(np.allclose(arr, bytestream.reshape((sizeA, sizeX, sizeY, sizeZ))))
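To get a rough feel for the overhead described above, the two approaches can be timed side by side. This is only a sketch: the helper names parse_with_loops and parse_with_reshape, the array shape, and the use of an in-memory array instead of a file are all assumptions made for illustration, and the timings will vary by machine.

```python
import time

import numpy as np


def parse_with_loops(flat, shape):
    """Element-by-element copy using row-major (C order) indexing."""
    size_a, size_x, size_y, size_z = shape
    parsed = np.zeros(shape, dtype=flat.dtype)
    for a in range(size_a):
        for x in range(size_x):
            for y in range(size_y):
                for z in range(size_z):
                    # Flat index of element [a, x, y, z] in C order.
                    parsed[a, x, y, z] = flat[
                        z + size_z * (y + size_y * (x + size_x * a))
                    ]
    return parsed


def parse_with_reshape(flat, shape):
    """Same result without a Python-level loop; reshape returns a view here."""
    return flat.reshape(shape)


# Hypothetical, deliberately non-cubic shape so that a wrong stride would show up.
shape = (4, 6, 8, 10)
flat = np.arange(np.prod(shape), dtype=np.int16)

start = time.perf_counter()
looped = parse_with_loops(flat, shape)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
reshaped = parse_with_reshape(flat, shape)
reshape_seconds = time.perf_counter() - start

print(np.array_equal(looped, reshaped))
print(f"loops: {loop_seconds:.6f} s, reshape: {reshape_seconds:.6f} s")
```

The loop version pays the interpreter cost once per element, whereas reshape only changes the shape metadata, so the gap should widen as the array grows.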

That did not answer my question of why the code is slow, but I think I should use reshape, since it makes my code a lot faster. – Nov 21 '17 at 10:16

score 0 · Answer 2 · answered Nov 22 '17 at 09:15

You're updating the numpy array parsed one cell at a time, bouncing back and forth between Python and the C implementation of numpy for each cell. This is a serious overhead (not to mention the cost of updating the Python variables a, x, y and z at each iteration, as Zaw Lin said, and the cost of computing the index).

Use numpy.copy, numpy.reshape, and numpy.moveaxis instead, so that numpy updates as many values as it can in one batch each time its C code runs.
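As one possible illustration of combining these calls, suppose the file really is laid out the way the index formula in the question implies, that is, in (a, y, x, z) order, while the result should be indexed as [a, x, y, z]. The function name parse_swapped_axes and the small test shape are hypothetical, and the on-disk layout is an assumption rather than something confirmed by the question.

```python
import numpy as np


def parse_swapped_axes(flat, size_a, size_x, size_y, size_z):
    """Turn a flat buffer stored in (a, y, x, z) row-major order
    into a contiguous array indexed as [a, x, y, z]."""
    # Interpret the flat data with the axis order it is assumed to have on disk.
    on_disk = flat.reshape(size_a, size_y, size_x, size_z)
    # Move the y axis (position 1) to position 2; this returns a view, no data moves yet.
    swapped = np.moveaxis(on_disk, 1, 2)
    # One bulk copy, done in compiled code, into a C-contiguous array.
    return swapped.copy(order="C")


# Small non-cubic example with made-up sizes.
size_a, size_x, size_y, size_z = 2, 3, 4, 5
flat = np.arange(size_a * size_x * size_y * size_z, dtype=np.int16)
result = parse_swapped_axes(flat, size_a, size_x, size_y, size_z)
print(result.shape)
```

Here reshape and moveaxis only change how the buffer is viewed, and the single copy call does all the element movement in one batch.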

Thanks for the additional helpful comment! – Nov 22 '17 at 12:28
