Python: Why does copying a Dask slice to a NumPy array result in a row count mismatch?

Question

I am having a problem copying a slice of a dask array to a NumPy array: the number of rows doesn't match.

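import h5py
import numpy as np
import dask.array as da

store = h5py.File(s_file_path + '.hdf5', 'r')
dset = store['data_matrix']
data_matrix = da.from_array(dset, chunks=dset.chunks)
test_set = data_matrix[482:, :]  # rows 482..655, so 174 rows are expected
np_test_set = np.array(test_set, order='F')  # 'F' = Fortran (column-major) order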

print "source_set shape: ", data_matrix.shape
print "test_set shape: ", test_set.shape
print "np_test_set shape: ", np_test_set.shape

results:

source_set shape:  (656, 473034)
test_set shape:  (174, 473034)
np_test_set shape:  (195, 473034)

I am not very familiar with dask; I am using it because my data doesn't fit in RAM. Is the row difference related to caching or to the chunk size?

score 3 · Answer 1 · answered Dec 24 '15 at 19:40

Typical ways to convert to a NumPy array

You can convert a dask.array to a NumPy array by calling the .compute method:

np_test_set = test_set.compute()

or by calling np.asarray:

np_test_set = np.asarray(test_set)

Fortran ordering

In principle, what you're doing now should work as well, so this may be a bug. The only atypical part is specifying the Fortran order up front. It would be interesting to see whether dropping the order argument changes the result.
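One way to take the order argument out of the picture is to convert first and change the memory layout afterwards in NumPy. This is only a minimal sketch: a small in-memory array stands in for the HDF5 dataset, and the shape, chunk sizes and slice start are made up for illustration.

```python
import numpy as np
import dask.array as da

# Stand-in for the HDF5-backed dataset; shape and chunks are illustrative only.
source = np.arange(20 * 6).reshape(20, 6)
data_matrix = da.from_array(source, chunks=(7, 4))

# Lazy slice, analogous to data_matrix[482:, :] in the question.
test_set = data_matrix[12:, :]

# Convert with .compute() first, then switch to Fortran (column-major) layout.
np_test_set = np.asfortranarray(test_set.compute())

print("test_set shape: ", test_set.shape)
print("np_test_set shape: ", np_test_set.shape)
print("Fortran contiguous: ", np_test_set.flags['F_CONTIGUOUS'])
```

If the two shapes agree here but not on the real data, that points at the chunking of the real dataset rather than at the ordering.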

Additional information

If this is a genuine bug (as it appears it may be), then it would be good to raise an issue. It would also be useful to see the chunks of the dask.array, which are available through its .chunks attribute.
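The chunk layout can be printed directly from the .chunks attribute. The sketch below again uses a small in-memory array with made-up shape and chunk sizes in place of the real dataset; the point is only that the row chunks of a slice should add up to its row count.

```python
import numpy as np
import dask.array as da

# Illustrative stand-in for the real dataset.
source = np.zeros((20, 6))
data_matrix = da.from_array(source, chunks=(7, 4))
test_set = data_matrix[12:, :]

# .chunks is a tuple of tuples: one tuple of block sizes per dimension.
print("source chunks: ", data_matrix.chunks)
print("slice chunks: ", test_set.chunks)

# For a consistent array, the block sizes along axis 0 sum to the row count.
row_total = sum(test_set.chunks[0])
print("rows from chunks: ", row_total, " rows from shape: ", test_set.shape[0])
```

Including this output in a bug report makes it much easier to see where the extra rows come from.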

MRocklin


I was using Fortran ordering because a scikit ML function requires it. I just reproduced my problem with both np_test_set = test_set.compute() and np_test_set = np.asarray(test_set). The chunks are a little strange, and I can't display them all in this window, but they look like ((65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 6),(47, 47, 47, (many) , 47, 47, 47, 47, 26)). Thanks for your help and happy Christmas – user1946989 Dec 25 '15 at 09:06
I recommend raising an issue. If you can supply a reproducible example that will make it easier to track down what's going on. – MRocklin Dec 25 '15 at 19:47

score 0 · Accepted Answer · answered Dec 25 '15 at 09:22

I changed the chunks to (10, 500) and now it seems to work:

data_matrix = da.from_array(dset, chunks=(10,500))
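To confirm that a given chunking really fixes the problem, one option is to compare the lazy shape with the shape of the computed result. This is a sketch under assumptions: to_numpy_checked is a hypothetical helper, not part of dask, and the in-memory array and (10, 5) chunks are illustrative stand-ins for the HDF5 dataset and the chunks above.

```python
import numpy as np
import dask.array as da


def to_numpy_checked(arr):
    """Hypothetical helper: compute a dask array and verify its shape."""
    result = arr.compute()
    if result.shape != arr.shape:
        raise ValueError(
            "shape mismatch: expected %s, got %s" % (arr.shape, result.shape)
        )
    return result


# Illustrative stand-in for the real dataset, with explicit chunks.
source = np.arange(30 * 8).reshape(30, 8)
data_matrix = da.from_array(source, chunks=(10, 5))

np_test_set = to_numpy_checked(data_matrix[22:, :])
print("np_test_set shape: ", np_test_set.shape)
```

A check like this turns a silent row count mismatch into an immediate error instead of letting it reach the downstream ML code.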

user1946989
