
I have a huge .csv file (~2GB) that I import into my program with `read_csv` and then convert to a numpy matrix with `as_matrix`. The generated matrix has a form like `data_mat` in the example below. My problem is that I now need to extract the blocks with the same uuid4 (the entry in the first column of the matrix). The submatrices are then processed by another function. It seems that my example below is not the best way of doing this. Faster methods are welcome.

import numpy as np
data_mat = np.array([['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 4, 3, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 1, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 3, 1],
                     ['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 6, 1, 1],
                     ['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 4, 1],
                     ['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 1, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 2, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 9, 0],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 0],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 5, 1, 1],
                     ['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 1],
                     ['d3a8a9d0-4380-42e3-b35f-733a9f9770da', 3, 6, 10]], dtype=object)

unique_ids, indices = np.unique(data_mat[:,0], return_index=True, axis=None)
length = len(data_mat)
i = 0
for idd in unique_ids:
    index = indices[i]   # first row of this id's block
    k = 0
    # scan forward until the id changes or the end of the matrix is reached
    while (index+k) < length and idd == data_mat[index+k,0]:
        k += 1
    tmp_mat = data_mat[index:(index+k),:]
    # do something with tmp_mat ...
    print(tmp_mat)
    i += 1
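
For context, the `read_csv`/`as_matrix` import step mentioned at the top would look something like the sketch below (the file name and the no-header assumption are hypothetical; note also that `as_matrix` was later deprecated in pandas in favour of `.values`):

import pandas as pd

# hypothetical file name and layout: uuid in the first column, no header row
df = pd.read_csv('data.csv', header=None)
data_mat = df.as_matrix()  # object-dtype numpy array like the example above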
nerdizzle
  • Probably the best approach is doing it within pandas, which is highly optimized for this. [groupby](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) is what you want (see the sketch after these comments). Also see [this](https://github.com/numpy/numpy/issues/7265) for some discussion about numpy's missing functionality here (and potentially praising pandas). – sascha Sep 03 '17 at 13:09
  • are the unique ids always grouped together like that (i.e sorted by `data_mat[:,0]` or otherwise)? – Daniel F Sep 03 '17 at 13:10
  • Yes, the unique ids are always grouped together like that. I should have mentioned that. – nerdizzle Sep 03 '17 at 13:25
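
A minimal sketch of the pandas `groupby` route suggested in the comments, assuming we keep the DataFrame from `read_csv` instead of (or alongside) the numpy matrix; here the DataFrame is rebuilt from data_mat purely for illustration:

import pandas as pd

df = pd.DataFrame(data_mat)                   # columns are labelled 0..3
for uid, group in df.groupby(0, sort=False):  # group by the uuid column
    tmp_mat = group.values                    # submatrix for this uuid
    # do something with tmp_mat ...
    print(tmp_mat)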

3 Answers


To optimize, the idea would be to minimize the computations once we are inside the loop. So, with that in mind, we would rearrange the rows of the array, sorting them by the first column. Then, get the indices that define the boundaries between groups. Finally, start our loop and simply slice out each group to get a submatrix at each iteration. Slicing is virtually free when working with arrays, so that should help us.

Thus, one implementation would be -

a0 = data_mat[:,0]
sidx = a0.argsort()  # sort order of the first column
sd = data_mat[sidx]  # sorted data_mat
# group boundaries: start, every index where the id changes, and the end
idx = np.flatnonzero(np.concatenate(( [True], sd[1:,0] != sd[:-1,0], [True] )))
for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = sd[i:j]
    print(tmp_mat)

If you are looking to store each submatrix as an array to have a list of arrays as the final output, simply do -

[sd[i:j] for i,j in zip(idx[:-1], idx[1:])]

For sorted data_mat

For a case with data_mat already being sorted as shown in the sample, we could avoid sorting the entire array and directly use the first column, like so -

a0 = data_mat[:,0]
# group boundaries: start, every index where the id changes, and the end
idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))

for i,j in zip(idx[:-1], idx[1:]):
    tmp_mat = data_mat[i:j]
    print(tmp_mat)

Again, to get all those submatrices as a list of arrays, use -

[data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]

Note that the submatrices we would get with this one would come out in a different order than with the sorting done in the previous approach.


Benchmarking for sorted data_mat

Approaches -

# @Daniel F's soln-2
def split_app(data_mat):
    idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
    return np.split(data_mat, idx)

# Proposed in this post
def zip_app(data_mat):
    a0 = data_mat[:,0]
    idx = np.flatnonzero(np.concatenate(( [True], a0[1:] != a0[:-1], [True] )))
    return [data_mat[i:j] for i,j in zip(idx[:-1], idx[1:])]

Timings -

In the sample, each id repeated only a handful of times. So, let's extend to a bigger case that keeps a similar pattern, with about six rows per group on average -

In [442]: a = np.random.randint(0,100000,(6*100000,4)); a[:,0].sort()

In [443]: %timeit split_app(a)
10 loops, best of 3: 88.8 ms per loop

In [444]: %timeit zip_app(a)
10 loops, best of 3: 40.2 ms per loop

In [445]: a = np.random.randint(0,1000000,(6*1000000,4)); a[:,0].sort()

In [446]: %timeit split_app(a)
1 loop, best of 3: 917 ms per loop

In [447]: %timeit zip_app(a)
1 loop, best of 3: 414 ms per loop
Divakar

You can do this with boolean indexing.

unique_ids = np.unique(data_mat[:, 0])
# one boolean row-mask per unique id, shape (n_ids, n_rows)
masks = np.equal.outer(unique_ids, data_mat[:, 0])
for mask in masks:
    tmp_mat = data_mat[mask]
    # do something with tmp_mat ...
    print(tmp_mat)
Daniel F

If the unique ids are already grouped together, you could do this with `np.split`, similar to @Divakar

# positions where the id changes from one row to the next
idx = np.flatnonzero(data_mat[1:, 0] != data_mat[:-1, 0]) + 1
for tmp_mat in np.split(data_mat, idx):
    # do something with tmp_mat ...
    print(tmp_mat)
Daniel F
  • [Here's](https://stackoverflow.com/a/43259123/) why you might want to avoid `np.split` for performance. – Divakar Sep 03 '17 at 13:39
  • My timings show this to be about as fast as yours, @Divakar. No concatenation might help – Daniel F Sep 03 '17 at 13:43
  • Not comparing mine vs yours :) Just stating in general why one might want to avoid `split`. On the comparison though, I guess we need a bigger dataset than the sample. – Divakar Sep 03 '17 at 13:52
  • If you were interested in the timings, I added benchmarking to my post. You would need far fewer groupings for this concatenation-free `np.split` approach to work better. – Divakar Sep 03 '17 at 14:09