I have a huge .csv file (~2GB) that I import in my program with read_csv and then convert to a NumPy matrix with as_matrix. The generated matrix has the same form as data_mat in the example below. My problem is that I need to extract the blocks of rows with the same uuid4 (the entry in the first column of the matrix). The submatrices are then processed by another function. My example below does not seem to be the best way of doing this. Faster methods are welcome.
import numpy as np
data_mat = np.array([['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 4, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 1, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 3, 3, 1],\
['f9f1dc71-9457-4d17-b5d1-e63b5a766f84', 6, 1, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 4, 1],\
['f35fb25b-dddc-458a-9f71-0a9c2c202719', 3, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 2, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 9, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 0],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 5, 1, 1],\
['a4cf92fc-0624-4a00-97f6-0d21547e3183', 3, 1, 1],\
['d3a8a9d0-4380-42e3-b35f-733a9f9770da', 3, 6, 10]],dtype=object)
unique_ids, indices = np.unique(data_mat[:,0],return_index=True,axis=None)
length = len(data_mat)
i=0
for idd in unique_ids:
    index = indices[i]
    k=0
    while ((index+k)<length and idd == data_mat[index+k,0]):
        k+=1
    tmp_mat=data_mat[index:(index+k),:]
    # do something with tmp_mat ...
    print(tmp_mat)
    i+=1
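
One possible way to avoid the Python-level while loop, sketched here under the assumption that rows with the same uuid4 are always contiguous (as in the example above): compare the first column with itself shifted by one row to find where a new block starts, then cut the matrix at those positions with np.split. The helper name split_contiguous_blocks and the short ids 'a', 'b', 'c' are made up for illustration. Unlike the np.unique version, this yields the blocks in file order rather than sorted by id, and an id that reappears later would produce a second block.

```python
import numpy as np

def split_contiguous_blocks(mat):
    """Split mat into runs of consecutive rows that share the same first-column value."""
    keys = mat[:, 0]
    # A new block starts wherever the key differs from the key of the previous row.
    boundaries = np.flatnonzero(keys[1:] != keys[:-1]) + 1
    # np.split cuts along axis 0 at the given row indices and returns a list of views.
    return np.split(mat, boundaries)

# Small stand-in for data_mat; the ids are hypothetical placeholders.
demo = np.array([['a', 4, 3, 1],
                 ['a', 3, 1, 1],
                 ['b', 3, 4, 1],
                 ['c', 3, 6, 10]], dtype=object)

blocks = split_contiguous_blocks(demo)
for tmp_mat in blocks:
    # do something with tmp_mat ...
    print(tmp_mat)
```

The comparison and the split run inside NumPy, so the per-row Python overhead goes away; whether that is fast enough for a 2GB file is something to measure on the real data.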