
I am using genfromtxt to load large csv files into structured arrays. I need to sort the data (using multiple fields), do some work and then restore the data to the original ordering. My plan is to add another field to the data and put the row number into this field before the first sort is applied. It can then be used to revert the order at the end. I thought there might be an elegant way of adding this field of record numbers but after hours of trying and searching for ideas I have nothing particularly slick.

import numpy
import numpy.lib.recfunctions as rfn
def main():
    csvDataFile = 'C:\\File1.csv'
    csvData = numpy.genfromtxt(csvDataFile, delimiter=',',names = True, dtype='f8')
    rowNums = numpy.zeros(len(csvData),dtype=[('RowID','f8')])
    #populate and add column for RowID
    for i in range (0, len(csvData)):
        rowNums['RowID'][i]=i
    csvDataWithID = rfn.merge_arrays((csvData, rowNums), asrecarray=True, flatten=True)

The recfunctions.merge_arrays in particular is very slow and adding the row numbers one by one seems so old school. Your ideas would be gratefully received.

  • Is there a reason you're loading the csv and doing your work purely in `numpy` instead of using `pandas`? To me, `pandas` seems like the natural choice, and all the benchmarks I've seen have indicated that `pandas` has superior performance for csv parsing. – root Feb 19 '16 at 00:23
  • Thanks for the tip. I've not come across pandas before. I'll have a look... – smoke_and_mirrors Feb 19 '16 at 00:44
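A minimal sketch of the pandas route the comment suggests (not from the original thread; the column names and data here are made up). pandas keeps the original row positions in its index, so after sorting on multiple columns the original order comes back with a plain `sort_index`, with no extra RowID column:

```python
import io
import pandas as pd

# Small stand-in for the real CSV (hypothetical column names)
csv_text = "col_1,col_2\n4,2\n1,0\n0,1\n1,2\n3,1\n"
df = pd.read_csv(io.StringIO(csv_text))

# Sort on multiple columns; the index still records the original row order
df_sorted = df.sort_values(['col_2', 'col_1'])

# ... do the work on df_sorted ...

restored = df_sorted.sort_index()  # back to the original row order
```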

2 Answers


You don't need to fill rowNums iteratively:

In [93]: rowNums=np.zeros(10,dtype=[('RowID','f8')])
In [94]: for i in range(0,10):
   ....:     rowNums['RowID'][i]=i
   ....:     
In [95]: rowNums
Out[95]: 
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
       (8.0,), (9.0,)], 
      dtype=[('RowID', '<f8')])

Just assign the range values to the field:

In [96]: rowNums['RowID']=np.arange(10)
In [97]: rowNums
Out[97]: 
array([(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,), (7.0,),
       (8.0,), (9.0,)], 
      dtype=[('RowID', '<f8')])

rfn.merge_arrays shouldn't be that slow - unless csvData.dtype has a great number of fields. This function creates a new dtype that merges the fields of the 2 inputs, and then copies data field by field. For many rows, and just a few fields, that is quite fast.

But you should be able to get the original order back without adding this extra field.

A 2 field 1d array:

In [118]: x = np.array([(4,2),(1, 0), (0, 1),(1,2),(3,1)], dtype=[('x', '<i4'), ('y', '<i4')])
In [119]: i = np.argsort(x, order=('y','x'))
In [120]: i
Out[120]: array([1, 2, 4, 3, 0], dtype=int32)
In [121]: x[i]
Out[121]: 
array([(1, 0), (0, 1), (3, 1), (1, 2), (4, 2)], 
      dtype=[('x', '<i4'), ('y', '<i4')])

The same values are now sorted first on y, then on x.

In [122]: j=np.argsort(i)
In [123]: j
Out[123]: array([4, 0, 1, 3, 2], dtype=int32)
In [124]: x[i][j]
Out[124]: 
array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)], 
      dtype=[('x', '<i4'), ('y', '<i4')])

Back to the original order

I could have added a row index array to x, and then done a sort on that. But why add it; why not just apply i to a separate array:

In [127]: np.arange(5)[i]
Out[127]: array([1, 2, 4, 3, 0])

But sorting that is just the same as sorting i.
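As an aside (not in the original answer): for very long arrays, the inverse permutation `j` can also be built by direct assignment, which is O(n) rather than the O(n log n) of a second `argsort` and gives the identical result:

```python
import numpy as np

x = np.array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
             dtype=[('x', '<i4'), ('y', '<i4')])
i = np.argsort(x, order=('y', 'x'))  # forward permutation

# Inverse permutation by direct assignment: j[i[k]] = k.
# Same result as np.argsort(i), without the second sort.
j = np.empty_like(i)
j[i] = np.arange(len(i))

restored = x[i][j]  # back to the original order
```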


merge_arrays is doing essentially the following:

Union dtype:

In [139]: dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
In [140]: y=np.zeros((5,),dtype=dt)

fill in the values:

In [141]: y['RowID']=np.arange(5)
In [143]: for name in x.dtype.names:
    y[name]=x[name]
In [144]: y
Out[144]: 
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])

And to test my argsort of argsort idea:

In [145]: y[i]
Out[145]: 
array([(1.0, 1, 0), (2.0, 0, 1), (4.0, 3, 1), (3.0, 1, 2), (0.0, 4, 2)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [146]: np.argsort(y[i],order=('RowID'))
Out[146]: array([4, 0, 1, 3, 2], dtype=int32)
In [147]: j
Out[147]: array([4, 0, 1, 3, 2], dtype=int32)

Sorting on the reordered RowID is the same as sorting on i.


Curiously merge_arrays is quite a bit slower than my reconstruction:

In [163]: rfn.merge_arrays([rowNums,x],flatten=True)
Out[163]: 
array([(0.0, 4, 2), (1.0, 1, 0), (2.0, 0, 1), (3.0, 1, 2), (4.0, 3, 1)], 
      dtype=[('RowID', '<f8'), ('x', '<i4'), ('y', '<i4')])
In [164]: timeit rfn.merge_arrays([rowNums,x],flatten=True)
10000 loops, best of 3: 161 µs per loop

In [165]: %%timeit 
dt=np.dtype(rowNums.dtype.descr+x.dtype.descr)
y=np.zeros((5,),dtype=dt)
y['RowID']=rowNums['RowID']
for name in x.dtype.names:
    y[name]=x[name]
10000 loops, best of 3: 38.4 µs per loop
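The manual merge above generalizes into a small helper (a sketch of my own, not part of the original answer; the function name is made up):

```python
import numpy as np

def add_rowid(a, name='RowID'):
    """Return a copy of structured array `a` with a row-number field prepended."""
    dt = np.dtype([(name, '<i8')] + a.dtype.descr)
    out = np.zeros(a.shape, dtype=dt)
    out[name] = np.arange(len(a))
    for f in a.dtype.names:       # copy the original fields one by one
        out[f] = a[f]
    return out

x = np.array([(4, 2), (1, 0)], dtype=[('x', '<i4'), ('y', '<i4')])
y = add_rowid(x)
```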
hpaulj
rowNums = np.zeros(len(csvData),dtype=[('RowID','f8')])
rowNums['RowID']=np.arange(len(csvData))

The above saves approx half a second per file with the csv files I am using. Very good so far.

However the key thing was how to efficiently obtain a record of the sort order. This is most elegantly solved using:

sortorder = np.argsort(csvData, order=('col_1','col_2','col_3','col_4','col_5'))

giving an array of indices that sorts csvData by cols 1 through 5. This removes the need to make, populate and merge a RowID column, saving me around 15s per csv file (over 6hrs across my entire dataset).
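Putting the pieces together, the whole sort / work / restore cycle reduces to the following sketch (hypothetical field names and toy data standing in for the real csv):

```python
import numpy as np

# Toy stand-in for csvData with hypothetical field names
csvData = np.array([(4, 2), (1, 0), (0, 1), (1, 2), (3, 1)],
                   dtype=[('col_1', '<i4'), ('col_2', '<i4')])

sortorder = np.argsort(csvData, order=('col_1', 'col_2'))
sortedData = csvData[sortorder]

# ... do the work on sortedData ...

restored = sortedData[np.argsort(sortorder)]  # back to the original order
```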

Thank you very much @hpaulj