
I was able to copy my recarray data to an ndarray, do some calculations, and return the ndarray with updated values.

Then, I discovered the `append_fields()` capability in `numpy.lib.recfunctions`, and thought it would be a lot smarter to simply append 2 fields to my original recarray to hold my calculated values.

When I did this, I found the operation was much, much slower. I didn't even have to time it: the ndarray-based process takes a few seconds, compared to a minute+ with the recarray, and my test arrays are small (<10,000 rows).

Is this typical? Is ndarray access much faster than recarray access? I expected some performance degradation due to access by field name, but not this much.

kcw78
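
For reference, here is a minimal sketch of the two approaches being compared; the dtype, field names, and calculation are assumptions for illustration, not the actual data:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical recarray standing in for the PyTables result.
rec = np.zeros(10000, dtype=[('ID', 'S10'), ('x', 'f8'), ('y', 'f8')]).view(np.recarray)

# Approach 1: copy the needed fields to a plain ndarray and calculate there.
vals = np.column_stack([rec.x, rec.y])
calc = vals[:, 0] * vals[:, 1]          # stand-in for the real calculation

# Approach 2: append two fields to the original recarray to hold the results.
# Note that append_fields defaults to usemask=True, which returns a
# *masked* array rather than a plain recarray.
rec2 = rfn.append_fields(rec, ['calc1', 'calc2'],
                         [np.zeros(rec.size), np.zeros(rec.size)],
                         usemask=False, asrecarray=True)
rec2['calc1'] = rec2['x'] * rec2['y']
```

Given the timings in the answer below, that `usemask=True` default is worth noting: the masked array was by far the slowest of the four array types tested.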
  • It's unclear which operations you are comparing. `append_fields` has to define a new array and copy data from the sources, field by field. What size of arrays are you dealing with? Shape? dtype? – hpaulj Nov 03 '18 at 01:21
  • As hpaulj said, `append_fields` creates a new array and copies data over, so it will be rather slow. In most cases it's better to just return the calculated values. – user2699 Nov 03 '18 at 02:17
  • `recarray/structured array` is convenient when you want to keep a mixed set of values together - for indexing, sorting, saving and loading from files. Most users first encounter them when loading `csv` files. But for calculations across fields, or adding/removing fields, they aren't anything special. Separate arrays with simple numeric dtype are faster, and just as memory efficient. – hpaulj Nov 03 '18 at 02:48
  • Thanks for the feedback. First some background: I use PyTables to extract data from an HDF5 file, and PyTables returns a recarray whenever a table has dissimilar types. – kcw78 Nov 05 '18 at 02:08
  • My data is in 5 different HDF5 datasets/tables (w/ 2 different dtypes). Here is a typical dtype example: dtype([('ID', '… – kcw78 Nov 05 '18 at 02:30
  • I pull 12 different values from each row to calculate 2 new values that are then saved back to the array. I append 2 `np.zeros` arrays to the recarray prior to any calculations. The difference in processing time occurs when extracting values, calculating new values and saving them. There are 36,000 rows. When referencing an ndarray, they are processed in a few seconds. When referencing a recarray, they are processed in a few minutes. – kcw78 Nov 05 '18 at 02:31
  • Question: I'm relatively new here. What did I do to get a -1? – kcw78 Nov 09 '18 at 18:49
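
To illustrate hpaulj's suggestion above, here is a minimal sketch (with made-up field names and a stand-in calculation) of pulling fields out into plain numeric arrays before doing the heavy computation:

```python
import numpy as np

# Hypothetical recarray standing in for the PyTables table.
rec = np.zeros(36000, dtype=[('ID', 'S10'), ('a', 'f8'), ('b', 'f8')]).view(np.recarray)

# Extract the needed fields into contiguous float ndarrays once...
a = np.ascontiguousarray(rec['a'])
b = np.ascontiguousarray(rec['b'])

# ...then do the per-row math as fast, vectorized ndarray operations.
calc1 = a * b
calc2 = a + 0.5 * b
```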

1 Answer


Updated 15-November-2018
I expanded my timing tests to clarify the performance differences between ndarray, structured array, recarray, and masked array (a type of record array?). There are subtle differences in each. See the discussion here:
numpy-discussion:structured-arrays-recarrays-and-record-arrays

Here are the results of my performance tests. I built a very simple example (using 1 of my HDF5 data sets) to compare performance with the same data stored in 4 types of arrays: ndarray, structured array, recarray and masked array. After the arrays are constructed, they are passed to a function that simply loops through each row and extracts 12 values from each row. The functions are called from the timeit function with a single pass (number=1). This test only measures the array read function, and avoids all other calculations.
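
A minimal sketch of the kind of read test described above (the dtype, field names, and data are assumptions, not the actual HDF5 contents):

```python
import numpy as np
import numpy.ma as ma
import timeit

NROWS = 9000
dt = [('f%d' % i, 'f8') for i in range(12)]
fields = [name for name, _ in dt]

nd = np.random.rand(NROWS, 12)           # plain ndarray
sa = np.zeros(NROWS, dtype=dt)           # structured array
ra = sa.view(np.recarray)                # recarray view of the same data
marr = ma.masked_array(sa)               # masked array wrapper

def read_rows(arr, keys):
    # Loop through each row and extract 12 values, as in the test above.
    for row in arr:
        vals = [row[k] for k in keys]

print(timeit.timeit(lambda: read_rows(nd, range(12)), number=1))
print(timeit.timeit(lambda: read_rows(sa, fields), number=1))
print(timeit.timeit(lambda: read_rows(ra, fields), number=1))
print(timeit.timeit(lambda: read_rows(marr, fields), number=1))
```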
Results given below for 9,000 rows:

for ndarray: 0.034137165047070615
for structured array: 0.1306827116913577
for recarray: 0.446010040784266
for masked array: 31.33269560998199

Based on this test, access performance decreases with each successive type. Access times for structured array and recarray are 4x-13x slower than ndarray access (but all are only a fraction of a second). However, ndarray access is nearly 1000x faster than masked array access. That explains the seconds-to-minutes difference I see in my complete example. Hopefully this data is useful to others who encounter this issue.

kcw78