Extremely slow on np.recarray assignment

Question

I'm storing ticks in an ndarray. Each tick has a UTC timestamp (str) as its index and tick prices/volumes as values, so the array holds two different dtypes (str and float). This is how I store it as an np.recarray:

data = np.recarray((100,), dtype=[('time','U23'),('ask1','f'),('bid1','f')])
tick = ['2021-04-28T09:38:30.928',14.21,14.2]

# assigning this tick to the last row of data is weirdly slow
%%timeit
  ...: data[-1] = np.rec.array(tick)
  ...: 
1.38 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That took 1.38 ms per loop! Plus, I can't set the last row with data[-1] = tick, which raises ValueError: setting an array element with a sequence.

Let's try a plain ndarray instead. Say I have two separate arrays, one for str and one for float:

%%timeit
  ...: data[:,-1]=tick[1:]
  ...: 
15.2 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

See? That's about 90x faster! Why is that?

The point is that np.rec.array construction is very slow:
```
%%timeit
  ...: np.rec.array(tick)
  ...:
1.33 ms ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
  ...: np.rec.array(tick,dtype=dtype)
  ...:
919 µs ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
— aEgoist, Apr 28 '21 at 05:27

hpaulj · Answer 1 · 2021-04-28T15:31:49.230

My times are quite a bit better:

In [503]: timeit data[-1] = np.rec.array(tick)
64.4 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

np.rec.array(tick) creates a recarray with dtype=[('f0', '<U23'), ('f1', '<f8'), ('f2', '<f8')]. I get better speed if I use the final dtype:

In [504]: timeit data[-1] = np.rec.array(tick, data.dtype)
31.1 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
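
The dtype difference can be checked directly. The following is a minimal sketch, assuming the same tick and field layout as in the question; the names inferred, matched and target are only for illustration.

```python
import numpy as np

tick = ['2021-04-28T09:38:30.928', 14.21, 14.2]
target = np.dtype([('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')])

# Without a dtype, np.rec.array infers one from the Python values:
# default field names, and float64 for the Python floats.
inferred = np.rec.array(tick)
print(inferred.dtype.names)   # field names chosen by NumPy
print(inferred.dtype['f1'])   # dtype of the first float field

# With the target dtype, the record carries the destination's
# field names and types from the start.
matched = np.rec.array(tick, dtype=target)
print(matched.dtype.names)
print(matched.dtype['ask1'])
```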

The bulk of that time goes to creating the one-element recarray:

In [516]: %timeit x = np.rec.array(tick, data.dtype)
29.9 µs ± 41.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Making a structured array instead:

In [517]: %timeit x = np.array(tuple(tick), data.dtype)    
2.71 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [519]: timeit data[-1] = np.array(tuple(tick), data.dtype)
3.58 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So skipping recarray entirely:

In [521]: data = np.zeros((100,), dtype=[('time','U23'),('ask1','f'),('bid1','f')])
     ...: tick = ('2021-04-28T09:38:30.928',14.21,14.2)
In [522]: data[-1] = np.array(tick, data.dtype)
In [523]: data[-2:]
Out[523]: 
array([('',  0.  ,  0. ), ('2021-04-28T09:38:30.928', 14.21, 14.2)],
      dtype=[('time', '<U23'), ('ask1', '<f4'), ('bid1', '<f4')])

I think recarray has largely been replaced by structured arrays. The main thing recarray adds is the ability to address fields as attributes:

data.time, data.ask1          # attribute access, recarray only
data['time'], data['ask1']    # field indexing, works for both

Your example shows that recarray slows things down.
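
If attribute access is still wanted, one possible approach (a sketch, not something timed in this answer) is to do the writes on a plain structured array and take a recarray view of it for reading. The variable names dt and rec are assumptions for illustration.

```python
import numpy as np

dt = [('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')]
data = np.zeros((100,), dtype=dt)

# Write through the plain structured array.
data[-1] = ('2021-04-28T09:38:30.928', 14.21, 14.2)

# A recarray view shares the same memory and adds attribute access.
rec = data.view(np.recarray)
print(rec.time[-1], rec.ask1[-1])
print(np.shares_memory(rec, data))
```

Because rec is a view, later writes to data are visible through rec without copying.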

edit

The tuple tick can be assigned directly without extra conversion:

In [526]: timeit data[-1] = tick
365 ns ± 0.247 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
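
Note that the tick in the question was a list, while structured arrays take one record as a tuple. A minimal sketch of handling that, where set_row is a hypothetical helper name and not part of NumPy:

```python
import numpy as np

dt = [('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')]
data = np.zeros((100,), dtype=dt)

def set_row(arr, i, tick):
    """Hypothetical helper: store one tick in row i of a structured array."""
    # A tuple is treated as one record (one value per field);
    # tuple() makes that explicit when the tick arrives as a list.
    arr[i] = tuple(tick)

set_row(data, -1, ['2021-04-28T09:38:30.928', 14.21, 14.2])
print(data[-1])
```

If the ticks can be produced as tuples in the first place, the tuple() call is unnecessary and data[-1] = tick can be used as timed above.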

my actual float cols are much more, say from ('ask1', ' — aEgoist, Apr 28 '21 at 09:20
Assigning the tuple tick directly is faster. — hpaulj, Apr 28 '21 at 15:32
