
Is there a faster/smarter way to perform operations on every element of a numpy array? What I specifically have is an array of datetime objects, e.g.:

import datetime as dt
import numpy as np

hh = np.array( [ dt.date(2000, 1, 1), dt.date(2001, 1, 1) ] )

To get a list of years from that I currently do:

years = np.array( [ x.year for x in hh ] )

Is there a smarter way to do this? I'm thinking something like

hh.year

which obviously doesn't work.

I have a script in which I constantly need different variations (year, month, hours...) of a (much longer) array like this. Of course I could always just define a separate array for each, but it feels like there should be a more elegant solution.

Lukas
  • Maybe use pandas's datetime64? Check the answer to this: http://stackoverflow.com/questions/13648774/get-year-month-or-day-from-numpy-datetime64 – ojy Aug 25 '14 at 22:34
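For reference, a minimal sketch of the datetime64 route the comment above points to, assuming the dates can be stored as numpy datetime64 values instead of Python date objects (the example data here is made up):

import numpy as np

# Hypothetical example data, stored as datetime64 instead of datetime.date objects
hh64 = np.array(['2000-01-01', '2001-01-01'], dtype='datetime64[D]')

# Casting to year precision and then to int gives years relative to 1970
years = hh64.astype('datetime64[Y]').astype(int) + 1970
print(years)  # [2000 2001]

With datetime64 the year extraction stays inside numpy, so no per-element Python call is needed.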

2 Answers

4

If you evaluate a Python expression for each element, it doesn't really matter whether the iteration itself is done in C or in Python: what dominates is the cost of the Python expression evaluated inside the loop. Even if that in-loop expression takes only a microsecond (i.e. it's very simple), it still outweighs the difference between a Python-level and a C-level iteration, because every call has to be "marshalled" between C and Python objects (and that applies to Python functions as well).

For that reason, vectorize is, under the hood, done in Python: what gets called inside the loop is Python code. The idea behind vectorize is not performance but code readability and ease of iteration: it inspects the function's parameters and handles N-dimensional iteration and broadcasting for you (e.g. a lambda x, y: x + y automatically works across two dimensions).
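As a rough illustration of that point (the names here are made up for the example, not taken from the question):

import numpy as np

# vectorize wraps a plain Python lambda; the lambda is still called once per element
add = np.vectorize(lambda x, y: x + y)

a = np.arange(3)             # shape (3,)
b = np.arange(3)[:, None]    # shape (3, 1)

# vectorize takes care of the two-dimensional broadcasting for you,
# which is the readability win, not a speed win
print(add(a, b))             # shape (3, 3)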

So no: there's no "fast" way to iterate over Python code. The speed that ultimately matters is the speed of your inner Python code.

Edit: your desired hh.year looks like the equivalent of hh*.year in Groovy, but even there, under the hood, it's the same per-element iteration. Comprehensions are the fastest (and equivalent) way to do this in Python. The real pity is being forced to write:

years = np.array( [ x.year for x in hh ] )

(which forces you to build another, possibly huge, intermediate list) instead of letting you pass any kind of iterator:

years = np.array( x.year for x in hh )

Edit (suggestion by @Jaime): you can't construct an array from an iterator with np.array. For that, use np.fromiter:

np.fromiter((x.year for x in hh), dtype=int, count=len(hh))

which saves you the time and memory of building an intermediate list. Passing count like this works for any sequence whose length is known up front (your case here); for other kinds of generators, where the length isn't known in advance, you can simply omit count.
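A small sketch of both cases, assuming hh is the array from the question:

import datetime as dt
import numpy as np

hh = np.array([dt.date(2000, 1, 1), dt.date(2001, 1, 1)])

# hh is a sequence, so its length is known up front and can be passed as count,
# letting fromiter preallocate the output array
years = np.fromiter((x.year for x in hh), dtype=int, count=len(hh))

# For a generator whose length is not known in advance, simply omit count
# and fromiter grows the array as needed
months = np.fromiter((x.month for x in hh if x.year >= 2000), dtype=int)

No intermediate list is built in either case.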

Luis Masuelli
    There is [`np.fromiter`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html), so `np.fromiter((x.year for x in hh), dtype=int, count=len(hh))` is probably going to be as fast as it gets. – Jaime Aug 25 '14 at 23:01
  • `ufunc` is another mechanism. http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html It doesn't speed up the iteration, but gives access to features like ndimensions and broadcasting. – hpaulj Aug 25 '14 at 23:55
0

You can use numpy.vectorize.

In some quick benchmarks, performance is pretty similar (vectorize is slightly slower than a list comprehension), and in my opinion numpy.vectorize(lambda j: j.year)(hh) (or something similar) doesn't look particularly elegant either.
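For illustration, a rough sketch of that kind of comparison (the array size and repetition count are arbitrary choices for the example):

import datetime as dt
import timeit
import numpy as np

hh = np.array([dt.date(2000, 1, 1) + dt.timedelta(days=i) for i in range(10000)])

get_year = np.vectorize(lambda j: j.year)

# Both versions call Python code once per element, so the timings land in the same ballpark
print(timeit.timeit(lambda: np.array([x.year for x in hh]), number=100))
print(timeit.timeit(lambda: get_year(hh), number=100))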

colcarroll