Numpy vectorization messes up data type (2)

Question

I'm having unwanted behaviour come out of np.vectorize, namely, it changes the datatype of the argument going into the original function. My original question is about the general case, and I'll use this new question to ask a more specific case.

(Why this second question? I've created this question about a more specific case in order to illustrate the problem - it's always easier to go from the specific to the more general. And I've created this question separately, because I think it's useful to keep the general case, as well as a general answer to it (should one be found), by themselves and not 'contaminated' with thinking about solving any particular problem.)

So, a concrete example. Where I live, Wednesday is Lottery Day. So, let's start with a pandas dataframe with a date column with all Wednesdays this year:

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

I want to see which of these possible days I'll actually play on. I don't feel particularly lucky at the beginning and end of each month, and there are some months I feel especially unlucky about. Therefore I use this function to see if a date qualifies:

def qualifies(dt, excluded_months = []):
    #Date qualifies, if...
    #. it's on or after the 5th of the month; and
    #. at least 5 days remain till the end of the month (incl. date itself); and
    #. it's not in one of the months in excluded_months.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

I hope you realise that this example is still somewhat contrived ;) But it's closer to what I'm trying to do. I try to apply this function in two ways:

df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))
df['qualifies2'] = np.vectorize(qualifies, excluded=[1])(df['date'], [3, 8])

As far as I know, both should work, and I'd prefer the latter, as the former is slow and frowned upon. Edit: I've learned that the second is frowned upon as well lol.

However, only the first one succeeds; the second one fails with an AttributeError: 'numpy.datetime64' object has no attribute 'day'. So my question is: is there a way to use np.vectorize on this function qualifies, which takes a datetime/timestamp as an argument?

Many thanks!

PS: for the interested, this is df:

In [15]: df
Out[15]: 
         date  qualifies1
0  2020-01-01       False
1  2020-01-08        True
2  2020-01-15        True
3  2020-01-22        True
4  2020-01-29       False
5  2020-02-05        True
6  2020-02-12        True
7  2020-02-19        True
8  2020-02-26       False
9  2020-03-04       False
10 2020-03-11       False
11 2020-03-18       False
12 2020-03-25       False
13 2020-04-01       False
14 2020-04-08        True
15 2020-04-15        True
16 2020-04-22        True
17 2020-04-29       False
18 2020-05-06        True
19 2020-05-13        True
20 2020-05-20        True
21 2020-05-27        True
22 2020-06-03       False
23 2020-06-10        True
24 2020-06-17        True
25 2020-06-24        True
26 2020-07-01       False
27 2020-07-08        True
28 2020-07-15        True
29 2020-07-22        True
30 2020-07-29       False
31 2020-08-05       False
32 2020-08-12       False
33 2020-08-19       False
34 2020-08-26       False
35 2020-09-02       False
36 2020-09-09        True
37 2020-09-16        True
38 2020-09-23        True
39 2020-09-30       False
40 2020-10-07        True
41 2020-10-14        True
42 2020-10-21        True
43 2020-10-28       False
44 2020-11-04       False
45 2020-11-11        True
46 2020-11-18        True
47 2020-11-25        True
48 2020-12-02       False
49 2020-12-09        True
50 2020-12-16        True
51 2020-12-23        True
52 2020-12-30       False

1) Vectorize is also frowned upon. Either way, you're running vanilla python code and discarding all the benefits of actual vectorization. — Mad Physicist, Jan 03 '20 at 16:06
2) Vectorize will attempt to convert the first argument into an array. Both `np.array(df['date'])` and `df['date'].values` have `dtype='datetime64[ns]'` — Mad Physicist, Jan 03 '20 at 16:10
@MadPhysicist, looks like specifying `otypes=['bool']` prevents that conversion. It's as though the `np.array(...)` conversion applies only when it is doing the implied `otype` calculation. I also see this behavior when I test a simpler function that just prints the `dt` type. — hpaulj, Jan 03 '20 at 17:15
@MadPhysicist, when doing the `otypes` calculation, `vectorize` uses `np.asarray(...).ravel()[0]` to get the first test value. But for the main `loop` it does an `np.array(..., dtype=object)` in preparation to sending the args to `frompyfunc`. Maybe we should file an `issue` on this. — hpaulj, Jan 03 '20 at 20:42
@hpaulj. Definitely seems worthwhile, although I can't think of a better solution to doing a test-run. Perhaps deferring the type check until later would be useful. — Mad Physicist, Jan 03 '20 at 20:55

score 2 · Answer 1 · answered Jan 03 '20 at 16:07

I think @rpanai's answer on the original post is still the best. Here I share my tests:

def qualifies(dt, excluded_months = []):
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

def new_qualifies(dt, excluded_months = []):
    dt = pd.Timestamp(dt)
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    if dt.month in excluded_months:
        return False
    return True

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=12000)})

apply method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: qualifies(x, [3, 8]))

385 ms ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

conversion method:

%%timeit
df['qualifies1'] = df['date'].apply(lambda x: new_qualifies(x, [3, 8]))

389 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vectorized code:

%%timeit
df['qualifies2'] =  np.logical_not((df['date'].dt.day<5).values | \
    ((df['date']+pd.tseries.offsets.MonthBegin(1)-df['date']).dt.days < 5).values |\
    (df['date'].dt.month.isin([3, 8])).values)

4.83 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
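The one-liner above can also be split into named masks, which may be easier to read. This is only a sketch: qualifies_mask is a hypothetical helper name (not part of pandas), and it assumes the dates are midnight timestamps as in the example.

```python
import pandas as pd


def qualifies_mask(dates, excluded_months=()):
    """Vectorized counterpart of qualifies() for a datetime64 Series."""
    # Days from each date to the first day of the following month.
    days_left = (dates + pd.tseries.offsets.MonthBegin(1) - dates).dt.days

    too_early = dates.dt.day < 5        # before the 5th of the month
    too_late = days_left < 5            # fewer than 5 days remain
    excluded = dates.dt.month.isin(list(excluded_months))

    # A date qualifies only if none of the three conditions applies.
    return ~(too_early | too_late | excluded)


df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})
df['qualifies2'] = qualifies_mask(df['date'], [3, 8])
```

Each mask is an ordinary boolean Series, so it can be inspected on its own when a result looks wrong.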

Thanks for replying @Andrea. That's surely a large performance gain! I find it a lot less readable tbh, but I guess that's the trade-off between performance and readability. — ElRudi, Jan 03 '20 at 16:14
@ElRudi. You just haven't learned how to read it yet. You will with enough practice though. — Mad Physicist, Jan 03 '20 at 16:17
I agree, much less readable. Maybe one can try to fix it by splitting the problem into 3 parts and use three different columns, still the `.apply` method is much more elegant (more pythonic at least) — Andrea, Jan 03 '20 at 16:18

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Summary

If using np.vectorize, it's best to specify otypes. In this case, the error is caused by the trial calculation that vectorize performs when otypes is not specified. An alternative is to pass the Series as an object dtype array.

np.vectorize has a performance disclaimer. np.frompyfunc may be faster, or even a list comprehension.

testing vectorize

Let's define a simpler function - one that displays the type of the argument:

In [31]: def foo(dt, excluded_months=[]): 
    ...:     print(dt,type(dt)) 
    ...:     return True

And a smaller dataframe:

In [32]: df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})
In [33]: df                                                                     
Out[33]: 
        date
0 2020-01-01
1 2020-01-08
2 2020-01-15
3 2020-01-22
4 2020-01-29

Testing vectorize. (The vectorize docs say that using the excluded parameter degrades performance, so I'm using a lambda, as was done with apply):

In [34]: np.vectorize(lambda x:foo(x,[3,8]))(df['date'])                        
2020-01-01T00:00:00.000000000 <class 'numpy.datetime64'>
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[34]: array([ True,  True,  True,  True,  True])

That first line is the datetime64 that gives problems. The other lines are the original pandas objects. If I specify the otypes, that problem goes away:

In [35]: np.vectorize(lambda x:foo(x,[3,8]), otypes=['bool'])(df['date'])       
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[35]: array([ True,  True,  True,  True,  True])

the apply:

In [36]: df['date'].apply(lambda x: foo(x, [3, 8]))                             
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-15 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-22 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-29 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Out[36]: 
0    True
1    True
2    True
3    True
4    True
Name: date, dtype: bool

A datetime64 dtype is produced by wrapping the Series in np.array.

In [37]: np.array(df['date'])                                                   
Out[37]: 
array(['2020-01-01T00:00:00.000000000', '2020-01-08T00:00:00.000000000',
       '2020-01-15T00:00:00.000000000', '2020-01-22T00:00:00.000000000',
       '2020-01-29T00:00:00.000000000'], dtype='datetime64[ns]')

Apparently np.vectorize is doing this sort of wrapping when performing the initial trial calculation, but not when doing the main iterations. Specifying the otypes skips that trial calculation. That trial calculation has caused problems in other SO questions, though this is a more obscure case.

In the past, when I've tested np.vectorize, it has been slower than a more explicit iteration. It does have a clear performance disclaimer. It's most valuable when the function takes several inputs and needs the benefit of broadcasting. It's hard to justify when using only one argument.

np.frompyfunc underlies vectorize, but returns an object dtype. Often it is 2x faster than explicit iteration on an array, though similar in speed to iteration on a list. It seems to be most useful when creating and working with a numpy array of objects. I haven't gotten it working in this case.
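To illustrate the two points above (broadcasting over several inputs, and the object dtype result), here is a small sketch. The describe function and the sample arrays are made up purely for illustration.

```python
import numpy as np


# Hypothetical two-argument function, used only for illustration.
def describe(label, n):
    return f"{label}{n}"


ufunc = np.frompyfunc(describe, 2, 1)  # 2 inputs, 1 output

labels = np.array(['a', 'b'], dtype=object)[:, None]  # shape (2, 1)
numbers = np.array([1, 2, 3], dtype=object)           # shape (3,)

# The two inputs broadcast against each other to shape (2, 3).
out = ufunc(labels, numbers)
print(out.dtype, out.shape)

# Even a boolean-valued function comes back as object dtype,
# so an explicit cast is needed to get a bool array.
flags = np.frompyfunc(lambda n: n > 1, 1, 1)(numbers)
flags_bool = flags.astype(bool)
```

The explicit astype(bool) at the end is roughly the step that np.vectorize performs for you once it knows the otypes.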

vectorize code

The np.vectorize code is in numpy/lib/function_base.py.

If otypes is not specified, the code does:

        args = [asarray(arg) for arg in args]
        inputs = [arg.flat[0] for arg in args]
        outputs = func(*inputs)

It makes each argument (here only one) into an array, takes the first element, and passes that to func. As Out[37] shows, that element will be a datetime64 object.
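The trial step can be imitated outside of vectorize. This is a sketch of the same two conversions, not the NumPy source itself; the variable names are my own.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='7D', periods=5))

# What the trial call sees: a plain asarray, then the first element.
trial_value = np.asarray(dates).flat[0]

# What an object dtype conversion yields for the same element.
object_value = np.asarray(dates, dtype=object).flat[0]

# The first has no .day attribute, the second does.
print(type(trial_value), hasattr(trial_value, 'day'))
print(type(object_value), hasattr(object_value, 'day'))
```

The missing .day attribute on the first value is what surfaces as the AttributeError in the question.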

frompyfunc

To use frompyfunc, I need to convert the dtype of df['date']. Without the conversion, the function receives plain int values:

In [68]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'])                  
1577836800000000000 <class 'int'>
1578441600000000000 <class 'int'>
...

Without astype(object), int values are passed to the function; with it, the pandas time objects are passed:

In [69]: np.frompyfunc(lambda x:foo(x,[3,8]), 1,1)(df['date'].astype(object))   
2020-01-01 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2020-01-08 00:00:00 <class 'pandas._libs.tslibs.timestamps.Timestamp'>
...

So this use of qualifies works:

In [71]: np.frompyfunc(lambda x:qualifies(x,[3,8]),1,1)(df['date'].astype(object))                                                                     
Out[71]: 
0    False
1     True
2     True
3     True
4    False
Name: date, dtype: object

object dtype

For the main iteration, np.vectorize does

        ufunc = frompyfunc(_func, len(args), nout)
        # Convert args to object arrays first
        inputs = [array(a, copy=False, subok=True, dtype=object)
                  for a in args]
        outputs = ufunc(*inputs)

That explains why vectorize with otypes works - it is using frompyfunc with an object dtype input. Contrast this with Out[37]:

In [74]: np.array(df['date'], dtype=object)                                     
Out[74]: 
array([Timestamp('2020-01-01 00:00:00'), Timestamp('2020-01-08 00:00:00'),
       Timestamp('2020-01-15 00:00:00'), Timestamp('2020-01-22 00:00:00'),
       Timestamp('2020-01-29 00:00:00')], dtype=object)

And an alternative to specifying otypes is to make sure you are passing object dtype to vectorize:

In [75]: np.vectorize(qualifies, excluded=[1])(df['date'].astype(object), [3, 8])                                                                      
Out[75]: array([False,  True,  True,  True, False])
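For completeness, the otypes route can be applied to qualifies directly, keeping the excluded=[1] call from the question. A sketch, assuming the same df as in the question; the only changes to the function are cosmetic (a tuple default and a combined return).

```python
import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    # Same rules as in the question.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=53)})

# otypes skips the trial call, so qualifies only ever sees Timestamps.
vqualifies = np.vectorize(qualifies, excluded=[1], otypes=[bool])
df['qualifies2'] = vqualifies(df['date'], [3, 8])
```

This removes the error, but it is still a Python-level loop, so the performance disclaimer applies unchanged.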

The fastest version appears to be:

np.frompyfunc(lambda x: qualifies(x,[3,8]),1,1)(np.array(df['date'],object))

or better yet, a plain Python iteration:

[qualifies(x,[3,8]) for x in df['date']]
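A quick way to check that these loop-based variants agree, using the 5-row frame from this answer. This is only a consistency sketch; it makes no claim about timings, which are best measured on your own data.

```python
import numpy as np
import pandas as pd


def qualifies(dt, excluded_months=()):
    # Same rules as in the question.
    if dt.day < 5:
        return False
    if (dt + pd.tseries.offsets.MonthBegin(1) - dt).days < 5:
        return False
    return dt.month not in excluded_months


df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='7D', periods=5)})

# 1. Series.apply
via_apply = df['date'].apply(lambda x: qualifies(x, [3, 8])).tolist()

# 2. frompyfunc on an object array of Timestamps
ufunc = np.frompyfunc(lambda x: qualifies(x, [3, 8]), 1, 1)
via_ufunc = ufunc(np.array(df['date'], dtype=object)).tolist()

# 3. Plain list comprehension (iterating the Series yields Timestamps)
via_list = [qualifies(x, [3, 8]) for x in df['date']]

print(via_apply == via_ufunc == via_list)
```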

score 0 · Answer 3 · answered Jan 03 '20 at 15:51

Just as in the original question, I can "solve" the problem by forcing the incoming argument to be a pandas datetime object, by adding dt = pd.to_datetime(dt) before the first if-statement of the function.

To be honest, this feels like patching-up something that's broken and should not be used. I'll just use .apply instead and take the performance hit. Anyone that feels there's a better solution is very much invited to share :)

ElRudi

2,122
2
18
33

Is this approach really faster than the `.apply` one performance-wise? In my tests it turns out it is actually slower – Andrea Jan 03 '20 at 15:57
I just tried and interestingly, it's not. I've tried with short (53) and long (5300 rows) dataframes, the times are a few % apart at most. It's interesting because if you look at my answer to the original question (linked in this answer), you see that it's twice as fast when the function-to-be-vectorized takes only one argument. – ElRudi Jan 03 '20 at 16:03
I managed to fix this too by adding ``dt = pd.Timestamp(dt)``, since the attribute ``day`` comes from the ``Timestamp`` class. It seems that ``vectorize`` accesses ``df["date"].values`` quietly or something. It probably deserves a GitHub issue. – Nathan Furnal Jan 03 '20 at 16:04
I think the problem might be that numpy `vectorize` can only work with numpy's datatypes and, since a `Timestamp` object is not, it can not work. – Andrea Jan 03 '20 at 16:14
`np.vectorize` is a `numpy` function. It is not `pandas` aware! – hpaulj Jan 03 '20 at 17:34
