2

When using pandas dataframes, it's a common situation to create a column B with the information in column A.

Background

In some cases, it's possible to do this in one go (df['B'] = df['A'] + 4), but in others, the operation is more complex and a separate function is written. In that case, this function can be applied in one of two ways (that I know of):

def calc_b(a): 
    return a + 4

df = pd.DataFrame({'A': np.random.randint(0, 50, 5)})
df['B1'] = df['A'].apply(lambda x: calc_b(x))
df['B2'] = np.vectorize(calc_b)(df['A'])

The resulting dataframe:

    A  B1  B2
0  17  21  21
1  25  29  29
2   6  10  10
3  21  25  25
4  14  18  18

Perfect - both ways have the correct result. In my code, I've been using the np.vectorize way, as .apply is slow and considered bad practise.

Now comes my problem

This method seems to be breaking down when working with datetimes / timestamps. A minimal working example is this:

def is_past_midmonth(dt):
    return (dt.day > 15)

df = pd.DataFrame({'date':pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])

The .apply way works; the resulting dataframe is

        date  past_midmonth1
0 2020-01-01           False
1 2020-01-07           False
2 2020-01-13           False
3 2020-01-19            True
4 2020-01-25            True
5 2020-01-31            True
6 2020-02-06           False

But the np.vectorize way fails with an AttributeError: 'numpy.datetime64' object has no attribute 'day'.

Digging a bit with type(), the elements of df['date'] are of the <class 'pandas._libs.tslibs.timestamps.Timestamp'>, which is also how the function receives them. In the vectorized function, however, they are received as instances of <class 'numpy.datetime64'>, which then causes the error.

I have two questions:

  • Is there a way to 'fix' this behaviour of np.vectorize? How?
  • How can I avoid these kinds of incompatibilities in general?

Of course I can make a mental note to not use np.vectorize functions that take datetime arguments, but that is cumbersome. I'd like a solution that always works so I don't have to think about it whenever I encounter this situation.

As stated, this is a minimal working example that demonstrates the problem. I know I could use easier, all-column-at-once operations in this case, exactly as I could in the first example with the int column. But that's beside the point here; I'm interested in the general case of vectorizing any function that takes timestamp arguments. For those asking about a more concrete/complicated example, I've created one here.

Edit: I was wondering if using type hinting would make a difference - if numpy would actually take this information into account - but I doubt it, as using this signature def is_past_midmonth(dt: float) -> bool:, where float is obviously wrong, gives the same error. I'm pretty new to type hinting though, and I don't have an IDE that supports it, so it's a bit hard for me to debug.

Many thanks!

Community
  • 1
  • 1
ElRudi
  • 2,122
  • 2
  • 18
  • 33
  • 2
    Looking [at this post](https://stackoverflow.com/questions/13648774/get-year-month-or-day-from-numpy-datetime64) it seems like the ".apply" solution might be the way to go for now. The `vectorize` function converts the `Timedelta` object into a `numpy.datetime64` object so, imo, there is little one can do. – Andrea Jan 03 '20 at 11:17
  • Thanks @Andrea. I had actually seen that post, but thought... it's been 7 years! :) – ElRudi Jan 03 '20 at 11:54
  • I agree, but apparently we are still in the same situation ;) Still, the int workaround by @rpanai seems interesting – Andrea Jan 03 '20 at 12:56
  • " I'm interested in the general case of vectorizing any function that takes timestamp arguments." There's no solution that will be trivially true, some operations are fundamentally iterative and some form of (glorified) looping will maybe be required. For a vectorized implementation, we'll prob need more details – Nathan Furnal Jan 03 '20 at 13:40
  • `np.vectorize` without `otypes` guesses the return `dtype` from a trial calculation. It also only works with `numpy` dtypes. It can be tricky to use even in numpy. – hpaulj Jan 03 '20 at 14:15
  • When it works is vectorize any faster than apply? – hpaulj Jan 03 '20 at 14:18
  • @hpaulj: [yes](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c) – ElRudi Jan 03 '20 at 14:45
  • @NaturalFrequency: I've created a more concrete/complicated example, closer to how I'm actually using it, [here](https://stackoverflow.com/questions/59580504/numpy-vectorization-messes-up-data-type-2). (Also edited the question to include this link.) – ElRudi Jan 03 '20 at 14:54

2 Answers2

3

Have you consider passing the day as int instead of the datetime64[ns]?

import pandas as pd
import numpy as np

# I'd avoid use dt as it's used as alias for datetime
def is_past_midmonth1(d): 
    return (d.day > 15)

def is_past_midmonth2(day):
    return (day > 15)

N = int(1e4)
df = pd.DataFrame({'date':pd.date_range('2020-01-01', freq='6D',
                                        periods=N)})

Apply (using datetime)

%%time
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth1(x))

CPU times: user 55.4 ms, sys: 0 ns, total: 55.4 ms
Wall time: 53.8 ms

Apply (using int)

%%time
df['past_midmonth2'] = (df['date'].dt.day).apply(lambda x: is_past_midmonth2(x))

CPU times: user 4.71 ms, sys: 0 ns, total: 4.71 ms
Wall time: 4.16 ms

np.vectorize

%%time
df['past_midmonth2_vec'] = np.vectorize(is_past_midmonth2)(df['date'].dt.day)

CPU times: user 4.2 ms, sys: 75 µs, total: 4.27 ms
Wall time: 3.49 ms

Vectorizing your code

%%time
df['past_midmonth3'] = df["date"].dt.day>15

CPU times: user 3.1 ms, sys: 11 µs, total: 3.11 ms
Wall time: 2.41 ms

Timing

enter image description here

rpanai
  • 12,515
  • 2
  • 42
  • 64
  • Thanks @rpanai for your answer. However, the problem I presented was a _minimal working example_; I'll update my post to point that out even more. My actual function is a lot more complex, and needs to have the timestamp passed to it, as more checks are done on it (not only it's `day` attribute is used). This makes the 3 proposed solutions impossible I think. – ElRudi Jan 03 '20 at 13:29
0

I'll write this as an Answer, though I feel it's barely a Workaround; so please add your answer if you have one that's better. :)

By forcing the incoming dt argument into a pandas datetime object with dt = pd.to_datetime(dt), it works.

def is_past_midmonth(dt):
    dt = pd.to_datetime(dt) #the only addition
    return (dt.day > 15)

df = pd.DataFrame({'date':pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date']) #this now works
In[45]: df
Out[45]: 
        date  past_midmonth1  past_midmonth2
0 2020-01-01           False           False
1 2020-01-07           False           False
2 2020-01-13           False           False
3 2020-01-19            True            True
4 2020-01-25            True            True
5 2020-01-31            True            True
6 2020-02-06           False           False

For those interested - execution time is about halved (for a longer dataframe).

ElRudi
  • 2,122
  • 2
  • 18
  • 33