When using pandas dataframes, it's a common situation to create a column B with the information in column A.
Background
In some cases, it's possible to do this in one go (df['B'] = df['A'] + 4), but in others, the operation is more complex and a separate function is written. In that case, this function can be applied in one of two ways (that I know of):
import numpy as np
import pandas as pd

def calc_b(a):
    return a + 4

df = pd.DataFrame({'A': np.random.randint(0, 50, 5)})
df['B1'] = df['A'].apply(lambda x: calc_b(x))  # element-wise via pandas
df['B2'] = np.vectorize(calc_b)(df['A'])       # element-wise via NumPy
The resulting dataframe:
    A  B1  B2
0  17  21  21
1  25  29  29
2   6  10  10
3  21  25  25
4  14  18  18
Perfect - both ways give the correct result. In my code, I've been using the np.vectorize way, as .apply is slow and considered bad practice.
Now comes my problem
This method seems to be breaking down when working with datetimes / timestamps. A minimal working example is this:
def is_past_midmonth(dt):
    return dt.day > 15

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})
df['past_midmonth1'] = df['date'].apply(lambda x: is_past_midmonth(x))
df['past_midmonth2'] = np.vectorize(is_past_midmonth)(df['date'])  # raises AttributeError
The .apply way works; the resulting dataframe is
        date  past_midmonth1
0 2020-01-01           False
1 2020-01-07           False
2 2020-01-13           False
3 2020-01-19            True
4 2020-01-25            True
5 2020-01-31            True
6 2020-02-06           False
But the np.vectorize way fails with an AttributeError: 'numpy.datetime64' object has no attribute 'day'.
Digging a bit with type(), the elements of df['date'] are of type <class 'pandas._libs.tslibs.timestamps.Timestamp'>, which is also how the function receives them when called through .apply. In the vectorized function, however, they are received as instances of <class 'numpy.datetime64'>, which then causes the error.
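The type difference can be reproduced without np.vectorize at all. This is only a minimal sketch, under the assumption that np.vectorize converts its input to a plain NumPy array at some point before calling the function (as far as I can tell, it does so at least when probing the function for its output type). The names dates, from_series and from_array are made up for the illustration.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('2020-01-01', freq='6D', periods=7))

# Pulling an element out of the Series yields a pandas Timestamp,
# which is also what .apply hands to the function.
from_series = dates.iloc[0]

# Converting the Series to a NumPy array yields numpy.datetime64 elements
# (assumption: np.vectorize does a comparable conversion internally).
from_array = np.asarray(dates)[0]

print(type(from_series))
print(type(from_array))

# Timestamp has a .day attribute; numpy.datetime64 does not.
print(hasattr(from_series, 'day'), hasattr(from_array, 'day'))
```

So the function itself is fine; it is the element type that changes depending on who does the iterating.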
I have two questions:
- Is there a way to 'fix' this behaviour of np.vectorize? How?
- How can I avoid these kinds of incompatibilities in general?
Of course I can make a mental note not to use np.vectorize with functions that take datetime arguments, but that is cumbersome. I'd like a solution that always works so I don't have to think about it whenever I encounter this situation.
As stated, this is a minimal working example that demonstrates the problem. I know I could use easier, all-column-at-once operations in this case, exactly as I could in the first example with the int column. But that's beside the point here; I'm interested in the general case of vectorizing any function that takes timestamp arguments. For those asking about a more concrete/complicated example, I've created one here.
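For completeness, the all-column-at-once version of this particular example could look roughly like the sketch below. It uses the pandas .dt accessor; the column name past_midmonth3 is made up here and is not part of my actual code.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2020-01-01', freq='6D', periods=7)})

# The .dt accessor exposes datetime components for the whole column,
# so the comparison runs on all rows at once without a Python-level function.
df['past_midmonth3'] = df['date'].dt.day > 15

print(df)
```

This sidesteps the Timestamp / datetime64 mismatch entirely, but it only helps when the logic can be expressed with column-wise operations.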
Edit: I was wondering if using type hinting would make a difference - if numpy would actually take this information into account - but I doubt it, as using this signature def is_past_midmonth(dt: float) -> bool:, where float is obviously wrong, gives the same error. I'm pretty new to type hinting though, and I don't have an IDE that supports it, so it's a bit hard for me to debug.
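A small check that seems to confirm the doubt: Python stores type hints on the function but does not enforce them at runtime, and as far as I can tell np.vectorize does not look at them either. In this sketch, add_four is a hypothetical function with deliberately wrong hints.

```python
import numpy as np

def add_four(a: str) -> str:  # hints are deliberately wrong
    return a + 4

# The hints are only stored as metadata on the function object.
print(add_four.__annotations__)

# The call still works on integers, so the hints had no effect on
# how the elements were passed in or converted.
result = np.vectorize(add_four)(np.array([1, 2, 3]))
print(result)
```

If the hints mattered, the call on integers would have failed or the values would have been converted to strings; neither happens.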
Many thanks!