I am writing a library of Pandas routines that needs to be able to deal with dates in data frames that are potentially of different types. Specifically, I get different combinations of types datetime.date
and pandas._libs.tslib.Timestamp
quite a bit. This is reported (and confirmed by my testing) to be related to frames having had a multi-index set and then reset. (See my earlier question What changes type of date in this pandas code?, which addresses the issue of the type being changed when going back and forth from multi-index.)
Here's a short (but contrived) example:
import pandas as pd
df = pd.DataFrame(
data={
'date' : ['2019-01-01', '2019-01-02', '2019-01-03'],
'value' : [1, 2, 3],
'other' : [11, 12, 13]
}
)
df.date = pd.to_datetime(df.date).dt.date
print df.head()
date other value
0 2019-01-01 11 1
1 2019-01-02 12 2
2 2019-01-03 13 3
df_reindex = df.set_index(['date','other']).reset_index()
date other value
0 2019-01-01 11 1
1 2019-01-02 12 2
2 2019-01-03 13 3
print pd.merge(df, df_reindex, on='date')
Empty DataFrame
Columns: [date, other_x, value_x, other_y, value_y]
Index: []
print pd.merge(df, df, on='date')
date other_x value_x other_y value_y
0 2019-01-01 11 1 11 1
1 2019-01-02 12 2 12 2
2 2019-01-03 13 3 13 3
print pd.merge(df_reindex, df_reindex, on='date')
date other_x value_x other_y value_y
0 2019-01-01 11 1 11 1
1 2019-01-02 12 2 12 2
2 2019-01-03 13 3 13 3
print type(df.date[0])
<type 'datetime.date'>
print type(df_reindex.date[0])
<class 'pandas._libs.tslib.Timestamp'>
Here both df
and df_reindex
have essentially the same data content, but, because of the type having been changed inside of Pandas at the point of the set_index
, the merge between them is empty while the "self-merge" between either of the two with itself gives the expected (albeit, in the contrived case here, redundant and trivial) result.
Real cases, of course, reach this point without having set and reset the index capriciously like this - The real data passes through multiple bits of code performing different operations with different indexing requirements at different stages, and the merge is between frames that have some non-overlapping columns.
There's a comment on my other question about using NumPy datetimes instead, but that seems futile, as the conversion apparently results in one of the two problematic data types, i.e. the underlying Pandas class:
df_numpy = df.copy()
df_numpy.date = df.date.apply(np.datetime64)
print type(df.date[0])
<type 'datetime.date'>
print type(df_numpy.date[0])
<class 'pandas._libs.tslib.Timestamp'>
(Not to mention that I'm working within an existing framework so it may not be possible to force all frames to have NumPy types instead of Pandas types at this point.)
What I need to do is be able to merge tables inside the library code, whether or not they have been so manipulated by the caller outside of my control. The data frames that I get as inputs cannot be changed. I could copy them and implement a direct conversion on the copies (as here pandas merge on date column issue) but the frames are sometimes large and I don't want to copy them unless there's no other option.
Is there a way to get the merge to recognize these as equivalent? If not, is there a best-practice choice of date format that avoids the conversion issue exposed here?