3

I am writing a library of Pandas routines that needs to be able to deal with dates in data frames that are potentially of different types. Specifically, I get different combinations of types datetime.date and pandas._libs.tslib.Timestamp quite a bit. This is reported (and confirmed by my testing) to be related to frames having had a multi-index set and then reset. (See my earlier question What changes type of date in this pandas code?, which addresses the issue of the type being changed when going back and forth from multi-index.)

Here's a short (but contrived) example:

import pandas as pd
df = pd.DataFrame(
   data={
      'date' : ['2019-01-01', '2019-01-02', '2019-01-03'], 
      'value' : [1, 2, 3], 
      'other' : [11, 12, 13]
   }
)

df.date = pd.to_datetime(df.date).dt.date

print df.head()
         date  other  value
0  2019-01-01     11      1
1  2019-01-02     12      2
2  2019-01-03     13      3

df_reindex = df.set_index(['date','other']).reset_index()
         date  other  value
0  2019-01-01     11      1
1  2019-01-02     12      2
2  2019-01-03     13      3

print pd.merge(df, df_reindex, on='date')
Empty DataFrame
Columns: [date, other_x, value_x, other_y, value_y]
Index: []

print pd.merge(df, df, on='date')
         date  other_x  value_x  other_y  value_y
0  2019-01-01       11        1       11        1
1  2019-01-02       12        2       12        2
2  2019-01-03       13        3       13        3

print pd.merge(df_reindex, df_reindex, on='date')
        date  other_x  value_x  other_y  value_y
0 2019-01-01       11        1       11        1
1 2019-01-02       12        2       12        2
2 2019-01-03       13        3       13        3

print type(df.date[0])
<type 'datetime.date'>

print type(df_reindex.date[0])
<class 'pandas._libs.tslib.Timestamp'>

Here both df and df_reindex have essentially the same data content, but, because of the type having been changed inside of Pandas at the point of the set_index, the merge between them is empty while the "self-merge" between either of the two with itself gives the expected (albeit, in the contrived case here, redundant and trivial) result.

Real cases, of course, reach this point without having set and reset the index capriciously like this - The real data passes through multiple bits of code performing different operations with different indexing requirements at different stages, and the merge is between frames that have some non-overlapping columns.

There's a comment on my other question about using NumPy datetimes instead, but that seems futile, as the conversion apparently results in one of the two problematic data types, i.e. the underlying Pandas class:

df_numpy = df.copy()

df_numpy.date = df.date.apply(np.datetime64)

print type(df.date[0])
<type 'datetime.date'>

print type(df_numpy.date[0])
<class 'pandas._libs.tslib.Timestamp'>

(Not to mention that I'm working within an existing framework so it may not be possible to force all frames to have NumPy types instead of Pandas types at this point.)

What I need to do is be able to merge tables inside the library code, whether or not they have been so manipulated by the caller outside of my control. The data frames that I get as inputs cannot be changed. I could copy them and implement a direct conversion on the copies (as here pandas merge on date column issue) but the frames are sometimes large and I don't want to copy them unless there's no other option.

Is there a way to get the merge to recognize these as equivalent? If not, is there a best-practice choice of date format that avoids the conversion issue exposed here?

Brick
  • 3,998
  • 8
  • 27
  • 47
  • Have you tried forcing the datetime using df['date'] = pd.to_datetime(df['date'],unit='D') (or any other unit as necessary) before merging? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html – rhedak Oct 08 '19 at 01:19
  • That has a side effect on the input, @rhedak, which isn't allowed and is covered in the second question linked. I'd have to copy the frame to avoid the side effect, which is what I hope isn't needed. I suppose maybe I could copy just the series, modify it in place, and copy the original back at the end, but that's still more than should ideally be necessary, especially since in many case the data types are matched. Ultimately I think this is a bug in Pandas - that it changes the types sometimes and not others (see the first linked question for details) - but I need a workaround. – Brick Oct 08 '19 at 02:39
  • can't you make another column just for joining and then drop it in the end? – rhedak Oct 09 '19 at 02:27

1 Answers1

0

Is there a way to get the merge to recognize these as equivalent?

No, not with the current pandas code:

    # datetimelikes must match exactly
    elif is_datetimelike(lk) and not is_datetimelike(rk):
        raise ValueError(msg)
    elif not is_datetimelike(lk) and is_datetimelike(rk):
        raise ValueError(msg)
    elif is_datetime64tz_dtype(lk) and not is_datetime64tz_dtype(rk):
        raise ValueError(msg)
    elif not is_datetime64tz_dtype(lk) and is_datetime64tz_dtype(rk):
        raise ValueError(msg)

If not, is there a best-practice choice of date format that avoids the conversion issue exposed here?

I understand from your question that the dataframes and their data types are outside of your control and that they cannot be changed, so this question will lead us nowhere.


What you would need is a conditional join like in SQL. There's an old open issue for this feature with no activity since 2014. (PRs are invited...)


A possible workaround would be something like that:
def merge_on_date(left, right, on):
    from pandas.core.dtypes.common import is_datetimelike
    try:
        return pd.merge(left, right, on=on)
    except:
        if is_datetimelike(right[on]):
            return pd.merge(left, right.assign(**{on: eval('right[on].dt.date')}), on=on)
        else:
            return pd.merge(left.assign(**{on: eval('left[on].dt.date')}), right, on=on)

Result:

>>> merge_on_date(df, df_reindex, 'date')
         date  value_x  other_x  other_y  value_y
0  2019-01-01        1       11       11        1
1  2019-01-02        2       12       12        2
2  2019-01-03        3       13       13        3
>>> merge_on_date(df_reindex, df, 'date')
         date  other_x  value_x  value_y  other_y
0  2019-01-01       11        1        1       11
1  2019-01-02       12        2        2       12
2  2019-01-03       13        3        3       13

This big drawback however is that assign makes a copy under the hood.

PS: I just saw that pd.merge(df, df_reindex, on='date') yields an empty dataframe in your example. Since version 0.22.0 this should raise a ValueError. What version are you using?

Stef
  • 28,728
  • 2
  • 24
  • 52
  • I ran the example on a system using pandas version 0.20.1. The real code runs on multiple machines running a variety of versions, although I'm yet to see a `ValueError` for this case, so I guess none of them are running 0.22.0 or later, based on your comment. – Brick Oct 09 '19 at 14:43
  • OK, in this case you'll have to refactor the `merge_on_date` function a bit to do the tests in the first place and do a regular merge if left and right are of same type. – Stef Oct 09 '19 at 14:48