
Here is an example of the problem:

>>> df = DataFrame({'a':[1,2]},index=[datetime.today(),datetime.today()+timedelta(days=1)])
>>> df
                            a
2013-02-15 09:36:14.665272  1
2013-02-16 09:36:14.857322  2
>>> dup_index = datetime.today()
>>> df2 = DataFrame({'a':[2,3]},index=[dup_index,dup_index])
>>> df2
                            a
2013-02-15 09:37:11.701271  2
2013-02-15 09:37:11.701271  3
>>>
>>> df2.reindex(df.index,method='ffill')
Traceback (most recent call last):
...
Exception: Reindexing only valid with uniquely valued Index objects

I wish to merge df2 with df. Because the index times do not match up, I wish to match each df2 time with the closest preceding time in df, which is the first row. One artificial workaround I came up with was to add a fake microsecond value to the second time series so that its index becomes unique. But this is slow for big dataframes. Is there a particular reason why this is not allowed? It seems like a logical thing to do. Are there any better ways for me to overcome this limitation?
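For reference, later versions of pandas (0.19+) added `pd.merge_asof`, which does this closest-preceding-time match directly without any de-duplication step. A minimal sketch using timestamps in the same shape as the example above (the exact times are illustrative, not the original data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]},
                  index=pd.to_datetime(['2013-02-15 09:36:14',
                                        '2013-02-16 09:36:14']))
df2 = pd.DataFrame({'b': [2, 3]},
                   index=pd.to_datetime(['2013-02-15 09:37:11',
                                         '2013-02-15 09:37:11']))

# merge_asof matches each row of df to the closest earlier row of df2
# (direction='backward' is the default); both sides must be sorted on the key,
# and duplicate keys on the right are allowed -- the last duplicate wins
merged = pd.merge_asof(df.sort_index(), df2.sort_index(),
                       left_index=True, right_index=True)
```

Here the first row of `df` predates every `df2` timestamp, so its `b` value is `NaN`; the second row picks up `3`, the last of the duplicated entries.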

tarotcard
  • Perhaps you should give an example for your original problem (with a time series index). Maybe there is another option to solve your problem. – bmu Feb 14 '13 at 09:46
  • I have modified my example to be closer to my original problem. – tarotcard Feb 15 '13 at 02:10
  • I still don't really understand the question. `pd.concat((df, df2))` works if you only want to concatenate the df's. – bmu Feb 15 '13 at 20:22
  • I want to do pd.merge on on df and df2 on the index, but since the index is not exactly equal, merging does not work. What I ultimately want to do, is to merge df2 with the roughly closest index in df. – tarotcard Feb 16 '13 at 07:48
  • Wasn't this issue [fixed back in pandas 0.9.1](https://github.com/pydata/pandas/issues/2236)? If yes, please update your title accordingly. – smci May 02 '13 at 13:29

1 Answer


I ran into a similar problem recently. I solved it by first removing duplicates from df2. Doing it this way makes you think about which one to keep and which to discard. Unfortunately, pandas doesn't seem to have a great way to remove duplicates based on duplicate index entries, but this workaround (adding an 'index' column to df2) should do it:

>>> df2['index'] = df2.index
>>> df3 = df2.drop_duplicates(cols='index', take_last=True).reindex(df.index, method='ffill')
>>> del df3['index']
>>> df3
                             a
2013-02-21 09:51:56.615338 NaN
2013-02-22 09:51:56.615357   3

Of course, you could set `take_last=False` to get a value of 2 for the `a` column instead.
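Note that in later pandas releases the `cols=` and `take_last=` arguments were renamed to `subset=` and `keep=`, and duplicated index entries can be dropped without the temporary column via `Index.duplicated`. A sketch of the same dedup-then-ffill approach with the modern API, using the test data below:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2]},
                  index=pd.to_datetime(['2013-02-21 09:51:56.615338',
                                        '2013-02-22 09:51:56.615357']))
df2 = pd.DataFrame({'a': [2, 3]},
                   index=pd.to_datetime(['2013-02-21 09:51:57.802331',
                                         '2013-02-21 09:51:57.802331']))

# keep the last of each duplicated index entry, then forward-fill onto df's index
df3 = df2[~df2.index.duplicated(keep='last')].reindex(df.index, method='ffill')
```

This produces the same result as the original transcript: `NaN` for the first row (no earlier df2 value to fill from) and `3` for the second.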

I noticed that you said "I wish to match the df2 time with the closest last time in df, which is the first row." I didn't quite understand this statement. The closest time in df to the times in df2 is the second row, not the first row. If I misunderstood your question, let me know and I'll update this answer.

For reference, here is my test data:

>>> df
                            a
2013-02-21 09:51:56.615338  1
2013-02-22 09:51:56.615357  2
>>> df2
                            a
2013-02-21 09:51:57.802331  2
2013-02-21 09:51:57.802331  3
D. A.
  • Thanks for answering, I came up eventually with the same bypass by removing duplicates from df2. Actually you can remove duplicate indices by doing df2.groupby(df2.index).first(). As for the puzzling statement, it was indeed a mistake, I did not notice the minute was different. – tarotcard Feb 26 '13 at 04:45
  • Good tip on the groupby. Thanks. – D. A. Feb 26 '13 at 14:26
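The groupby-based de-duplication mentioned in the comment above can be sketched like this; `.first()` keeps the first value in each group of duplicated index entries:

```python
import pandas as pd

df2 = pd.DataFrame({'a': [2, 3]},
                   index=pd.to_datetime(['2013-02-21 09:51:57.802331',
                                         '2013-02-21 09:51:57.802331']))

# collapse duplicate index entries, keeping the first row of each group
deduped = df2.groupby(df2.index).first()
```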