
I have a DataFrame that looks like:

------------------------------
|Date       | Deal | Country |
------------------------------
|2019-01-02 | ABC  | US      |
------------------------------
|2019-02-01 | ABC  | US      |
------------------------------
|2019-03-03 | ABC  | US      |
------------------------------

I have another DataFrame that looks similar, with Country populated based on certain logic:

------------------------------
|Date       | Deal | Country |
------------------------------
|2019-01-01 | ABC  | NaN     |
------------------------------
|2019-02-01 | ABC  | US      |
------------------------------
|2019-03-02 | ABC  | NaN     |
------------------------------

Notice that the dates are different. In the second DataFrame, I wish to populate the null Country tags by referencing the latest Country tag from the first DataFrame. I could do something like searching for the Deal in the first DataFrame, sorting in descending order by Date, and taking the latest Country value to populate the second DataFrame. However, this gets very slow for 2 million+ rows.

Is there any better way to do this?

I tried groupby(['Date', 'Deal']).agg(lambda x: sorted(x)[0]) but can't merge with the first DataFrame because the dates don't match. I just need to fill it with the nearest Country tag based on Date.

Any pointers on this would be greatly appreciated! Thanks in advance.

CodeSsscala
  • Will you please share df.head().to_dict()? It will give some sample data. – Charles Fox Sep 16 '19 at 17:43
  • Sure, here is the data for 1st one {'Country': {0: 'US', 1: 'US', 2: 'US'}, 'Date': {0: '2019-01-02', 1: '2019-02-01', 2: '2019-03-03'}, 'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}} & the 2nd one {'Country': {0: 'US', 1: 'US', 2: 'US'}, 'Date': {0: '2019-01-01', 1: '2019-02-01', 2: '2019-03-02'}, 'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}} – CodeSsscala Sep 16 '19 at 17:47
  • Do both tables have 2million + rows or is the second table much larger than the first? – Charles Fox Sep 16 '19 at 17:50
  • 2nd table is actually smaller than 1st (50% smaller) so like 1 mil rows – CodeSsscala Sep 16 '19 at 17:52
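For anyone reproducing this, here is a minimal sketch that rebuilds the two frames from the dicts shared in the comments above. The NaN values from the question's second table are substituted in by hand, and parsing Date to datetime is my own assumption (the dicts hold plain strings):

import pandas as pd
import numpy as np

# First DataFrame, as shared in the comment above
df1 = pd.DataFrame({'Country': {0: 'US', 1: 'US', 2: 'US'},
                    'Date': {0: '2019-01-02', 1: '2019-02-01', 2: '2019-03-03'},
                    'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}})

# Second DataFrame; rows 0 and 2 carry NaN, as in the question's table
df2 = pd.DataFrame({'Country': {0: np.nan, 1: 'US', 2: np.nan},
                    'Date': {0: '2019-01-01', 1: '2019-02-01', 2: '2019-03-02'},
                    'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}})

# Assumption: parse Date as datetime so date comparisons behave as expected
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])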

2 Answers

import pandas as pd
import numpy as np

tbl1 = pd.DataFrame({'Country': {0: 'US0', 1: 'US1', 2: 'US2'}, 'Date': {0: '2019-01-02', 1: '2019-02-01', 2: '2019-03-03'}, 'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}})
tbl2 = pd.DataFrame({'Country': {0: 'US', 1: np.nan, 2: np.nan}, 'Date': {0: '2019-01-01', 1: '2019-02-01', 2: '2019-03-02'}, 'Deal': {0: 'ABC', 1: 'ABC', 2: 'ABC'}})

print(tbl1)
print(tbl2)

# Index table 1 by (Deal, Date) and sort it once up front, so each
# per-row look-up below is a cheap index slice instead of a full scan.
t1_index = pd.MultiIndex.from_frame(tbl1[['Deal', 'Date']])
tbl1.index = t1_index
tbl1.sort_index(inplace=True)

# For each row of table 2: keep Country if it is present, otherwise take
# the most recent table-1 Country for the same Deal on or before that Date.
tbl2['country_filled'] = tbl2.apply(
    lambda row: row['Country'] if pd.notnull(row['Country'])
    else tbl1.xs(row['Deal'], level='Deal').loc[:row['Date'], 'Country'].iloc[-1],
    axis=1)

print(tbl1)
print(tbl2)

Sorting table 1 once up front, before doing your look-ups, rather than sorting inside the agg function, may help.
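With the toy data above, country_filled should come out as ['US', 'US1', 'US1']: the non-null value is kept as-is, and the two missing rows pick up the most recent table-1 Country on or before their respective dates.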

Charles Fox

You can use merge_asof with direction='nearest' on 'Date' to find the closest date, matching on 'Deal' first. Then fillna the Country column in df2:

df2 = df2.fillna(pd.merge_asof(df2.drop('Country', axis=1), df1, 
                               by='Deal', # merge first on same Deal
                               on='Date', # merge on nearest date
                               direction='nearest')) #closest date
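One caveat: merge_asof expects the on key to be sorted in both frames and to be datetime-like (or numeric), so if Date is still held as strings it needs converting first. A minimal sketch of that preparation, assuming df1 and df2 as above:

# merge_asof requires a sorted, datetime-like (or numeric) join key,
# so convert Date and sort both frames before merging
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
df1 = df1.sort_values('Date')
df2 = df2.sort_values('Date').reset_index(drop=True)  # reset so fillna aligns row-for-row

df2 = df2.fillna(pd.merge_asof(df2.drop('Country', axis=1), df1,
                               by='Deal',           # match rows of the same Deal
                               on='Date',           # then take the nearest Date
                               direction='nearest'))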
Ben.T