I have 2 dataframes
df1 has 700,000 datapoints
- ID1: Categorical Variable has 400+ categories
- Date1: 1 year data
- MST1: 2 level variable 1/2
- Confirmation Number: unique Identifier
sample data:
ID1 ExtractDate1 MktSeg1 ConfirmationNo
145 3/7/2017 2 29083253
145 3/7/2017 1 29085100
145 3/7/2017 1 29085102
145 3/7/2017 1 29085106
145 3/7/2017 1 29084895
145 3/7/2017 1 29084953
df2 has 100,000 datapoints
- ID2: Categorical Variable has 400+ categories
- Date2: 1 year data
- MST2: 2 level variable 1/2
- Conf No: unique Identifier
I want to make a new variable, flag to df1 which is marked as 1 when confirmation number present in df1 and df2 else 0
I've achieved this by using the following:
combi1 <- sqldf("SELECT Distinct ID1,
ExtractDate1,
MktSeg1,
ConfirmationNo,
CASE
WHEN confno IS NOT NULL
THEN 1
ELSE 0
END AS 'Flag'
FROM df1
LEFT JOIN df2 ON ID1 = ID2
AND ExtractDate2 >= ExtractDate1
AND ConfirmationNo = confno", drv = "SQLite")
It take more than 20-30 minutes to give the results, is there any better way of doing it?
I've tried this
combi3 <- left_join(tbl_df(df1),tbl_df(df2),
by = c("ID1" = "ID2" , "ExtractDate1" <= "ExtractDate2", "ConfirmationNo" = "ConfNo")) %>%
select(distinct(ID1, ExtractDate1, MktSeg1, ConfirmationNo))
it is throwing the following error:
`by` can't contain join column `TRUE` which is missing from LHS