2

I have 2 dataframes

df1 has 700,000 datapoints

  • ID1: Categorical Variable has 400+ categories
  • Date1: 1 year data
  • MST1: 2 level variable 1/2
  • Confirmation Number: unique Identifier

sample data:

ID1 ExtractDate1    MktSeg1 ConfirmationNo
145  3/7/2017          2    29083253
145  3/7/2017          1    29085100
145  3/7/2017          1    29085102
145  3/7/2017          1    29085106
145  3/7/2017          1    29084895
145  3/7/2017          1    29084953

df2 has 100,000 datapoints

  • ID2: Categorical Variable has 400+ categories
  • Date2: 1 year data
  • MST2: 2 level variable 1/2
  • Conf No: unique Identifier

I want to make a new variable, flag to df1 which is marked as 1 when confirmation number present in df1 and df2 else 0

I've achieved this by using the following:

combi1 <- sqldf("SELECT Distinct ID1,
            ExtractDate1,
            MktSeg1,
            ConfirmationNo,
            CASE
            WHEN confno IS NOT NULL
            THEN 1
            ELSE 0
            END AS 'Flag'
            FROM df1 
            LEFT JOIN df2  ON ID1 = ID2
            AND ExtractDate2 >= ExtractDate1
            AND ConfirmationNo = confno", drv = "SQLite")

It take more than 20-30 minutes to give the results, is there any better way of doing it?

I've tried this

combi3 <- left_join(tbl_df(df1),tbl_df(df2),
                by = c("ID1" = "ID2" , "ExtractDate1" <= "ExtractDate2", "ConfirmationNo" = "ConfNo")) %>%
      select(distinct(ID1, ExtractDate1, MktSeg1, ConfirmationNo))

it is throwing the following error:

`by` can't contain join column `TRUE` which is missing from LHS
Toros91
  • 303
  • 2
  • 11
  • 1
    Convert to data.table `setDT`, then set the keys for ID columns `setkey`, then use `%in%` as suggested in MHammer's post below. – zx8754 Jul 03 '18 at 08:17

2 Answers2

3

Using the data you provided and one similar to it for a second data frame, you can use the %in% operator:

df1 <- read.table(text = "ID1 ExtractDate1    MktSeg1 ConfirmationNo
145  3/7/2017          2    29083253
                  145  3/7/2017          1    29085100
                  145  3/7/2017          1    29085102
                  145  3/7/2017          1    29085106
                  145  3/7/2017          1    29084895
                  145  3/7/2017          1    29084953", header = TRUE)

df2 <- read.table(text = "ID1 ExtractDate1    MktSeg1 ConfirmationNo
145  3/7/2017          2    29083253
                  145  3/7/2017          1    29085106
                  145  3/7/2017          1    29084895
                  145  3/7/2017          1    29084953
                  145  3/7/2017          1    29084899
                  145  3/7/2017          1    29084959", header = TRUE)

df1$conf_flag <- as.numeric(df1$ConfirmationNo %in% df2$ConfirmationNo)
df1
MHammer
  • 1,274
  • 7
  • 12
3

If you want to stick with sqldf, which may not be the fastest option, then you may try rewriting your query to use EXISTS:

combi1 <- sqldf("SELECT ID1,
            ExtractDate1,
            MktSeg1,
            ConfirmationNo,
            CASE WHERE EXISTS (SELECT 1 FROM df2
                WHERE ID1 = ID2
                AND ExtractDate2 >= ExtractDate1
                AND ConfirmationNo = confno)
            THEN 1
            ELSE 0
            END AS Flag
            FROM df1", drv = "SQLite")

Switching to EXISTS means we don't have to use DISTINCT anymore. It also lets us do away with the left join, because the logic can stop scanning after the first match.

Note that there is a lot more tuning we could do on an actual database. For sqldf this may be as much as we can do.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • I tried this as well but it is still taking more time. It does the task as well. Thank you so much. – Toros91 Jul 04 '18 at 01:44