
I have a dataset with the following structure:

Classes 'data.table' and 'data.frame':  214175 obs. of  12 variables:
 $ app_id          : int  35949 49453 49970 50913 5| __truncated__ ...
 $ customer_id     : int  35948 49452 49452 50912 5| __truncated__ ...
 $ sent_at         : POSIXct, format: "2012-07-01 23:33:32.262" "2012-08-19 09:56:37.533" "2012-08-25 02:09:41.296" "2012-08-26 09:54:35.345" ...
 $ ip_address_id   : int  102298 96092 96091 67963 | __truncated__ ...
 $ device_id       : int  NA NA NA NA NA NA NA NA NA NA ...
 $ person_id       : int  138622 9551 9551 28228 14| __truncated__ ...
 $ passport_id     : int  115828 148527 148527 1464| __truncated__ ...
 $ email_id        : int  19097 2685 2685 163914 69| __truncated__ ...
 $ mobile_phone_id : int  104954 157463 157463 1032| __truncated__ ...
 $ work_phone_id   : int  68337 309192 309192 11972| __truncated__ ...
 $ contact_phone_id: int  NA NA NA NA NA NA NA NA NA NA ...
 $ rejected        : logi  FALSE FALSE TRUE TRUE TRUE FALSE ...

Could you help me translate this SQL script with correlated subqueries into data.table expressions?

SELECT app_id,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE device_id = a.device_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id) AS rule_1,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE device_id = a.device_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id
           AND rejected = 1) AS rule_2,
       (SELECT count(DISTINCT person_id)
          FROM am_data
         WHERE device_id = a.device_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id) AS rule_3,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE ip_address_id = a.ip_address_id
           AND sent_at < a.sent_at
           AND sent_at >= datetime(a.sent_at, '-14 days')
           AND customer_id != a.customer_id
           AND rejected = 1) AS rule_4,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE contact_phone_id = a.mobile_phone_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id
           AND rejected = 1) AS rule_5,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE work_phone_id = a.mobile_phone_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id
           AND rejected = 1) AS rule_6,
       (SELECT count(DISTINCT customer_id)
          FROM am_data
         WHERE person_id = a.person_id
           AND passport_id != a.passport_id
           AND sent_at < a.sent_at
           AND customer_id != a.customer_id) AS rule_7
  FROM am_data AS a;

My solution for rule_1:

cols <- c("device_id", "customer_id", "app_id", "sent_at", "rejected")
tmp <- am_data[!is.na(device_id), ..cols]
setorder(tmp, sent_at)

tmp[, rule_1 := if (.N > 1L) cumsum(!duplicated(customer_id)) - 1L else 0L, by = "device_id"]
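The trick here is that `cumsum(!duplicated(x)) - 1L` gives, for each row, the number of distinct values seen on *earlier* rows. A minimal base-R illustration with toy customer IDs (assumed already ordered by `sent_at`):

```r
# Toy customer IDs for a single device, already ordered by sent_at
x <- c(5L, 7L, 7L, 9L, 5L)

# !duplicated(x) is TRUE the first time each value appears; the
# cumulative sum counts distinct values seen so far (inclusive),
# so subtracting 1 gives the distinct count over strictly earlier rows.
cumsum(!duplicated(x)) - 1L
# [1] 0 1 1 2 2
```

Note that, unlike the SQL, this counts distinct prior customers including the current row's own customer_id, so it only matches the query under the data's particular structure.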

To generate the reference results for comparing data.table solutions, run the following code:

## ---- Load packages ----

library(data.table)
library(sqldf)

## ---- Load data ----

am_data <- fread("https://gitlab.com/artemklevtsov/test-dt/raw/master/data/am_data.csv")
am_data <- unique(am_data, by = "app_id") # omit duplicates
am_data[, sent_at := as.POSIXct(sent_at, format = "%Y-%m-%dT%H:%M:%OS")]
setorder(am_data, sent_at)

## ---- SQL query ----

sql_query <- readLines("https://gitlab.com/artemklevtsov/test-dt/raw/master/sql/query.sql")
idx_cols <- names(am_data)[!sapply(am_data, is.logical)]
res <- sqldf(
    c(sprintf("CREATE INDEX %s_idx ON am_data(%s)", idx_cols, idx_cols),
      paste(sql_query, collapse = "\n"))
)

NOTE: I will accept any answer containing a solution that matches the correct result.

Artem Klevtsov
  • You may want to explain what all the rules mean. Based on your data.table translation, it seems like ```rule1``` is the cumulative amount of texts sent from each device type. I think?? Regardless, the SQL right now is a self-join. Your first data.table translation is not a self-join. I would recommend re-writing the SQL as some kind of analytic function or re-writing the data.table translation as non-equi self-joins. – Cole Apr 28 '19 at 02:04
  • I think the self-join SQL logic makes the conditions clearer to understand. I will confirm any solution that gives the right results. – Artem Klevtsov Apr 28 '19 at 04:59
  • 1
    Unfortunately, the non-equi joins of data.table do not allow for not equal (i.e., ```!=```). So it would be hard to efficiently translate it as you would have to chain the ```customer_id != customer_id``` in a separate call. Thus, I get an error because I'd get more than 2^31 rows when I try this: ```am_data[am_data, on = .(device_id = device_id, sent_at < sent_at), j = uniqueN(customer_id), allow.cartesian = TRUE, by = device_id]``` – Cole Apr 28 '19 at 12:04
  • I ran into this and tried to use `data.table` chaining to filter out rows with this condition. – Artem Klevtsov Apr 28 '19 at 12:38

1 Answer


Final Edit: I replaced uniqueN with length(unique()), which made the results fast. I also had a typo in my previous edit for rule 7. I used unique(am_data) to remove duplicates, and that seemed to fix everything except rule_4.
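For context on the `uniqueN` to `length(unique())` swap: both count distinct values, but with many small groups the base-R form can be faster because it avoids `uniqueN`'s per-call overhead. A minimal base-R sketch of the equivalence (with default arguments, `uniqueN` also counts `NA` as a distinct value):

```r
ids <- c(10L, 20L, 20L, 30L, NA)

# Distinct count, NA included as its own value -- the same result
# data.table::uniqueN(ids) gives with default arguments.
length(unique(ids))
# [1] 4
```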

> res_2[, lapply(.SD, sum), .SDcols = 2:8]
   rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1:  17167  10448  17165      2    606  16040  17072
> res[, lapply(.SD,sum), .SDcols = 2:8]
   rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
1:  17167  10448  17165      0    606  16040  17072
am_data <- unique(am_data)

# Prepare for Rules 1 - 3 -------------------------------------------------

am_data2 <- copy(am_data)[!is.na(device_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))

# Make Rules 1-3 happen ---------------------------------------------------

self_join <- am_data2[a, 
                  on = .(device_id = a.device_id,
                         sent_at < a.sent_at),
                  allow.cartesian = TRUE
                  ,nomatch = 0L
                  ][customer_id != a.customer_id]

rule_1 <- self_join[, length(unique(customer_id)), by = a.app_id]
rule_2 <- self_join[rejected == 1, length(unique(customer_id)), by = a.app_id]
rule_3 <- self_join[, length(unique(person_id)), by = a.app_id]
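To see how the non-equi self-join above behaves, here is a toy version on hypothetical data (assuming `data.table` is attached). Because `!=` is not supported in `on`, the customer_id condition is applied as a chained filter afterwards:

```r
library(data.table)

# Hypothetical toy data: three applications on the same device
dt <- data.table(app_id      = 1:3,
                 device_id   = c(1L, 1L, 1L),
                 customer_id = c(100L, 200L, 100L),
                 sent_at     = as.POSIXct(c("2019-01-01", "2019-01-02", "2019-01-03")))

a <- copy(dt)
setnames(a, paste0("a.", names(a)))

# Each row of `a` picks up the strictly earlier rows on the same device;
# the `!=` condition is chained as a filter after the join.
sj <- dt[a,
         on = .(device_id = a.device_id, sent_at < a.sent_at),
         nomatch = 0L][customer_id != a.customer_id]

# Distinct earlier customers per application, as in rule_1
sj[, length(unique(customer_id)), by = a.app_id]
```

Here app_id 1 has no earlier rows and drops out via `nomatch = 0L`, while app_ids 2 and 3 each see one distinct earlier customer.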


# Prepare for Rule 4 ------------------------------------------------------

am_data2 <- copy(am_data)[!is.na(ip_address_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))
a[, a.sent_at_range := a.sent_at - 14 * 86400] # POSIXct arithmetic is in seconds, so subtract 14 days' worth to match the SQL's '-14 days'


# Make Rule 4 happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
                      ][a,
                        on = .(ip_address_id = a.ip_address_id,
                               sent_at < a.sent_at,
                               sent_at >= a.sent_at_range),
                        allow.cartesian = TRUE
                        ,nomatch = 0L
                        ][customer_id != a.customer_id]

rule_4 <- self_join[, length(unique(customer_id)), by = a.app_id]


# Prepare for Rule 5 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(contact_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))


# Make Rule 5 happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
                      ][a,
                        on = .(contact_phone_id = a.mobile_phone_id,
                               sent_at < a.sent_at),
                        allow.cartesian = TRUE
                        ,nomatch = 0L
                        ][customer_id != a.customer_id]

rule_5 <- self_join[, length(unique(customer_id)), by = a.app_id]

# Prepare for Rule 6 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(work_phone_id)]
a <- copy(am_data)[!is.na(mobile_phone_id)]
setnames(a, paste0('a.', names(a)))


# Make Rule 6 Happen ------------------------------------------------------

self_join <- am_data2[rejected == 1
                      ][a,
                        on = .(work_phone_id = a.mobile_phone_id,
                               sent_at < a.sent_at),
                        allow.cartesian = TRUE
                        ,nomatch = 0L
                        ][customer_id != a.customer_id]

rule_6 <- self_join[, length(unique(customer_id)), by = a.app_id]


# Prepare for Rule 7 ------------------------------------------------------
am_data2 <- copy(am_data)[!is.na(person_id)]
a <- copy(am_data2)
setnames(a, paste0('a.', names(a)))


# Make Rule 7 Happen ------------------------------------------------------
self_join <- am_data2[a,
                        on = .(person_id = a.person_id,
                               sent_at < a.sent_at),
                        allow.cartesian = TRUE
                        # ,nomatch = 0L
                        ][customer_id != a.customer_id & passport_id != a.passport_id]

rule_7 <- self_join[, length(unique(customer_id)), by = a.app_id]


# Combine and cast the rules we made --------------------------------------

res_2 <- dcast(rbindlist(list(rule_1, rule_2, rule_3, rule_4, rule_5, rule_6, rule_7), idcol = 'rule'), formula = a.app_id ~ rule , fill = 0L)
setnames(res_2,2:8,  paste0('rule_', 1:7))
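The combine step stacks the seven per-rule tables with `rbindlist(idcol = )` and pivots them wide with `dcast`, filling absent app/rule combinations with 0. A toy sketch of the same pattern on hypothetical values (assuming `data.table` is attached; `value.var` is spelled out here, whereas above `dcast` guesses `V1`):

```r
library(data.table)

# Hypothetical per-rule results in the same shape as rule_1, rule_2, ...
r1 <- data.table(a.app_id = c(1L, 2L), V1 = c(3L, 1L))
r2 <- data.table(a.app_id = 2L,        V1 = 5L)

# idcol records which list element each row came from (1, 2, ...)
long <- rbindlist(list(r1, r2), idcol = "rule")

# Pivot rules into columns; fill = 0L covers app/rule pairs with no rows
wide <- dcast(long, a.app_id ~ rule, value.var = "V1", fill = 0L)
setnames(wide, 2:3, paste0("rule_", 1:2))
wide
```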

Results

> res_2
       a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
    1:    89033      0      0      0      0      0      1      0
    2:    95775      0      0      0      0      0      1      0
    3:    96542      0      0      0      0      0      1      0
    4:   106447      0      0      0      0      0      1      0
    5:   113040      0      0      0      0      0      1      0
   ---                                                          
21925: 34904219      1      1      1      0      0      1      0
21926: 34904725      1      1      1      0      0      0      1
21927: 34904750      1      0      1      0      0      1      1
21928: 34904921      1      0      1      0      0      0      1
21929: 34905033      0      0      0      0      0      1      1
> res[order(a.app_id) & (rule_1 > 0 | rule_2 > 0 | rule_3 > 0 |
 rule_4 > 0 | rule_5 >0 | rule_6 > 0 | rule_7 > 0)]

       a.app_id rule_1 rule_2 rule_3 rule_4 rule_5 rule_6 rule_7
    1:    89033      0      0      0      0      0      1      0
    2:    95775      0      0      0      0      0      1      0
    3:    96542      0      0      0      0      0      1      0
    4:   106447      0      0      0      0      0      1      0
    5:   113040      0      0      0      0      0      1      0
   ---                                                          
22403: 34904219      1      1      1      0      0      1      1
22404: 34904725      1      1      1      0      0      0      1
22405: 34904750      1      0      1      0      0      1      1
22406: 34904921      1      0      1      0      0      0      1
22407: 34905033      0      0      0      0      0      1      1

Original: Kept as it's keyed by device and may be helpful.

This is the data.table equivalent of the SQL for rule_1. I spot-checked the first 5 and last 5 results and they match up.

tmp2 <- am_data[!is.na(device_id), ..cols]
tmp2[tmp2, 
        on = .(device_id = device_id,
               sent_at > sent_at),
        allow.cartesian = TRUE
     ][customer_id != i.customer_id | is.na(customer_id),
       .N,
       keyby = device_id]
Cole
  • 1
    I edited my response as my original answer was closer to your ```data.table``` solution instead of the actual query result. – Cole Apr 28 '19 at 20:06
  • Thank you for the solution. It's clear to me. Note that my variant with `cumsum` also gives the correct result. I tried to expand it to the other rules but without success. – Artem Klevtsov Apr 29 '19 at 01:02
  • 2
    I updated it to no longer update by reference. The results are identical for rules 1 to 3, except the original SQL query has 3 app_ids which are duplicated. I also tried to expand this to the other rules and failed hard. – Cole Apr 29 '19 at 11:39
  • You can safely remove duplicates by `app_id`. I'll update the source data. – Artem Klevtsov Apr 29 '19 at 13:27
  • 1
    One more update. I'm close on rules 1-6, but rule 7 is not working using the self_join method. Your approach is faster for rule_1, but I don't understand the data well enough to translate the others into data.table expressions. Thanks for the well-organized question. – Cole Apr 30 '19 at 11:57
  • Sorry for all the updates. I had a typo on rule_7. All the rules are close to your SQL query. It's also pretty fast. Still, I realize that these aren't exactly the same results as you're looking for. – Cole May 01 '19 at 02:04
  • Nice. Thanks for participating. – Artem Klevtsov May 01 '19 at 05:48
  • I took unique() of the original dataset, and the sums now match up except for rule_4. However, when I try ```identical```, it returns FALSE for the other rules. – Cole May 01 '19 at 11:42