I have the dataset with the following structure:
Classes 'data.table' and 'data.frame': 214175 obs. of 12 variables:
$ app_id : int 35949 49453 49970 50913 5| __truncated__ ...
$ customer_id : int 35948 49452 49452 50912 5| __truncated__ ...
$ sent_at : POSIXct, format: "2012-07-01 23:33:32.262" "2012-08-19 09:56:37.533" "2012-08-25 02:09:41.296" "2012-08-26 09:54:35.345" ...
$ ip_address_id : int 102298 96092 96091 67963 | __truncated__ ...
$ device_id : int NA NA NA NA NA NA NA NA NA NA ...
$ person_id : int 138622 9551 9551 28228 14| __truncated__ ...
$ passport_id : int 115828 148527 148527 1464| __truncated__ ...
$ email_id : int 19097 2685 2685 163914 69| __truncated__ ...
$ mobile_phone_id : int 104954 157463 157463 1032| __truncated__ ...
$ work_phone_id : int 68337 309192 309192 11972| __truncated__ ...
$ contact_phone_id: int NA NA NA NA NA NA NA NA NA NA ...
$ rejected : logi FALSE FALSE TRUE TRUE TRUE FALSE ...
could you help me to translate this SQL script with subqueries to the data.table
expressions:
SELECT app_id,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE device_id = a.device_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id) AS rule_1,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE device_id = a.device_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id
AND rejected = 1) AS rule_2,
(SELECT count(DISTINCT person_id)
FROM am_data
WHERE device_id = a.device_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id) AS rule_3,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE ip_address_id = a.ip_address_id
AND sent_at < a.sent_at
AND sent_at >= datetime(a.sent_at, '-14 days')
AND customer_id != a.customer_id
AND rejected = 1) AS rule_4,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE contact_phone_id = a.mobile_phone_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id
AND rejected = 1) AS rule_5,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE work_phone_id = a.mobile_phone_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id
AND rejected = 1) AS rule_6,
(SELECT count(DISTINCT customer_id)
FROM am_data
WHERE person_id = a.person_id
AND passport_id != a.passport_id
AND sent_at < a.sent_at
AND customer_id != a.customer_id) AS rule_7
FROM am_data AS a;
My solution for the rule_1:
cols <- c("device_id", "customer_id", "app_id", "sent_at", "rejected")
tmp <- am_data[!is.na(device_id), ..cols]
setorder(tmp, sent_at)
tmp[, rule_1 := if (.N > 1L) cumsum(!duplicated(customer_id)) - 1L else 0L, by = "device_id"]
To get the desired results to compare data.table
solutions do the following code:
## ---- Load packages ----
library(data.table)
library(sqldf)
## ---- Load data ----
am_data <- fread("https://gitlab.com/artemklevtsov/test-dt/raw/master/data/am_data.csv")
am_data <- unique(am_data, by = "app_id") # omit duplicates
am_data[, sent_at := as.POSIXct(sent_at, format = "%Y-%m-%dT%H:%M:%OS")]
setorder(am_data, sent_at)
## ---- SQL query ----
sql_query <- readLines("https://gitlab.com/artemklevtsov/test-dt/raw/master/sql/query.sql")
idx_cols <- names(am_data)[!sapply(am_data, is.logical)]
res <- sqldf(
c(sprintf("CREATE INDEX %s_idx ON am_data(%s)", idx_cols, idx_cols),
paste(sql_query, collapse = "\n"))
)
NOTE: I will accept any answer containing a solution that matches the correct result.
Links: