1

I am desperately trying to reproduce the classical Pandas example of rolling joins, where quotes data is merged with trade data.

See here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html

Here is the data in data.table format:

trades <- data.table(time = c('2016-05-25 13:30:00.023',
                              '2016-05-25 13:30:00.038',
                              '2016-05-25 13:30:00.048',
                              '2016-05-25 13:30:00.048',
                              '2016-05-25 13:30:00.048'),
                     ticker = c('MSFT','MSFT','GOOG','GOOG','AAPL'),
                     price = c(51.95,51.95,720.77,720.92,98.0),
                     quantity = c(75,155,100,100,100))
> trades
                      time ticker  price quantity
1: 2016-05-25 13:30:00.023   MSFT  51.95       75
2: 2016-05-25 13:30:00.038   MSFT  51.95      155
3: 2016-05-25 13:30:00.048   GOOG 720.77      100
4: 2016-05-25 13:30:00.048   GOOG 720.92      100
5: 2016-05-25 13:30:00.048   AAPL  98.00      100

and quotes

quotes <- data.table(time = c('2016-05-25 13:30:00.023',
                              '2016-05-25 13:30:00.023',
                              '2016-05-25 13:30:00.030',
                              '2016-05-25 13:30:00.041',
                              '2016-05-25 13:30:00.048',
                              '2016-05-25 13:30:00.049',
                              '2016-05-25 13:30:00.072',
                              '2016-05-25 13:30:00.075'),
                     ticker = c('GOOG','MSFT','MSFT','MSFT','GOOG','AAPL','GOOG','MSFT'),
                     bid = c(720.50, 51.95, 51.97, 51.99, 720.5,97.99,720.5,52.01),
                     ask = c(270.93,51.96,51.98,52.00,720.93,98.01,720.88,52.03))
> quotes
                      time ticker    bid    ask
1: 2016-05-25 13:30:00.023   GOOG 720.50 270.93
2: 2016-05-25 13:30:00.023   MSFT  51.95  51.96
3: 2016-05-25 13:30:00.030   MSFT  51.97  51.98
4: 2016-05-25 13:30:00.041   MSFT  51.99  52.00
5: 2016-05-25 13:30:00.048   GOOG 720.50 720.93
6: 2016-05-25 13:30:00.049   AAPL  97.99  98.01
7: 2016-05-25 13:30:00.072   GOOG 720.50 720.88
8: 2016-05-25 13:30:00.075   MSFT  52.01  52.03

What I would like to do is merge the trades data with the quotes data in the following way

  1. for each trade, match the closest previous quote possible
  2. the matched quote must be within 10 millisecond
  3. exact matches should not occur.

The output (which is the same as in the Pandas tutorial) should be

                      time ticker  price quantity   bid   ask
1: 2016-05-25 13:30:00.023   MSFT  51.95       75    NA    NA
2: 2016-05-25 13:30:00.038   MSFT  51.95      155 51.97 51.98
3: 2016-05-25 13:30:00.048   GOOG 720.77      100    NA    NA
4: 2016-05-25 13:30:00.048   GOOG 720.92      100    NA    NA
5: 2016-05-25 13:30:00.048   AAPL  98.00      100    NA    NA

Indeed, you can see that the only possible quote match is for the second trade at 2016-05-25 13:30:00.038, because the closed (previous) quote occurs at 2016-05-25 13:30:00.030 so this is within 10ms (and not an exact match).

Despite my trials, I was not able to reproduce this in data.table. Any ideas? Thanks!

ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235

3 Answers3

2

You could also combine this idiom with a rolling join, which is similar but not quite equal to what @sindri_baldur proposed:

library(lubridate)
library(data.table)

quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]

match_inexact <- function(q_time, t_time, bid, ask) {
  exact <- q_time == t_time # exact matches get NA
  bid[exact] <- NA_real_
  ask[exact] <- NA_real_
  list(bid, ask)
}

trades[, c("bid", "ask") := quotes[.SD,
                                   match_inexact(x.time, i.time, x.bid, x.ask),
                                   on = .(ticker, time),
                                   roll = lubridate::dmilliseconds(10L)]]

An important thing to note: time is the last column specified for the join because that's the column where data.table will try to roll values.

Alexis
  • 4,950
  • 1
  • 18
  • 37
  • wow pretty good. Do you mind explaning a little bit more whats going on? – ℕʘʘḆḽḘ Jul 02 '19 at 18:00
  • 1
    In this case the rolling join notation (active when you specify `roll`) means: for the last column used to join, look behind by at most 10ms if there's no exact match. With `match_inexact` we can simply check whether the times were rolled or not: if `x.time == i.time`, no roll was done, it was an exact match. The idiom just assigns the resulting columns (each packed in a list) to `trades` by reference. – Alexis Jul 02 '19 at 18:07
  • ok got it. but one point, what does the i.time refers to? the time variable in the left DT? – ℕʘʘḆḽḘ Jul 02 '19 at 18:08
  • 2
    Ah, in the `data.table` documentation they usually describe things with variables `x[i, j, by]`, where `i` can be another `data.table`. During the join, they expose the variables with `x.` and `i.` prefixes, depending on which table the value was taken from. So `i.time` is from `.SD`, which represents the subset of data coming from `trades`. – Alexis Jul 02 '19 at 18:10
  • got it. and I guess x. refers to the left DT? in this case `trades`? – ℕʘʘḆḽḘ Jul 02 '19 at 19:24
  • `x.` does refer to the left, but the actual join is nested and specified as `quotes[.SD, ...`, so it would refer to `quotes`... – Alexis Jul 02 '19 at 20:23
1

Here is something (quick and dirty) that gets the job done:

# Format as POSIXct*
quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]

# Match the nearest time (in the right direction) for each ticker and add as column
trades[quotes, on = .(time > time, ticker), qtime := i.time]
# Remove if not within time limit (10 millsecs)
trades[(time - qtime) > 0.01, qtime := NA_real_]
# Now perform an equi-join after removing timestamp that was too distant
trades[, c("bid", "ask") := quotes[trades, on = .(time = qtime), .(bid, ask)]]
trades[, !"qtime"] # drop this temporary column

#                   time ticker  price quantity   bid   ask
# 1: 2016-05-25 13:30:00   MSFT  51.95       75    NA    NA
# 2: 2016-05-25 13:30:00   MSFT  51.95      155 51.97 51.98
# 3: 2016-05-25 13:30:00   GOOG 720.77      100    NA    NA
# 4: 2016-05-25 13:30:00   GOOG 720.92      100    NA    NA
# 5: 2016-05-25 13:30:00   AAPL  98.00      100    NA    NA

* POSIXct vectors are built on top of double vectors, where the value represents the number of seconds since 1970-01-01

Learning from Alexis' post here is a slightly cleaner version that uses the roll argument.

trades[, c("qtime", "bid", "ask") := quotes[.SD, roll = 0.01, on = .(ticker, time), .(x.time, bid, ask)]]
trades[time == qtime, c("bid", "ask") := NA_real_][, qtime := NULL]
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • interesting. thanks! could you please explain what you do there? what does the `qtime := i.time` do? why cant we just do it in one pass with some `roll` arguments? thanks! – ℕʘʘḆḽḘ Jul 02 '19 at 15:04
  • Will try to add some more comments. Maybe there is a way to do it with one roll argument. I am a beginner with non-equi joins. – s_baldur Jul 02 '19 at 15:10
  • same idea here. `time > time` is a bit ambiguous. how can I make sure I am referring to `time` of `quotes` or `time` of `trades`? – ℕʘʘḆḽḘ Jul 03 '19 at 11:26
  • Agree. But in this case the first time refers to `trades`. Anyway I would go with the second *cleaner* version. – s_baldur Jul 03 '19 at 12:30
1

Another possible non-equi join approach using the latest quote within that 10ms window:

options(digits.secs=3) #see https://stackoverflow.com/a/43475068/1989480
library(data.table)

quotes[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")]
trades[, time := as.POSIXct(time, format="%Y-%m-%d %H:%M:%OS", tz = "GMT")][,
    c("start", "end") := .(time-0.01, time)]

trades[, c("bid", "ask") :=
        quotes[trades, on=.(ticker, time>=start, time<end), mult="last", .(bid, ask)]
    ][, c("start", "end") := NULL]

output:

                      time ticker  price quantity   bid   ask
1: 2016-05-25 13:30:00.023   MSFT  51.95       75    NA    NA
2: 2016-05-25 13:30:00.038   MSFT  51.95      155 51.97 51.98
3: 2016-05-25 13:30:00.048   GOOG 720.77      100    NA    NA
4: 2016-05-25 13:30:00.048   GOOG 720.92      100    NA    NA
5: 2016-05-25 13:30:00.048   AAPL  98.00      100    NA    NA
chinsoon12
  • 25,005
  • 4
  • 25
  • 35