3

This is a question about data structures and overall approaches to a difficult data wrangling problem that I would like to tackle in R. If I've learnt anything here, it is that there's often really nice solutions to difficult problems so any advice would be much appreciated

I have orderbook data from a stock exchange and I'm trying to rebuild the price-time priority queue at any time from it. EG. There is a queue for anyone who wants to trade: The buyer (or seller) with the highest bid (or lowest sellprice) should always be the first to buy (sell) when someone else comes along to trade with them. If there is more than one person who is offering the same price, then it is the person who declared their intention to buy (or sell) earlier that is at the front of the queue. It gets a little messy because you can change the price and volume of your order after you've sent your order in. If you increase the volume of your order you should lose your place in the queue (but not if you decrease the volume of your order). So in practice, the queue can change around a lot and I'd like to accurately know who is at the front of the queue of buyers (and the same for sellers) through the day, at any time that I might need to look at.

Here is an example of the data I am working with:

library(data.table)

set.seed(1)
unique.ids <- c("B-AAA","B-AAB","B-AAC","A-AAA","A-AAB", "A-AAC")
seconds.since.midnight <- sort(sample(40000:40010, 12, replace = TRUE))
order.type <- c("ENTER", "AMEND", "DELETE", "TRADE")

DT <- data.table(order = 1:12,
                 time.scnds = seconds.since.midnight,
                 type = order.type[c(1,1,1,1,2,4,
                                     1,2,2,4,3,
                                     1)],
                 bid.id = NA,
                 ask.id = NA,
                 price = c(3.0,3.5, 3.3, 3.8,3.9,3.8, 3.8, 3.95, 3.8, 3.8,NA, 4.1), 
                 volume = c(50,50,60,100,60,60,200,40,50,50,NA, 100),
                 oldprice = c(NA,NA,NA,NA,3.3,NA,NA,3.8,3.0,NA,NA,NA),
                 oldvolume = c(NA, NA,NA,NA,60,NA,NA,40,50,NA,NA,NA))


DT$bid.id[c(1,2,3,5,6,9,10)] <- unique.ids[c(1,2,3,3,3,1,1)]
DT$ask.id[c(4,6,7,8,10,11,12)] <- unique.ids[c(4,4,5,4,5,5,6)]
DT

    order time.scnds   type bid.id ask.id price volume oldprice oldvolume
 1:     1      40000  ENTER  B-AAA     NA  3.00     50       NA        NA
 2:     2      40001  ENTER  B-AAB     NA  3.50     50       NA        NA
 3:     3      40002  ENTER  B-AAC     NA  3.30     60       NA        NA
 4:     4      40002  ENTER     NA  A-AAA  3.80    100       NA        NA
 5:     5      40002  AMEND  B-AAC     NA  3.90     60      3.3        60
 6:     6      40004  TRADE  B-AAC  A-AAA  3.80     60       NA        NA
 7:     7      40006  ENTER     NA  A-AAB  3.80    200       NA        NA
 8:     8      40006  AMEND     NA  A-AAA  3.95     40      3.8        40
 9:     9      40007  AMEND  B-AAA     NA  3.80     50      3.0        50
10:    10      40009  TRADE  B-AAA  A-AAB  3.80     50       NA        NA
11:    11      40009 DELETE     NA  A-AAB    NA     NA       NA        NA
12:    12      40010  ENTER     NA  A-AAC  4.10    100       NA        NA

As orders enter, amend, trade, and get deleted their order in the queue changes. I would like to end up with something like a queue for buyers and a queue for sellers. I don't know what the best data structure for this would be, but if it were a list it might look like this for example:

buyers.queue
[[1]]
[1] "B-AAA"

[[2]]
[1] "B-AAB" "B-AAA"

[[3]]
[1] "B-AAB" "B-AAC" "B-AAA"

[[4]]
[1] "B-AAB" "B-AAC" "B-AAA"

[[5]]
[1] "B-AAC" "B-AAB" "B-AAA"

[[6]]
[1] "B-AAC" "B-AAB" "B-AAA"

[[7]]
[1] "B-AAB" "B-AAA"

[[8]]
[1] "B-AAB" "B-AAA"

[[9]]
[1] "B-AAA" "B-AAB"

[[10]]
[1] "B-AAA" "B-AAB"

[[11]]
[1] "B-AAB"

[[12]]
[1] "B-AAB"

So each element of the list (it has length 12) is the queue at that row of DT and the queue is ordered by best price (in this case, best price offered by different bids, and then by time since arrival/recent amendment). Notice how the order can change around a lot.

For the sellers we would have something like this (if it were in a list- it doesn't have to be):

sellers.queue

[[1]]
[1] NA

[[2]]
[1] NA

[[3]]
[1] NA

[[4]]
[1] "A-AAA"

[[5]]
[1] "A-AAA"

[[6]]
[1] "A-AAA"

[[7]]
[1] "A-AAA" "A-AAB"

[[8]]
[1] "A-AAB" "A-AAA"

[[9]]
[1] "A-AAB" "A-AAA"

[[10]]
[1] "A-AAB" "A-AAA"

[[11]]
[1] "A-AAA"

[[12]]
[1] "A-AAA" "A-AAC"

I'm not even sure how to begin to tackle this problem or even what shape/object class the result should have. Any suggestions?

Thanks for reading

user5364303
  • 75
  • 1
  • 8

1 Answers1

0

Typing your query into Google (I typed "modelling order book data in R") returns several links. The 2nd points to the orderbook R package and this paper. Hope that helps and shows you how they did it at least. If not, then please do come back with another question showing that you've searched the space.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • @ matt-dowle Thanks for the reply. The orderbook package doesn't have the functionality that I need (I need to identify the orders that the Market Depth is composed of (and in order of priority)). I'll post an answer to this problem if/when I have one – user5364303 Feb 10 '16 at 03:35
  • 1
    @user5364303 Then you're probably best off with commercial products; e.g. KDB, OneTick. R doesn't have the record-by-record update performance you need for this. Or go straight to C, Rcpp, Cython and manage over-allocated vectors yourself directly. – Matt Dowle Feb 10 '16 at 20:20