This is a question about data structures and overall approaches to a difficult data wrangling problem that I would like to tackle in R. If I've learnt anything here, it is that there's often really nice solutions to difficult problems so any advice would be much appreciated
I have orderbook data from a stock exchange and I'm trying to rebuild the price-time priority queue at any time from it. EG. There is a queue for anyone who wants to trade: The buyer (or seller) with the highest bid (or lowest sellprice) should always be the first to buy (sell) when someone else comes along to trade with them. If there is more than one person who is offering the same price, then it is the person who declared their intention to buy (or sell) earlier that is at the front of the queue. It gets a little messy because you can change the price and volume of your order after you've sent your order in. If you increase the volume of your order you should lose your place in the queue (but not if you decrease the volume of your order). So in practice, the queue can change around a lot and I'd like to accurately know who is at the front of the queue of buyers (and the same for sellers) through the day, at any time that I might need to look at.
Here is an example of the data I am working with:
library(data.table)
set.seed(1)
unique.ids <- c("B-AAA","B-AAB","B-AAC","A-AAA","A-AAB", "A-AAC")
seconds.since.midnight <- sort(sample(40000:40010, 12, replace = TRUE))
order.type <- c("ENTER", "AMEND", "DELETE", "TRADE")
DT <- data.table(order = 1:12,
time.scnds = seconds.since.midnight,
type = order.type[c(1,1,1,1,2,4,
1,2,2,4,3,
1)],
bid.id = NA,
ask.id = NA,
price = c(3.0,3.5, 3.3, 3.8,3.9,3.8, 3.8, 3.95, 3.8, 3.8,NA, 4.1),
volume = c(50,50,60,100,60,60,200,40,50,50,NA, 100),
oldprice = c(NA,NA,NA,NA,3.3,NA,NA,3.8,3.0,NA,NA,NA),
oldvolume = c(NA, NA,NA,NA,60,NA,NA,40,50,NA,NA,NA))
DT$bid.id[c(1,2,3,5,6,9,10)] <- unique.ids[c(1,2,3,3,3,1,1)]
DT$ask.id[c(4,6,7,8,10,11,12)] <- unique.ids[c(4,4,5,4,5,5,6)]
DT
order time.scnds type bid.id ask.id price volume oldprice oldvolume
1: 1 40000 ENTER B-AAA NA 3.00 50 NA NA
2: 2 40001 ENTER B-AAB NA 3.50 50 NA NA
3: 3 40002 ENTER B-AAC NA 3.30 60 NA NA
4: 4 40002 ENTER NA A-AAA 3.80 100 NA NA
5: 5 40002 AMEND B-AAC NA 3.90 60 3.3 60
6: 6 40004 TRADE B-AAC A-AAA 3.80 60 NA NA
7: 7 40006 ENTER NA A-AAB 3.80 200 NA NA
8: 8 40006 AMEND NA A-AAA 3.95 40 3.8 40
9: 9 40007 AMEND B-AAA NA 3.80 50 3.0 50
10: 10 40009 TRADE B-AAA A-AAB 3.80 50 NA NA
11: 11 40009 DELETE NA A-AAB NA NA NA NA
12: 12 40010 ENTER NA A-AAC 4.10 100 NA NA
As orders enter, amend, trade, and get deleted their order in the queue changes. I would like to end up with something like a queue for buyers and a queue for sellers. I don't know what the best data structure for this would be, but if it were a list it might look like this for example:
buyers.queue
[[1]]
[1] "B-AAA"
[[2]]
[1] "B-AAB" "B-AAA"
[[3]]
[1] "B-AAB" "B-AAC" "B-AAA"
[[4]]
[1] "B-AAB" "B-AAC" "B-AAA"
[[5]]
[1] "B-AAC" "B-AAB" "B-AAA"
[[6]]
[1] "B-AAC" "B-AAB" "B-AAA"
[[7]]
[1] "B-AAB" "B-AAA"
[[8]]
[1] "B-AAB" "B-AAA"
[[9]]
[1] "B-AAA" "B-AAB"
[[10]]
[1] "B-AAA" "B-AAB"
[[11]]
[1] "B-AAB"
[[12]]
[1] "B-AAB"
So each element of the list (it has length 12) is the queue at that row of DT and the queue is ordered by best price (in this case, best price offered by different bids, and then by time since arrival/recent amendment). Notice how the order can change around a lot.
For the sellers we would have something like this (if it were in a list- it doesn't have to be):
sellers.queue
[[1]]
[1] NA
[[2]]
[1] NA
[[3]]
[1] NA
[[4]]
[1] "A-AAA"
[[5]]
[1] "A-AAA"
[[6]]
[1] "A-AAA"
[[7]]
[1] "A-AAA" "A-AAB"
[[8]]
[1] "A-AAB" "A-AAA"
[[9]]
[1] "A-AAB" "A-AAA"
[[10]]
[1] "A-AAB" "A-AAA"
[[11]]
[1] "A-AAA"
[[12]]
[1] "A-AAA" "A-AAC"
I'm not even sure how to begin to tackle this problem or even what shape/object class the result should have. Any suggestions?
Thanks for reading