Fill secondly data from Q KDB+

Question

I have a csv file with some high frequency stock price data, and I'd like to get a secondly price data from the table.

In each file, there are columns named date, time, symbol, price, volume, and etc.

There are some seconds with no trading so there are missing data in some seconds.

I'm wondering how could I fill the missing data in Q to get the secondly data from 9:30 to 16:00 in full? If there is missing price, just use the recently price as its price in that second.

I'm considering to write some loop, but I don't know how to exactly to that.

Can you add an example of any code that you already have, along with the code where you are currently pulling the data? Read https://stackoverflow.com/help/mcve for information on how to write a good question. — Steve Mitcham, Jan 20 '15 at 18:38

Manish Patel · Answer 1 · 2015-01-21T09:16:12.000

Simplifying a little, I'll assume you have some random timestamps in your dataset like this:

time                            price
--------------------------------------
2015.01.20D22:42:34.776607000   7
2015.01.20D22:42:34.886607000   3
2015.01.20D22:42:36.776607000   4
2015.01.20D22:42:37.776607000   8
2015.01.20D22:42:37.886607000   7
2015.01.20D22:42:39.776607000   9
2015.01.20D22:42:40.776607000   4
2015.01.20D22:42:41.776607000   9

so there are some missing seconds there. I'm going to call this table t. So if you do a by-second type of query, obviously the seconds that are missing are still missing:

q)select max price by time.second from t
second  | price
--------| -----
22:42:34| 7
22:42:36| 4
22:42:37| 8
22:42:39| 9
22:42:40| 4
22:42:41| 9

To get missing seconds, you have to join a list of nulls. In this case we know the data goes from 22:42:34 to 22:42:41, but in reality you'll have to find the min/max time and use that to create a temporary "null" table to join against:

q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N)
second   price
--------------
22:42:34
22:42:35
22:42:36
22:42:37
22:42:38
22:42:39
22:42:40
22:42:41

Then left join:

q)([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second   price
--------------
22:42:34 7
22:42:35
22:42:36 4
22:42:37 8
22:42:38
22:42:39 9
22:42:40 4
22:42:41 9

You can use fills or whatever your favourite filling heuristic is after that.

q)fills `second xasc asc ([] second:22:42:34 + til 1+`int$22:42:41-22:42:34 ; price:(1+`int$22:42:41-22:42:34)#0N) lj select max price by time.second from t
second   price
--------------
22:42:34 7
22:42:35 7
22:42:36 4
22:42:37 8
22:42:38 8
22:42:39 9
22:42:40 4
22:42:41 9

(Note the sort on second before fills!)

By the way for larger tables this will be much faster than a loop. Loops in q are generally a bad idea.

EDIT

You could use a comma join too, both tables need to be keyed on the second column

t,t1

(where t1 is the null-filled table keyed on second)

I haven't tested it, but I suspect it would be slightly faster than the lj version.

aj will also work - make sure the table is sorted though, because it does a binary search to find the nearest match. It will be much slower than my solution too :-) — Manish Patel, Jan 21 '15 at 21:55

score 0 · Answer 2 · answered Sep 22 '18 at 21:58

Using aj which is one of the most powerful features of KDB:

q)data
sym  time     price     size
----------------------------
MS   10:24:04 93.35974  8
MS   10:10:47 4.586986  1
APPL 10:50:23 0.7831685 1
GOOG 10:19:52 49.17305  0

in-memory table needs to be sym,time sorted with g# attribute applied to sym column

q)data:update `g#sym from `sym`time xasc data
q)meta trade
c    | t f a
-----| -----
sym  | s   g
time | v
price| f
size | j

Creating a rack table intervalized per second per sym :

q)rack: `sym`time xasc (select distinct sym from data) cross   ([] time:{x[0]+til `int$x[1]-x[0]}(min;max)@\:data`time)

Using aj to join the data :

q)aj[`sym`time; rack; data]

Fill secondly data from Q KDB+

2 Answers2