Filling missing data in a table

Question

I have a problem with missing data. So I have a returns table created this way:

returns:update logret:log ret from update ret:{0.0, 1_deltas x} mid by sym from spots;

and looking like this:

meta returns
c     | t f a
------| -----
date  | d    
time  | p    
sym   | s    
mid   | f    
ret   | f    
logret| f

and I pivot it this way:

rettbl:0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns;

so that works on a small range of time, but when I extend it fails with a "type: Mismatched types"

I believe the problem is that I don't have the same observations for all sym in my returns table, as
select count time by sym from returns
gives me a different count by sym.

So my question is: how would you go about filling any missing mid for all times I have in my table returns? To make it quick and dirty i was thinking about having a linear interpolation for missing points between two `time ...
Any hint, link, etc... would be greatly appreciated.

EDIT:
I can't share the data, but this should allow to reproduce the issue:

mid1: 1.2 + ({rand 1.0} each til 10) %100
mid2: 0.8 + ({rand 1.0} each til 10) %100
mid3: 104 + ({rand 1.0} each til 10) %100
sym1:{`$"EUR/USD"} each til 10
sym2:{`$"GBP/USD"} each til 10
sym3:{`$"USD/JPY"} each til 10
time1:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10
/time2:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10
time2:(2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 3) , (2020.07.06D00:00:00.000000000 + 1D00:00:00 * til 3) , (2020.07.10D00:00:00.000000000 + 1D00:00:00 * til 4)
time3:2020.07.01D00:00:00.000000000 + 1D00:00:00 * til 10

spots:([] sym:sym1,sym2,sym3; time:time1,time2,time3; mid:mid1,mid2,mid3)
spots:update date:"d"$time from spots
returns:update logret:log ret from update ret:{0.0, 1_deltas x} mid by sym from spots;
rettbl:0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns;

do note the time2 definition.
as per the desired output, if you replace the time2 definition with the one I commented out you will see.
as per the solution, I suspect that adding to the table returns the missing time, so that the list of time is the same for all sym are identical will solve the issue, ie value select time by sym from returns should be identical, and a quick a dirty work around would be to add missing mid with a linear interpolation of the surrounding mid.

also an expected output in order to understand what you want — Matt Moore, Nov 27 '20 at 17:55
Hi Mat. I did my best to create a table to reproduce the issue. I hope I'm clear about the cause of the problem (the time by sym are not the same) and about what im trying to achieve to solve it: the filling of the returns tables so that all time of the returns table are present for all sym, and the missing mid would be interpolated linearly. I believe the pivot wil work in these conditions. — Will, Nov 27 '20 at 18:28

Matt Moore · Answer 1 · 2020-11-29T21:25:09.463

The P# is pretty crucial here in order to produce the pivot properly. This will place nulls where time is missing for a sym col. https://code.kx.com/q/kb/pivoting-tables/

{

    P:exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret)) from returns;
    exec P#(raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns

    }[]

Output:
time                         | EUR/USD_mid GBP/USD_mid USD/JPY_mid EUR/USD_re..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| 1.203915    0.8078337   104.0062    0         ..
2020.07.02D00:00:00.000000000| 1.200812    0.8040996   104.0093    -0.0031030..
2020.07.03D00:00:00.000000000| 1.209368    0.8061088   104.0027    0.00855514..
2020.07.04D00:00:00.000000000| 1.202782                104.0006    -0.0065853..
2020.07.05D00:00:00.000000000| 1.202392                104.0026    -0.0003897..
2020.07.06D00:00:00.000000000| 1.201508    0.8049765   104.0023    -0.0008842..
2020.07.07D00:00:00.000000000| 1.201567    0.8040875   104.0009    5.91839e-0..
2020.07.08D00:00:00.000000000| 1.209785    0.8044973   104.001     0.00821768..
2020.07.09D00:00:00.000000000| 1.207043                104.0087    -0.0027416..
2020.07.10D00:00:00.000000000| 1.209442    0.8001392   104.0073    0.00239835..
2020.07.11D00:00:00.000000000|             0.8071488                         ..
2020.07.12D00:00:00.000000000|             0.8019465                         ..

You are getting a type error because the code is creating a dictionary when time is not present for all syms.

q)0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
'type
  [0]  0!exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
        ^
q)exec (raze `$raze ( string (distinct sym)),\:/:("_",/:string `mid`ret`logret))!(mid, ret, logret) by time:time from returns
time                         |                                               ..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.02D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.03D00:00:00.000000000| `EUR/USD_mid`GBP/USD_mid`USD/JPY_mid`EUR/USD_r..
2020.07.04D00:00:00.000000000| `EUR/USD_mid`USD/JPY_mid`EUR/USD_ret`USD/JPY_r..
2020.07.05D00:00:00.000000000| `EUR/USD_mid`USD/JPY_mid`EUR/USD_ret`USD/JPY_r..

Edit: To remove nulls you could use fills to fill forward non-null values. You also may want to use .Q.id to sanitize the table first. Columns such as EUR/USD_mid are tricky to work with due to the confusion with over /.

https://code.kx.com/q/ref/dotq/#qid-sanitize

q)fills x
time                         | EURUSD_mid GBPUSD_mid USDJPY_mid EURUSD_ret   ..
-----------------------------| ----------------------------------------------..
2020.07.01D00:00:00.000000000| 1.203928   0.8049318  104.0047   0            ..
2020.07.02D00:00:00.000000000| 1.205171   0.8057852  104.0063   0.001243387  ..
2020.07.03D00:00:00.000000000| 1.20516    0.8008389  104.0097   -1.11147e-005..
2020.07.04D00:00:00.000000000| 1.204067   0.8008389  104.0023   -0.001093155 ..
2020.07.05D00:00:00.000000000| 1.201781   0.8008389  104.0095   -0.002285803 ..

This will remove most nulls except for any that exist in the first row. I don't have any experience writing something such as linear interpolation in kdb and I don't think it will replace nulls as well as fills. For example, the example data set has 3 nulls at the end in the EURUSD_mid column so can't be interpolated for those 3 entries.

Thank you Mat. got it for the P# . but the thing is, it's not really solving my problem which is to fill the missing mid. so after you pivot, on your output, the 11th and 12th for EUR/USD_mid are null, and the 4th and 5th for GBP/USD_mid are also null. And what I would like to do is to fill all these 0Nf with linearly interpolated values. for ex 4th GBP/USD_mid would be equal to 0.8061088 + (0.804977-0.806109)/3 = 0.805731, and 11th of jul of EUR/USD_mid would be 1.206442 + (1.209442 - 1.207043). maybe it s too generic a question, but any pointer would be useful. Thanks in any case for the P# — Will, Nov 27 '20 at 23:41

Filling missing data in a table

1 Answers1