Is there a more scalable sub-select alternative to this query?

Question

I have a large table that contains a date field of type datetime. As part of a function that takes as input two lists of datetime namely a list of afroms and atos I'd like to compute for each of these afrom,ato pairs all the rows of the large table whose date is between them.

I worked out a not very efficient way to do this i.e. it has a serious scalability drawback:

/ t1 contains my afrom,ato pairs 
q)t1:([] afrom:`datetime$(2017.10.01T10:00:00.000 2017.10.02T10:00:00.000);ato:`datetime$(2017.10.01T12:00:00.000 2017.10.02T12:00:00.000));
q)t1
afrom                   ato                    
-----------------------------------------------
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000
2017.10.02T10:00:00.000 2017.10.02T12:00:00.000

/ t2 contains my very very large dataset  
q)t2:([] date:`datetime$(2017.10.01T10:01:00.000 2017.10.01T10:02:00.000 2017.10.01T10:03:00.000 2017.10.02T10:01:00.000 2017.10.02T10:02:00.000 2017.10.02T10:03:00.000); ccypair:(3#`EURUSD),(3#`USDCHF); mid:6?1.05);
q)t2
date                    ccypair mid                 
----------------------------------------------------
2017.10.01T10:01:00.000 EURUSD  0.24256133290473372 
2017.10.01T10:02:00.000 EURUSD  0.091602176288142809
2017.10.01T10:03:00.000 EURUSD  0.10756538207642735 
2017.10.02T10:01:00.000 USDCHF  0.91046513157198206 
2017.10.02T10:02:00.000 USDCHF  0.76424539103172717 
2017.10.02T10:03:00.000 USDCHF  0.17090452200500295

Then I can use cross like this:

select from (t1 cross t2) where afrom<date,date<ato

and this produces the correct results:

afrom                   ato                     date                    ccypa..
-----------------------------------------------------------------------------..
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000 2017.10.01T10:01:00.000 EURUS..
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000 2017.10.01T10:02:00.000 EURUS..
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000 2017.10.01T10:03:00.000 EURUS..
2017.10.02T10:00:00.000 2017.10.02T12:00:00.000 2017.10.02T10:01:00.000 USDCH..
2017.10.02T10:00:00.000 2017.10.02T12:00:00.000 2017.10.02T10:02:00.000 USDCH..
2017.10.02T10:00:00.000 2017.10.02T12:00:00.000 2017.10.02T10:03:00.000 USDCH..

However, when I have a large list of afroms and atos the cross will "unnecessarily" expand the potentially large table t2 times the size of t1 and that doesn't scale well.

Is there a better way to do this? e.g. I have tried things like:

select from t2 where (exec afrom from t1)<date,date<(exec ato from t1)
error: `length

I would need to do a loop I guess but not sure how. Subquestion .. is it possible to have a sigle list of interval tuples i.e. intervals(afrom;ato) instead of having separated afroms and atos?

score 1 · Accepted Answer · edited Dec 09 '17 at 17:31

1

If I understand this correctly then one way to do this would be use each row of t1 and find where any row falls within each time range:

select from t2 where any date within/:value each t1
date                    ccypair mid
-----------------------------------------
2017.10.01T10:01:00.000 EURUSD  0.41239
2017.10.01T10:02:00.000 EURUSD  0.5429457
...

Which is the same as yoru example output above, without the afrom and ato columns. In this example within selects all values within a range and any and each-right /: allow you to use multiple ranges. If this does not scale particularly well for you then you can work on each row of t1 individually:

raze{[x;y]select from x where date within value y}[t2]'[t1]

Which should work if the windows do not overlap.

If you need exclusive or then you could try modifying the window times slightly to ensure they are excluded:

q)select from t2 where any date within/:value each @[t1;`afrom`ato;+;1 -1*00:00:00.001]
date                    ccypair mid
-----------------------------------------
2017.10.01T10:01:00.000 EURUSD  0.41239
2017.10.01T10:02:00.000 EURUSD  0.5429457
...

To add the afrom and ato columns to the output you can cross them with the selected rows:

raze{[x;y]flip[1#'y]cross select from x where date within value[y]+1 -1*00:00:00.001}[t2]'[t1]
afrom                   ato                     date                    ccypair mid
-----------------------------------------------------------------------------------------
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000 2017.10.01T10:01:00.000 EURUSD  0.41239
2017.10.01T10:00:00.000 2017.10.01T12:00:00.000 2017.10.01T10:02:00.000 EURUSD  0.5429457
...

edited Dec 09 '17 at 17:31

SkyWalker

13,729
18
91
187

answered Dec 06 '17 at 10:32

Thomas Smyth - Treliant

4,993
6
25
36

Thanks for your answer! Note that I need **exclusive** within so unfortunately `within` doesn't work for me. – SkyWalker Dec 06 '17 at 11:01
In practice does this mean if a value fell into 2 windows it would be excluded, or does it mean that the boundary times are excluded? – Thomas Smyth - Treliant Dec 06 '17 at 11:22
Hi Thomas, thanks for reaching out. The boundary times are excluded so can't use `within`. When the `afrom,ato` has no matches then it gives no rows. – SkyWalker Dec 06 '17 at 14:37
In the example I included in my edit I modified the boundary times by the smallest time increment, this should allow you to use within as the actual boundary times would now be excluded from a query. – Thomas Smyth - Treliant Dec 06 '17 at 14:40
I also need `afrom` and `ato` as part of the output columns ... how would you integrate that into your solution? – SkyWalker Dec 07 '17 at 14:11
Added an additional code block to the answer, but the gist of it is that I crossed those times with the filtered list of data. This way you don't have to cross the times with all of the data. – Thomas Smyth - Treliant Dec 07 '17 at 22:58
Thank you so much! Great answer and clever solution that reveals the Q way to tackle problems. I made a small cosmetic change to keep the anonymous function clean. – SkyWalker Dec 09 '17 at 17:48

Is there a more scalable sub-select alternative to this query?

1 Answers1