rewrite a kdb script in DolphinDB database

Question

I am trying to rewrite a kdb script in DolphinDB.

Let me first explain what I need to do. If a signal is above a threshold T1, we establish a long position in the security. We don't want to close the position as soon as the signal dips below T1, so we give it a cushion: only when the signal dips below T10 that is less than T1 do we close the position.

On the other side, if the signal is below a threshold T2, we establish a short position. Only when the signal moves above T20>T2 do we close the position.

T1>T10>T20>T2.

Basically I need the following vector:

 - if signal>T1, return 1. Subsequent elements are 1 until when signal<T10; 
 - if signal<T2, return -1. Subsequent elements are -1 until when signal>T20;
 - 0 otherwise

The kdb script for the task above is:

0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

Does anyone how to rewrite it in DolphinDB?

score 1 · Answer 1 · answered Aug 17 '19 at 06:33

I did literal translation in DolphinDB version 0.97.4

eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

iif(cond, trueResult, falseResult) is a element-wise conditional function. 00h denotes a null value in short type. nullFill(X, Y) replace null values in X by corresponding value in Y. ffill(X) replace null values in X by preceding values. Both loop and eachPost are two higher order functions.

Testing case in DolphinDB

T1= 60
T10 = 50
T20 = 30
T2 = 20
signal = 10 20 70 59 42 49 19 25 26  35
eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

-1 -1 1 1 0 0 -1 -1 -1 0

Testing case in KDB+

T1:60
T10:50
T20:30
T2:20
signal:10 20 70 59 42 49 19 25 26  35
0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

-1 -1 1 1 0 0 -1 -1 -1 0

I also did a quick performance comparison. I generated 10 millions random signal and ran the above expression in DolphinDB and KDB+, respectively. KDB+ took 800ms whereas DolphinDB took only 480ms. Below are the performance testing code.

//DolphinDB
T1= 60
T10 = 50
T20 = 30
T2 = 20
signal = 1 + rand(99.0, 10000000)
timer eachPost(-, loop(nullFill, [iif(signal<T10, 0h, 00h), iif(signal>T20, 0h, 00h)], [iif(signal>T1, 1h, 00h), iif(signal<T2, 1h, 00h)]))[0].ffill().nullFill(0h)

//KDB+
T1:60
T10:50
T20:30
T2:20
signal: 1.0 + 10000000 ? 99.0
\t  0h^fills(-).(0N 1h)[(signal>T1;signal<T2)]^'(0N 0h)[(signal<T10;signal>T20)]

Jackson · Answer 2 · 2019-11-10T11:30:26.593

Update 9/11/2019

Note updated version

WITH 
    rand()/4294967295*100 AS s,
    60 AS t1, 
    50 AS t10, 
    30 AS t20, 
    20 AS t2, 
    if(s < t10, 0, if(s > t1, 1, NULL)) as signal1,
    if(s > t20, 0, if(s < t2, 1, NULL)) as signal2
SELECT
    arrayFill(x -> (x != -2), groupArray(toInt8(ifNull(signal1 - signal2, -2)))) as k
FROM numbers_mt(10000000);

Updated based on feedback from Summer.H. Following timings best of 5 runs on each on my system (Core i7-7820X). Very little material difference in the timings between each.

10 million signals

Random generation + computation

1 thread - DolphinDB 337ms, ClickHouse 291ms
2 threads (1 core) - DolphinDB 226ms, ClickHouse 189ms

Computation only

1 thread - DolphinDB 302ms, ClickHouse 233ms
2 threads (1 core) - DolphinDB 179ms, ClickHouse 165ms

250 million signals

Random generation + computation

1 thread - DolphinDB 7.901s, ClickHouse 7.103s
2 threads (1 core) - DolphinDB 4.786s, ClickHouse 4.297s
4 threads (2 cores) - ClickHouse 2.965s

Computation only

1 thread - DolphinDB 7.106s, ClickHouse 5.564s
2 threads (1 core) - DolphinDB 3.966s, ClickHouse 3.668s
4 threads (2 cores) - ClickHouse 2.573s

Original

Note this doesn't answer the specific question as it relates to DolphinDB - but here's a version using ClickHouse as well.

WITH 
    60 AS t1, 
    50 AS t10, 
    30 AS t20, 
    20 AS t2, 
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM 
(
    SELECT arrayJoin([10, 20, 70, 59, 42, 49, 19, 25, 26, 35]) AS s
)
FORMAT TSV

[-1,-1,1,1,0,0,-1,-1,-1,0]

Benchmarking 10 million random samples. Fastest DolphinDB answer by Summer.H on my system, best run out of 5:

DolphinDB (4 threads)

./dolphindb
DolphinDB Systems 0.99.0 64 bit Copyright (c) 2011~2019 DolphinDB, Inc. Licensed to Trial User. Expires on 2019.12.31 (Build:2019.10.25)

>timer t1= 60
timer t10 = 50
timer t20 = 30
timer t2 = 20
timer signal = rand(100.0, 10000000)
timer direction = (iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))).ffill().nullFill(0h)
;>>>>>>
Time elapsed: 0.01 ms
Time elapsed: 0.001 ms
Time elapsed: 0.001 ms
Time elapsed: 0.001 ms
Time elapsed: 72.675 ms
Time elapsed: 305.442 ms

Total time: 378 ms

ClickHouse (restricted 4 threads like DolphinDB), best run out of 5:

CREATE TEMPORARY TABLE dtest2 AS
WITH
    rand()%100 + rand()/4294967295 AS s,
    60 AS t1,
    50 AS t10,
    30 AS t20,
    20 AS t2,
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM numbers_mt(10000000)

Ok.

0 rows in set. Elapsed: 0.300 sec. Processed 10.00 million rows, 80.00 MB (33.37 million rows/s., 266.94 MB/s.)

Total time 300 ms

ClickHouse, default configuration/no thread limit, best run out of 5:

CREATE TEMPORARY TABLE dtest2 AS
WITH
    rand()%100 + rand()/4294967295 AS s,
    60 AS t1,
    50 AS t10,
    30 AS t20,
    20 AS t2,
    ([if(s < t10, 0, NULL), if(s > t20, 0, NULL)], [if(s > t1, 1, NULL), if(s < t2, 1, NULL)]) AS signal
SELECT arrayFill(x -> (x != -2), groupArray(ifNull(coalesce((signal.1)[1], (signal.2)[1]) - coalesce((signal.1)[2], (signal.2)[2]), -2))) AS k
FROM numbers_mt(10000000)

Ok.

0 rows in set. Elapsed: 0.191 sec. Processed 10.00 million rows, 80.00 MB (52.22 million rows/s., 417.74 MB/s.)

Final timings from each:

kdb (800ms), DolphinDB (480ms, 378ms, 330ms?), ClickHouse (191ms)

I further benchmarked here with 250 million random signals - DolphinDB took 9214.99ms (1420ms random signal generation + 7794.24ms calculation). ClickHouse took 4272 ms total for random generation and calculation.

Was limited with my DolphinDB licence but ClickHouse managed 1 billion signals in 17.4 seconds (memory table) or 20.1 seconds (to disk).

Interesting result! Both DolphinDB and kdb+ actually run the script in single thread even though DolphinDB community version offers up to 2 threads (not 4). DolphinDB database offers the function `pcall` to parallelize the execution. In DolphinDB, a single thread took 6,412 ms for calculation (250 million) while two threads took 2,938ms and four threads took only 2,086 ms. The newest code is in my updated answer. Maybe you can try it. — Summer.H, Nov 08 '19 at 06:40
Right - I think an earlier community version had 4 threads? Re-ran the benchmarks on my system with 250m counters using your updated code. For calculation only, 1 thread, DolphinDB is 7.802s, ClickHouse is 5.704s. Interesting things happen on two threads. Combining statements (random generation and calculation) into one call - DolphinDB is 5.441s, ClickHouse is 4.609s. For calculation only DolphinDB is 3978.46 ms and ClickHouse is 3.747s however the random data generation in ClickHouse in a separate statement takes around 300ms longer than DolphinDB. — Jackson, Nov 08 '19 at 12:39
With 10 million signals, 4 threads, clickhouse took 300 ms(in the original answer), while 1 thread, it took 291ms(in the updated answer). Is 1 thread is faster than 4 thread in clickhouse? — Summer.H, Nov 10 '19 at 02:58
Ah yep - I updated the query and didn't put it in the edited answer. Have updated now. New query is faster than the original one I provided. — Jackson, Nov 10 '19 at 11:28

Summer.H · Answer 3 · 2019-11-08T06:36:01.357

I optimized the testing code in DolphinDB like this:

t1= 60
t10 = 50
t20 = 30
t2 = 20
signal = rand(100.0, 10000000)
timer direction = (iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))).ffill().nullFill(0h)

It took only 330 ms.

Update: parallel version

DolphinDB offers the function pcall to parallelize the execution.

def foo(signal){
 t1= 60
 t10 = 50
 t20 = 30
 t2 = 20
 return iif(signal >t1, 1h, iif(signal < t10, 0h, 00h)) - iif(signal <t2, 1h, iif(signal > t20, 0h, 00h))
}
signal = rand(100.0, 250000000)

//with single threads
timer foo(signal).ffill().nullFill(0h)

//with multiple threads
timer pcall(foo,signal).ffill().nullFill(0h)

A single thread took 6,412 ms for calculation (250 million) while two threads took 2,938ms and four threads took only 2,086 ms.

score 0 · Answer 4 · answered Feb 19 '20 at 04:30

DolphinDB 1.01 introduced a new feature, JIT. One can achieve big performance improvement simply by adding a notation @jit in front of the function definition. Moreover, one can use for-loop to solve the above problem, much easier than the vectorized solution.

@jit
def calculate_with_jit(signal, n, t1, t10, t20, t2) {
  cur = 0
  idx = 0
  output = array(INT, n, n)
  for (s in signal) {
    if(s > t1) {           // (t1, inf)
      cur = 1
    } else if(s >= t10) {  // [t10, t1]
      if(cur == -1) cur = 0
    } else if(s > t20) {   // [t20, t10)
      cur = 0
    } else if(s >= t2) {   // [t2, t20]
      if(cur == 1) cur = 0
    } else {               // (-inf, t2)
      cur = -1
    }
    output[idx] = cur
    idx += 1
  }
  return output
}

On my machine, the jit version takes only 170ms for 10-million-long signal while the vectorized version takes 410ms.

Please refer the jit tutorial (https://github.com/dolphindb/Tutorials_EN/blob/master/jit.md) for more details

rewrite a kdb script in DolphinDB database

4 Answers4