On a large table (> 1 billion rows), for every timestamp, I want to sum values over rows with a timestamp smaller or equal to the current timestamp where the price in column A (out_price
) is smaller than the price in column B (price
) from the current row.
Take this table for example:
| id | value | time | price | out_price |expected_output|
|----|-------|------|-------|-----------|---------------|
| a | 2 | 1 | 1 | 1 | 0 |
| a | -1 | 2 | 3 | 1 | 1 |--> row 1 and 2
| b | 1 | 2 | 3 | 3 | 1 |--> row 1 and 2
| c | 3 | 2 | 3 | 3 | 1 |--> row 1 and 2
| c | -2 | 3 | 2 | 3 | 1 |--> row 1 and 2
| a | 1 | 3 | 2 | 2 | 1 |--> row 1 and 2
| a | -1 | 3 | 2 | 2 | 1 |--> row 1 and 2
| c | 2 | 3 | 2 | 2 | 1 |--> row 1 and 2
| e | 1 | 3 | 2 | 2 | 1 |--> row 1 and 2
| b | -1 | 4 | 4 | 3 | 5 |--> row 1 to 10
| d | 4 | 4 | 4 | 4 | 5 |--> row 1 to 10
| e | -1 | 5 | 2 | 2 | 1 |--> row 1 and 2
for every row I want to check if and row with a time
equal or smaller than the current rows time
has an out_price
smaller than the current rows price
. If so I want to sum over all respective rows.
Here is a query to work with:
WITH data AS (
SELECT *
, SUM(value) OVER (PARTITION BY id ORDER BY time RANGE UNBOUNDED PRECEDING) as id_sum
FROM UNNEST([
STRUCT
('a' as id, 2 as value, 1 as time, 1 as out_time, 2 as in_time),
('a', -1, 2, 1, 2),
('b', 1, 2, 2, 4),
('b', -1, 4, 2, 4),
('c', 3, 2, 2, 3),
('c', -2, 3, 2, 3),
('a', 1, 3, 3, 3),
('a', -1, 3, 3, 3),
('c', 2, 3, 3, null),
('d', 4, 4, 4, null),
('e', 1, 3, 3, 5),
('e', -1, 5, 3, 5)
])
)
, prices AS (
SELECT *
FROM UNNEST([
STRUCT
(1 as time, 1 as price),
(2, 3),
(3, 2),
(4, 4),
(5, 2)
]) as p
)
, combined_data AS (
SELECT
d.id, d.value, d.time
,p.price AS price
,pout.price AS out_price
from data d
LEFT JOIN prices p ON p.time=d.time
LEFT JOIN prices pin ON pin.time=d.in_time
LEFT JOIN prices pout ON pout.time=d.out_time
)
SELECT *
FROM combined_data
I tried something like
SELECT *
, SUM(value) OVER (PARTITION BY time ORDER BY price RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as in_profit_per_day
FROM combined_data
but this only sums over the current time
.
In words, the query should do something like:
SUM(if(out_price<VALUE_OF(price) AT CURRENT ROW,value,0)) OVER (ORDER BY time RANGE UNBOUNDED PRECEDING)
Thanks!
UPDATE:
I'm able to retrieve the desired results with a self-join, but this operation is only feasible for my small dataset. For my > 1 billion rows it's not possible.
SELECT
cd1.id
, cd1.value
, cd1.time
, cd1.price
, cd1.out_price
, sum(cd2.value) as value_in_profit
FROM combined_data cd1
LEFT JOIN combined_data cd2
ON cd2.time <= cd1.time
AND cd2.out_price < cd1.price
GROUP BY
cd1.id
, cd1.value
, cd1.time
, cd1.price
, cd1.out_price
ORDER BY cd1.time