0

On a large table (> 1 billion rows), for every timestamp, I want to sum values over rows with a timestamp smaller or equal to the current timestamp where the price in column A (out_price) is smaller than the price in column B (price) from the current row.

Take this table for example:

| id | value | time | price | out_price |expected_output|
|----|-------|------|-------|-----------|---------------|
| a  | 2     | 1    | 1     | 1         | 0             |
| a  | -1    | 2    | 3     | 1         | 1             |--> row 1 and 2
| b  | 1     | 2    | 3     | 3         | 1             |--> row 1 and 2
| c  | 3     | 2    | 3     | 3         | 1             |--> row 1 and 2
| c  | -2    | 3    | 2     | 3         | 1             |--> row 1 and 2
| a  | 1     | 3    | 2     | 2         | 1             |--> row 1 and 2
| a  | -1    | 3    | 2     | 2         | 1             |--> row 1 and 2
| c  | 2     | 3    | 2     | 2         | 1             |--> row 1 and 2
| e  | 1     | 3    | 2     | 2         | 1             |--> row 1 and 2
| b  | -1    | 4    | 4     | 3         | 5             |--> row 1 to 10
| d  | 4     | 4    | 4     | 4         | 5             |--> row 1 to 10
| e  | -1    | 5    | 2     | 2         | 1             |--> row 1 and 2

for every row I want to check if and row with a time equal or smaller than the current rows time has an out_price smaller than the current rows price. If so I want to sum over all respective rows.

Here is a query to work with:

WITH data AS (
  SELECT * 
  , SUM(value) OVER (PARTITION BY id ORDER BY time RANGE UNBOUNDED PRECEDING) as id_sum
  FROM UNNEST([
    STRUCT
    ('a' as id, 2 as value, 1 as time, 1 as out_time, 2 as in_time), 
    ('a', -1, 2, 1, 2),
    ('b', 1, 2, 2, 4),
    ('b', -1, 4, 2, 4),
    ('c', 3, 2, 2, 3),
    ('c', -2, 3, 2, 3),
    ('a', 1, 3, 3, 3),
    ('a', -1, 3, 3, 3),
    ('c', 2, 3, 3, null),
    ('d', 4, 4, 4, null),
    ('e', 1, 3, 3, 5),
    ('e', -1, 5, 3, 5)
  ])
)
, prices AS (
   SELECT *
   FROM UNNEST([
      STRUCT
      (1 as time, 1 as price),
      (2, 3),
      (3, 2),
      (4, 4),
      (5, 2)
  ]) as p
)
, combined_data AS (
    SELECT 
        d.id, d.value, d.time
        ,p.price AS price
        ,pout.price AS out_price
    from data d
    LEFT JOIN prices p ON p.time=d.time
    LEFT JOIN prices pin ON pin.time=d.in_time
    LEFT JOIN prices pout ON pout.time=d.out_time
)

SELECT *
FROM combined_data

I tried something like

SELECT *
    , SUM(value) OVER (PARTITION BY time ORDER BY price RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as in_profit_per_day
FROM combined_data

but this only sums over the current time.

In words, the query should do something like:

SUM(if(out_price<VALUE_OF(price) AT CURRENT ROW,value,0)) OVER (ORDER BY time RANGE UNBOUNDED PRECEDING)

Thanks!

UPDATE:

I'm able to retrieve the desired results with a self-join, but this operation is only feasible for my small dataset. For my > 1 billion rows it's not possible.

SELECT
  cd1.id
  , cd1.value
  , cd1.time
  , cd1.price
  , cd1.out_price
  , sum(cd2.value) as value_in_profit
FROM combined_data cd1
LEFT JOIN combined_data cd2
  ON cd2.time <= cd1.time
  AND cd2.out_price < cd1.price
GROUP BY 
  cd1.id
  , cd1.value
  , cd1.time
  , cd1.price
  , cd1.out_price
ORDER BY cd1.time
smaica
  • 723
  • 2
  • 11
  • 26
  • can you show a table with the actual wanted result. that ypi want, i read your text three times and still don't get it. – nbk Oct 01 '22 at 18:52
  • You could use self join or a correlated sub-query but both will be very costly operation on data of this size. – PankajSanwal Oct 02 '22 at 02:34
  • @nbk there is a column in the table showing the actual wanted result `expected_output` and also an explanation over which rows the sum should be calculated. For each row `r` I want to `sum(value)` over all previous `time` rows `r-t` where `out_price` at `r-t` is smaller then the value of `price` at the current row `r` – smaica Oct 02 '22 at 07:21

0 Answers0