
I have a pandas DataFrame, and I'm trying (in pandas or DuckDB SQL) to do the following on each iteration, partitioned by CODE, DAY, and TIME:

  1. Iterate over each row and calculate the sum of the 2 previous TRANSACTIONS or TRANSACTIONS_FORECAST values, taking the first non-NULL of the two, i.e. COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST).
  2. Calculate the sum of the 2 previous TRANSACTIONS_WEEK values.
  3. Calculate TRANSACTIONS_FORECAST as (step 1 result) / (step 2 result) * TRANSACTIONS_WEEK.
  4. Stop iterating as soon as the TRANSACTIONS_FORECAST column values are populated.
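Ignoring the partitioning and the early stopping in step 4, the per-group arithmetic I'm after can be sketched in plain Python like this (`forecast_group` is just a hypothetical helper; the lists stand in for one CODE/DAY/TIME group ordered by WEEK):

```python
def forecast_group(transactions, transactions_week, window=2):
    """Fill missing TRANSACTIONS with a rolling-ratio forecast.

    transactions: floats, with None for missing actuals
    transactions_week: floats, same length, never missing
    """
    # COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST) as it evolves
    values = list(transactions)
    forecasts = [None] * len(values)
    for i, actual in enumerate(transactions):
        if actual is not None or i < window:
            continue
        num = sum(values[i - window:i])             # step 1
        den = sum(transactions_week[i - window:i])  # step 2
        forecasts[i] = num / den * transactions_week[i]  # step 3
        values[i] = forecasts[i]  # feed the forecast into later windows
    return forecasts
```

For the first group below, this reproduces e.g. 39 / 509803 * 244086.6208 ≈ 18.6727 for WEEK 23.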

Here's the DataFrame:

import pandas as pd
from io import StringIO

csv_string="""CODE,DAY,TIME,WEEK,TRANSACTIONS,TRANSACTIONS_WEEK,SOURCE\nA,Monday,1,20,23,263154,actual\nA,Monday,1,21,16,246649,actual\nA,Monday,1,23,,244086.6208,forecast\nA,Monday,1,24,,243197.7547,forecast\nA,Monday,1,25,,235561.9992,forecast\nA,Monday,1,26,,231105.5393,forecast\nA,Monday,1,27,,232744.1484,forecast\nA,Monday,1,28,,238718.1522,forecast\nA,Monday,1,29,,234870.8116,forecast\nA,Monday,1,30,,230410.6348,forecast\nA,Monday,1,31,,229832.8125,forecast\nA,Monday,1,32,,227024.5631,forecast\nA,Monday,1,33,,226483.0862,forecast\nA,Monday,1,34,,229247.3648,forecast\nA,Monday,1,35,,221272.5875,forecast\nA,Monday,1,36,,250239.7494,forecast\nA,Monday,1,37,,263229.4532,forecast\nA,Monday,1,38,,252955.314,forecast\nA,Monday,1,39,,241695.9493,forecast\nA,Monday,1,40,,247447.6128,forecast\nA,Monday,1,41,,247364.4851,forecast\nA,Monday,1,42,,244082.4747,forecast\nA,Monday,1,43,,229432.3064,forecast\nA,Monday,1,44,,222934.6285,forecast\nA,Monday,1,45,,224727.4305,forecast\nA,Monday,1,46,,225616.1613,forecast\nA,Monday,1,47,,225950.7391,forecast\nA,Monday,1,48,,225553.2239,forecast\nA,Monday,1,49,,225523.3712,forecast\nA,Monday,1,50,,215116.1205,forecast\nA,Monday,1,51,,239592.5374,forecast\nA,Monday,1,52,,228592.4596,forecast\nB,Monday,1,20,29,263154,orders_base\nB,Monday,1,21,27,246649,orders_base\nB,Monday,1,23,,244086.6208,forecast\nB,Monday,1,24,,243197.7547,forecast\nB,Monday,1,25,,235561.9992,forecast\nB,Monday,1,26,,231105.5393,forecast\nB,Monday,1,27,,232744.1484,forecast\nB,Monday,1,28,,238718.1522,forecast\nB,Monday,1,29,,234870.8116,forecast\nB,Monday,1,30,,230410.6348,forecast\nB,Monday,1,31,,229832.8125,forecast\nB,Monday,1,32,,227024.5631,forecast\nB,Monday,1,33,,226483.0862,forecast\nB,Monday,1,34,,229247.3648,forecast\nB,Monday,1,35,,221272.5875,forecast\nB,Monday,1,36,,250239.7494,forecast\nB,Monday,1,37,,263229.4532,forecast\nB,Monday,1,38,,252955.314,forecast\nB,Monday,1,39,,241695.9493,forecast\nB,Monday,1,40,,247447.6128,forecast\nB,Monday,1,41,,247364.4851,forecast\nB,Monday,1,42,,244082.4747,forecast\nB,Monday,1,43,,229432.3064,forecast\nB,Monday,1,44,,222934.6285,forecast\nB,Monday,1,45,,224727.4305,forecast\nB,Monday,1,46,,225616.1613,forecast\nB,Monday,1,47,,225950.7391,forecast\nB,Monday,1,48,,225553.2239,forecast\nB,Monday,1,49,,225523.3712,forecast\nB,Monday,1,50,,215116.1205,forecast\nB,Monday,1,51,,239592.5374,forecast\nB,Monday,1,52,,228592.4596,forecast\nC,Saturday,2,19,173,259156,orders_base\nC,Saturday,2,20,179,263154,orders_base\nC,Saturday,2,21,185,246649,orders_base\nC,Saturday,2,22,162,225220,orders_base\nC,Saturday,2,23,,244086.6208,forecast\nC,Saturday,2,24,,243197.7547,forecast\nC,Saturday,2,25,,235561.9992,forecast\nC,Saturday,2,26,,231105.5393,forecast\nC,Saturday,2,27,,232744.1484,forecast\nC,Saturday,2,28,,238718.1522,forecast\nC,Saturday,2,29,,234870.8116,forecast\nC,Saturday,2,30,,230410.6348,forecast\nC,Saturday,2,31,,229832.8125,forecast\nC,Saturday,2,32,,227024.5631,forecast\nC,Saturday,2,33,,226483.0862,forecast\nC,Saturday,2,34,,229247.3648,forecast\nC,Saturday,2,35,,221272.5875,forecast\nC,Saturday,2,36,,250239.7494,forecast\nC,Saturday,2,37,,263229.4532,forecast\nC,Saturday,2,38,,252955.314,forecast\nC,Saturday,2,39,,241695.9493,forecast\nC,Saturday,2,40,,247447.6128,forecast\nC,Saturday,2,41,,247364.4851,forecast\nC,Saturday,2,42,,244082.4747,forecast\nC,Saturday,2,43,,229432.3064,forecast\nC,Saturday,2,44,,222934.6285,forecast\nC,Saturday,2,45,,224727.4305,forecast\nC,Saturday,2,46,,225616.1613,forecast\nC,Saturday,2,47,,225950.7391,forecast\nC,Saturday,2,48,,225553.2239,forecast\nC,Saturday,2,49,,225523.3712,forecast\nC,Saturday,2,50,,215116.1205,forecast\nC,Saturday,2,51,,239592.5374,forecast\nC,Saturday,2,52,,228592.4596,forecast"""
df = pd.read_csv(StringIO(csv_string))

Here's the expected result:

csv_string_result="""CODE,DAY,TIME,WEEK,TRANSACTIONS,TRANSACTIONS_WEEK,SOURCE,TRANSACTIONS_ROLLING_2_SUM,TRANSACTIONS_WEEK_ROLLING_2_SUM,TRANSACTIONS_FORECAST\nA,Monday,1,20,23,263154,actual,,,\nA,Monday,1,21,16,246649,actual,,,\nA,Monday,1,23,,244086.6208,forecast,39,509803,18.67266025\nA,Monday,1,24,,243197.7547,forecast,34.67266025,490735.6208,17.18300601\nA,Monday,1,25,,235561.9992,forecast,35.85566626,487284.3756,17.3332716\nA,Monday,1,26,,231105.5393,forecast,34.51627761,478759.7539,16.66159882\nA,Monday,1,27,,232744.1484,forecast,,,\nA,Monday,1,28,,238718.1522,forecast,,,\nA,Monday,1,29,,234870.8116,forecast,,,\nA,Monday,1,30,,230410.6348,forecast,,,\nA,Monday,1,31,,229832.8125,forecast,,,\nA,Monday,1,32,,227024.5631,forecast,,,\nA,Monday,1,33,,226483.0862,forecast,,,\nA,Monday,1,34,,229247.3648,forecast,,,\nA,Monday,1,35,,221272.5875,forecast,,,\nA,Monday,1,36,,250239.7494,forecast,,,\nA,Monday,1,37,,263229.4532,forecast,,,\nA,Monday,1,38,,252955.314,forecast,,,\nA,Monday,1,39,,241695.9493,forecast,,,\nA,Monday,1,40,,247447.6128,forecast,,,\nA,Monday,1,41,,247364.4851,forecast,,,\nA,Monday,1,42,,244082.4747,forecast,,,\nA,Monday,1,43,,229432.3064,forecast,,,\nA,Monday,1,44,,222934.6285,forecast,,,\nA,Monday,1,45,,224727.4305,forecast,,,\nA,Monday,1,46,,225616.1613,forecast,,,\nA,Monday,1,47,,225950.7391,forecast,,,\nA,Monday,1,48,,225553.2239,forecast,,,\nA,Monday,1,49,,225523.3712,forecast,,,\nA,Monday,1,50,,215116.1205,forecast,,,\nA,Monday,1,51,,239592.5374,forecast,,,\nA,Monday,1,52,,228592.4596,forecast,,,\nB,Monday,1,20,29,263154,orders_base,,,\nB,Monday,1,21,27,246649,orders_base,,,\nB,Monday,1,23,,244086.6208,forecast,56,509803,26.81202497\nB,Monday,1,24,,243197.7547,forecast,53.81202497,490735.6208,26.66805322\nB,Monday,1,25,,235561.9992,forecast,53.48007819,487284.3756,25.85322815\nB,Monday,1,26,,231105.5393,forecast,52.52128136,478759.7539,25.35292274\nB,Monday,1,27,,232744.1484,forecast,,,\nB,Monday,1,28,,238718.1522,forecast,,,\nB,Monday,1,29,,234870.8116,forecast,,,\nB,Monday,1,30,,230410.6348,forecast,,,\nB,Monday,1,31,,229832.8125,forecast,,,\nB,Monday,1,32,,227024.5631,forecast,,,\nB,Monday,1,33,,226483.0862,forecast,,,\nB,Monday,1,34,,229247.3648,forecast,,,\nB,Monday,1,35,,221272.5875,forecast,,,\nB,Monday,1,36,,250239.7494,forecast,,,\nB,Monday,1,37,,263229.4532,forecast,,,\nB,Monday,1,38,,252955.314,forecast,,,\nB,Monday,1,39,,241695.9493,forecast,,,\nB,Monday,1,40,,247447.6128,forecast,,,\nB,Monday,1,41,,247364.4851,forecast,,,\nB,Monday,1,42,,244082.4747,forecast,,,\nB,Monday,1,43,,229432.3064,forecast,,,\nB,Monday,1,44,,222934.6285,forecast,,,\nB,Monday,1,45,,224727.4305,forecast,,,\nB,Monday,1,46,,225616.1613,forecast,,,\nB,Monday,1,47,,225950.7391,forecast,,,\nB,Monday,1,48,,225553.2239,forecast,,,\nB,Monday,1,49,,225523.3712,forecast,,,\nB,Monday,1,50,,215116.1205,forecast,,,\nB,Monday,1,51,,239592.5374,forecast,,,\nB,Monday,1,52,,228592.4596,forecast,,,\nC,Saturday,2,19,173,259156,orders_base,,,\nC,Saturday,2,20,179,263154,orders_base,,,\nC,Saturday,2,21,185,246649,orders_base,,,\nC,Saturday,2,22,162,225220,orders_base,,,\nC,Saturday,2,23,,244086.6208,forecast,347,471869,179.4948544\nC,Saturday,2,24,,243197.7547,forecast,341.4948544,469306.6208,176.9648629\nC,Saturday,2,25,,235561.9992,forecast,356.4597173,487284.3756,172.319015\nC,Saturday,2,26,,231105.5393,forecast,349.283878,478759.7539,168.6053147\nC,Saturday,2,27,,232744.1484,forecast,,,\nC,Saturday,2,28,,238718.1522,forecast,,,\nC,Saturday,2,29,,234870.8116,forecast,,,\nC,Saturday,2,30,,230410.6348,forecast,,,\nC,Saturday,2,31,,229832.8125,forecast,,,\nC,Saturday,2,32,,227024.5631,forecast,,,\nC,Saturday,2,33,,226483.0862,forecast,,,\nC,Saturday,2,34,,229247.3648,forecast,,,\nC,Saturday,2,35,,221272.5875,forecast,,,\nC,Saturday,2,36,,250239.7494,forecast,,,\nC,Saturday,2,37,,263229.4532,forecast,,,\nC,Saturday,2,38,,252955.314,forecast,,,\nC,Saturday,2,39,,241695.9493,forecast,,,\nC,Saturday,2,40,,247447.6128,forecast,,,\nC,Saturday,2,41,,247364.4851,forecast,,,\nC,Saturday,2,42,,244082.4747,forecast,,,\nC,Saturday,2,43,,229432.3064,forecast,,,\nC,Saturday,2,44,,222934.6285,forecast,,,\nC,Saturday,2,45,,224727.4305,forecast,,,\nC,Saturday,2,46,,225616.1613,forecast,,,\nC,Saturday,2,47,,225950.7391,forecast,,,\nC,Saturday,2,48,,225553.2239,forecast,,,\nC,Saturday,2,49,,225523.3712,forecast,,,\nC,Saturday,2,50,,215116.1205,forecast,,,\nC,Saturday,2,51,,239592.5374,forecast,,,\nC,Saturday,2,52,,228592.4596,forecast,,,"""
df_result = pd.read_csv(StringIO(csv_string_result))

Here's the DuckDB recursive CTE I attempted:

-- The problem with this is that the terminal would kill the process after 10 minutes or so
WITH RECURSIVE ROLLING_SUM AS (

SELECT CODE
, DAY
, TIME
, WEEK
, TRANSACTIONS
, TRANSACTIONS_WEEK
, CASE
    WHEN TRANSACTIONS IS NULL
    THEN SUM(TRANSACTIONS) OVER (
        PARTITION BY CODE, DAY, TIME
        ORDER BY WEEK
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
    )
    END AS ROLLING_2_SUM_TRANSACTIONS
, CASE -- note: a CASE statement isn't actually needed for this one
    WHEN TRANSACTIONS IS NULL
    THEN SUM(TRANSACTIONS_WEEK) OVER (
        PARTITION BY CODE, DAY, TIME
        ORDER BY WEEK
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
    )
    END AS ROLLING_2_SUM_TRANSACTIONS_WEEK
, ROLLING_2_SUM_TRANSACTIONS / ROLLING_2_SUM_TRANSACTIONS_WEEK AS ROLLING_2_TRANSACTIONS_PCT
, ROLLING_2_TRANSACTIONS_PCT * TRANSACTIONS_WEEK AS TRANSACTIONS_FORECAST
, SOURCE
FROM df

UNION ALL

SELECT CODE
, DAY
, TIME
, WEEK
, TRANSACTIONS
, TRANSACTIONS_WEEK
, CASE
    WHEN COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST) IS NULL
    THEN SUM(
        COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST)
    ) OVER (
        PARTITION BY CODE, DAY, TIME
        ORDER BY WEEK
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
    )
    END
, CASE
    WHEN COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST) IS NULL
    THEN SUM(TRANSACTIONS_WEEK) OVER (
        PARTITION BY CODE, DAY, TIME
        ORDER BY WEEK
        ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
    )
    END
, ROLLING_2_SUM_TRANSACTIONS / ROLLING_2_SUM_TRANSACTIONS_WEEK AS ROLLING_2_TRANSACTIONS_PCT
, ROLLING_2_TRANSACTIONS_PCT * TRANSACTIONS_WEEK AS TRANSACTIONS_FORECAST
, SOURCE
FROM ROLLING_SUM
WHERE WEEK <= 52

)

SELECT *
FROM ROLLING_SUM
WHERE COALESCE(TRANSACTIONS, TRANSACTIONS_FORECAST) IS NOT NULL

I tried this in pure pandas, but it was quickly getting out of hand. I also looked into itertuples, but I'm not sure how to compute a rolling sum that way.
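For context, the itertuples route I was considering would track the window with a deque; here's a minimal sketch of just the rolling-sum mechanics (made-up sample values, no partition handling):

```python
from collections import deque

import pandas as pd

# Hypothetical mini-frame, only to show the windowing mechanics
sample = pd.DataFrame({"WEEK": [20, 21, 23, 24], "X": [23.0, 16.0, 5.0, 7.0]})

window = deque(maxlen=2)  # holds at most the 2 most recent values
rolling_sums = []
for row in sample.itertuples(index=False):
    # sum of the 2 values preceding the current row, once available
    rolling_sums.append(sum(window) if len(window) == 2 else None)
    window.append(row.X)
```

But wiring the forecast feedback (where each new forecast joins the window for the next row) into this is where I got stuck.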

Help on this would be greatly appreciated.

AK91
    `df` and `df_result` appear to be identical, perhaps you put the incorrect `df_result` in by mistake? – jqurious Jun 07 '23 at 15:42
  • Can you explain a bit about how transaction forecast gets generated? It appears that in step 1 you require transaction forecast but it doesn't get generated until step 3 in your steps. – magladde Jun 07 '23 at 20:30

1 Answer


If I understand correctly, it looks like you're trying to:

window_size = 2

df[[
    'TRANSACTIONS_ROLLING_2_SUM', 
    'TRANSACTIONS_WEEK_ROLLING_2_SUM', 
    'TRANSACTIONS_FORECAST'
]] = float('nan')

df[
    'TRANSACTIONS_WEEK_ROLLING_2_SUM'
] = df.rolling(window_size)['TRANSACTIONS_WEEK'].sum().shift()

for idx in range(window_size, len(df.index)):
    window = df.iloc[idx - window_size:idx]

    transactions_week = df.loc[idx, 'TRANSACTIONS_WEEK']

    transactions_rolling_2_sum = window['TRANSACTIONS'].fillna(window['TRANSACTIONS_FORECAST']).sum()
    transactions_week_rolling_2_sum = window['TRANSACTIONS_WEEK'].sum()

    transactions_forecast = transactions_rolling_2_sum / transactions_week_rolling_2_sum * transactions_week

    df.loc[idx, 'TRANSACTIONS_FORECAST'] = transactions_forecast
    df.loc[idx, 'TRANSACTIONS_ROLLING_2_SUM'] = transactions_rolling_2_sum
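The loop above walks the frame in a single pass, which works for this data because each partition begins with actual values. If the window must never cross a (CODE, DAY, TIME) boundary, the same update could be run per group instead — a self-contained toy sketch of that pattern (made-up values, not your data):

```python
import pandas as pd

# Toy frame with two partitions; column names mirror the real ones
# but the values are made up purely to illustrate the pattern.
toy = pd.DataFrame({
    "CODE": ["A", "A", "A", "B", "B", "B"],
    "TRANSACTIONS": [2.0, 4.0, None, 10.0, 20.0, None],
    "TRANSACTIONS_WEEK": [100.0, 100.0, 50.0, 100.0, 100.0, 200.0],
})
toy["TRANSACTIONS_FORECAST"] = float("nan")

window_size = 2
for _, group in toy.groupby("CODE", sort=False):
    for pos in range(window_size, len(group)):
        window = group.iloc[pos - window_size:pos]
        num = window["TRANSACTIONS"].fillna(window["TRANSACTIONS_FORECAST"]).sum()
        den = window["TRANSACTIONS_WEEK"].sum()
        idx = group.index[pos]
        toy.loc[idx, "TRANSACTIONS_FORECAST"] = num / den * toy.loc[idx, "TRANSACTIONS_WEEK"]
        group = toy.loc[group.index]  # refresh so forecasts feed later windows
```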

If we look at the non-NA rows from your df_result:

df_result.dropna(subset=['TRANSACTIONS_ROLLING_2_SUM'])
   CODE       DAY  TIME  WEEK  TRANSACTIONS  TRANSACTIONS_WEEK    SOURCE  TRANSACTIONS_ROLLING_2_SUM  TRANSACTIONS_WEEK_ROLLING_2_SUM  TRANSACTIONS_FORECAST
2     A    Monday     1    23           NaN        244086.6208  forecast                   39.000000                      509803.0000              18.672660
3     A    Monday     1    24           NaN        243197.7547  forecast                   34.672660                      490735.6208              17.183006
4     A    Monday     1    25           NaN        235561.9992  forecast                   35.855666                      487284.3756              17.333272
5     A    Monday     1    26           NaN        231105.5393  forecast                   34.516278                      478759.7539              16.661599
34    B    Monday     1    23           NaN        244086.6208  forecast                   56.000000                      509803.0000              26.812025
35    B    Monday     1    24           NaN        243197.7547  forecast                   53.812025                      490735.6208              26.668053
36    B    Monday     1    25           NaN        235561.9992  forecast                   53.480078                      487284.3756              25.853228
37    B    Monday     1    26           NaN        231105.5393  forecast                   52.521281                      478759.7539              25.352923
68    C  Saturday     2    23           NaN        244086.6208  forecast                  347.000000                      471869.0000             179.494854
69    C  Saturday     2    24           NaN        243197.7547  forecast                  341.494854                      469306.6208             176.964863
70    C  Saturday     2    25           NaN        235561.9992  forecast                  356.459717                      487284.3756             172.319015
71    C  Saturday     2    26           NaN        231105.5393  forecast                  349.283878                      478759.7539             168.605315

We can check that df now contains the same values:

pd.testing.assert_frame_equal(
    df.loc[df_result.dropna(subset=['TRANSACTIONS_ROLLING_2_SUM']).index],
    df_result.dropna(subset=['TRANSACTIONS_ROLLING_2_SUM'])
)
jqurious