I have a dataset where each sku (grouped by stores) has different start date:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-04 103993.0 001 1.0 8.0
3 2019-10-02 103994.0 002 1.0 11.0
4 2019-10-04 103994.0 002 1.0 10.0
5 2019-10-05 103994.0 002 0.0 10.0
6 2019-09-30 103991.0 012 0.0 14.0
7 2019-10-02 103991.0 012 1.0 13.0
8 2019-10-04 103991.0 012 1.0 12.0
9 2019-10-05 103991.0 012 0.0 10.0
I need to fill the date gap from a non-equal start date until an end date (that should be equal to all products - the maximum date from all product).
- The Units Column should be zero when there is a gap
- The Balance should be the previous day Values (ffill)
My expected output on this example is:
date sku store Units balance
0 2019-10-01 103993.0 001 0.0 10.0
1 2019-10-02 103993.0 001 1.0 9.0
2 2019-10-03 103993.0 001 0 9.0
3 2019-10-04 103993.0 001 1.0 8.0
4 2019-10-05 103993.0 001 0 8.0
5 2019-10-02 103994.0 002 1.0 11.0
5 2019-10-03 103994.0 002 0 11.0
6 2019-10-04 103994.0 002 1.0 10.0
7 2019-10-05 103994.0 002 0.0 10.0
8 2019-09-30 103991.0 012 0.0 14.0
9 2019-10-01 103991.0 012 0 14.0
10 2019-10-02 103991.0 012 1.0 13.0
11 2019-10-03 103991.0 012 0 13.0
12 2019-10-04 103991.0 012 1.0 12.0
13 2019-10-05 103991.0 012 0.0 10.0
I have noticed that postgres works with timescaleDB and it has some function like:
locf
andtime_bucket_gapfill
function
I have tried this function suggested on github:
SELECT *
FROM (SELECT
time_bucket_gapfill('1 day', date, '2019-09-30', '2019-10-05') as day,
sku,
store,
units,
COALESCE(units, 0) as units_filled,
locf(last(balance, date)) as balance
FROM train
WHERE date >= '2019-09-30'
GROUP BY sku, store, units, day ) f
WHERE balance IS NOT NULL
But it's a bit trick for me, to work properly.