3

I have a dataset where each sku (grouped by stores) has different start date:

      date       sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-04  103993.0    001    1.0      8.0


3  2019-10-02  103994.0    002    1.0     11.0
4  2019-10-04  103994.0    002    1.0     10.0
5  2019-10-05  103994.0    002    0.0     10.0

6  2019-09-30  103991.0    012    0.0     14.0
7  2019-10-02  103991.0    012    1.0     13.0
8  2019-10-04  103991.0    012    1.0     12.0
9  2019-10-05  103991.0    012    0.0     10.0

I need to fill the date gap from a non-equal start date until an end date (that should be equal to all products - the maximum date from all product).

  • The Units Column should be zero when there is a gap
  • The Balance should be the previous day Values (ffill)

My expected output on this example is:

      date       sku     store  Units   balance
0  2019-10-01  103993.0    001    0.0     10.0
1  2019-10-02  103993.0    001    1.0      9.0
2  2019-10-03  103993.0    001    0        9.0
3  2019-10-04  103993.0    001    1.0      8.0
4  2019-10-05  103993.0    001    0        8.0

5  2019-10-02  103994.0    002    1.0     11.0
5  2019-10-03  103994.0    002    0       11.0
6  2019-10-04  103994.0    002    1.0     10.0
7  2019-10-05  103994.0    002    0.0     10.0

8   2019-09-30  103991.0    012    0.0     14.0
9   2019-10-01  103991.0    012    0       14.0
10  2019-10-02  103991.0    012    1.0     13.0
11  2019-10-03  103991.0    012    0       13.0
12  2019-10-04  103991.0    012    1.0     12.0
13  2019-10-05  103991.0    012    0.0     10.0

I have noticed that postgres works with timescaleDB and it has some function like:

locf and time_bucket_gapfill function

I have tried this function suggested on github:

 SELECT * 
    FROM (SELECT 
        time_bucket_gapfill('1 day', date, '2019-09-30', '2019-10-05') as day, 
        sku, 
        store, 
        units,
        COALESCE(units, 0) as units_filled, 
        locf(last(balance, date)) as balance 
        FROM train
        WHERE date >= '2019-09-30' 
        GROUP BY sku, store, units, day ) f 
    WHERE balance IS NOT NULL

But it's a bit trick for me, to work properly.

Cesar
  • 575
  • 3
  • 16

1 Answers1

1

I would recommend:

select gs.dte, tt.store, tt.sku, coalesce(t.units, 0) as units,
       coalesce(t.balance,
                max(t.balance) over (partition by tt.store, tt.sku order by gs.dte)
               )
from (select store, sku, min(date) as min_date,
             max(max(date)) over () as max_date
      from train
      group by store, sku
     ) tt cross join lateral
     generate_series(tt.min_date, tt.max_date, interval '1 day') gs(dte) left join
     train t
      on tt.store = t.store and
         tt.sku = t.sku and
         tt.date = gs.dte;

This particular version assumes that balance is always decreasing (as in your example data). If that is not the case, the logic can be adjusted.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786