I have the following dataframe:
volume
month source brand
2020-01-01 SA BA 5
2020-02-01 SA BA 10
2020-02-01 SA BB 5
2020-01-01 SB BC 5
2020-02-01 SB BC 10
I want to create a dataframe/Multiindex that has a row for every single date that apprears in any of the records. I want to use fill_value=0
for the volumne column.
However I don't want to add any other combinations of Index levels eg. I do not want to add a row for an index with Source SA and Brand BC as there is no known combination of the two columns.
volume
month source brand
2020-01-01 SA BA 5
2020-02-01 SA BA 10
2020-01-01 SA BB 0 # Row to be added.
2020-02-01 SA BB 5
2020-01-01 SB BC 5
2020-02-01 SB BC 10
I've done this using windowing without index but it's incredibly slow (this df quite big).
I've tried to do this using this approach: How to reindex a MultiIndex dataframe with a setup that looks like this:
df_dates = df.groupby(['month']).sum() # df is the df with just a range index.
idx = df_b.index # df_b is the existing df with MultiIndex and missing rows.
ilen = len(idx.levels)
list(range(ilen-1))
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
new_index = pd.MultiIndex.from_product(
new_index_cols,
names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)
But I realize that that MultiIndex.from_product
would show every single combination of all index columns, which I don't want to achieve and which also would need some magnitudes more memory.
The way I see it I could potentially manipulate the full index columns myself by using index.get_level_values(i)
and MultiIndex.from_array
but I was hoping to find a more trivial process than that.
The process has to be generic as I need to apply it dataframes with different index column values but all share the same volume column and the month column in the index.