2

I have the following dataframe:

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I want to create a dataframe/Multiindex that has a row for every single date that apprears in any of the records. I want to use fill_value=0 for the volumne column.

However I don't want to add any other combinations of Index levels eg. I do not want to add a row for an index with Source SA and Brand BC as there is no known combination of the two columns.

                             volume
month      source   brand    
2020-01-01     SA       BA        5
2020-02-01     SA       BA       10
2020-01-01     SA       BB        0 # Row to be added. 
2020-02-01     SA       BB        5
2020-01-01     SB       BC        5
2020-02-01     SB       BC       10

I've done this using windowing without index but it's incredibly slow (this df quite big).

I've tried to do this using this approach: How to reindex a MultiIndex dataframe with a setup that looks like this:

df_dates = df.groupby(['month']).sum() # df is the df with just a range index. 
idx = df_b.index # df_b is the existing df with MultiIndex and missing rows.  
ilen = len(idx.levels)
list(range(ilen-1))
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
new_index = pd.MultiIndex.from_product(
  new_index_cols,
  names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)

But I realize that that MultiIndex.from_product would show every single combination of all index columns, which I don't want to achieve and which also would need some magnitudes more memory.

The way I see it I could potentially manipulate the full index columns myself by using index.get_level_values(i) and MultiIndex.from_array but I was hoping to find a more trivial process than that.

The process has to be generic as I need to apply it dataframes with different index column values but all share the same volume column and the month column in the index.

Hans
  • 2,800
  • 3
  • 28
  • 40

3 Answers3

1

I've often found MultiIndexes to be more trouble than they're worth, so here is a 'straight' or at least more traditional/relational alternative to your index_fill_missing function.

Note: requires Pandas >= 1.2 for the .merge(.., how='cross')

Starting from the dataframe in your recent answer:

       month brand source  volume1  volume2
0 2020-01-01    BA     SA        5        5
1 2020-02-01    BA     SA       10       10
2 2020-02-01    BB     SA        5        5
3 2020-01-01    BC     SB        5        5
4 2020-01-01    BC     SB       10       10

Aggregating the values per month is the first step:

df = (df.groupby(['month','source','brand'], as_index=False)
        .agg( {'volume1': np.sum, 'volume2': np.sum } ) ) 

Create a 'basis' dataframe that consists of all months, crossed with all prevalent source-brand combinations:

months = pd.DataFrame(df.month.drop_duplicates())
source_brand_combinations = df[['source','brand']].drop_duplicates()
basis = months.merge(source_brand_combinations, how='cross')

Merge 'basis' back with source data, filling in the actual volumes where available - and fillna(0) where values are not provided:

result = basis.merge( df, on=['month','source','brand'], how='left').fillna(0)
result[['volume1','volume2']] = result[['volume1','volume2']].astype(int)
       month source brand  volume1  volume2
0 2020-01-01     SA    BA        5        5
1 2020-01-01     SB    BC       15       15
2 2020-01-01     SA    BB        0        0
3 2020-02-01     SA    BA       10       10
4 2020-02-01     SB    BC        0        0
5 2020-02-01     SA    BB        5        5

...and if you want it with a multi-index:

result.set_index(['source','brand','month']).sort_values(['source','brand','month'])
                               volume1    volume2
source  brand   month
SA      BA      2020-01-01           5          5
                2020-02-01          10         10
        BB      2020-01-01           0          0
                2020-02-01           5          5
SB      BC      2020-01-01          15         15
                2020-02-01           0          0
kleynjan
  • 108
  • 5
0

When having your dataframe:

                            volume
month       source  brand   
2020-01-01  A       A       5.0
2020-02-01  A       A       10.0
2020-02-01  A       B       5.0

I believe you want a row for each unique combination of month, source, brand?

Does this help?

months = df.index.unique(level=0)
source = df.index.unique(level=1)
brands = df.index.unique(level=2)

df2 = pd.DataFrame(
    index = pd.MultiIndex.from_product(
        [months, source, brands])
    ).rename_axis(['month','source','brand'])
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)

This yields:

                                volume
    month       source  brand   
    2020-01-01  A       A       5.0
    2020-01-01  A       B       0.0
    2020-02-01  A       A       10.0
    2020-02-01  A       B       5.0

update 1

No sorry I was not clearer will update the original as well. I want only to have the dates filled up, but not all combinations of source/brand because otherwise I'd have billions of rows. (MY current error messages when running so if I have no Brand A in Source B then I only want to see one row per date for Brand A/Source A.

Try this for a change:

months = df.index.unique(level=0)
sourcebrand = df.groupby(level=['source','brand']).size().index
tuples = [(m,) + sb for sb in sourcebrand for m in months]

df2 = pd.DataFrame(index = pd.MultiIndex.from_tuples(tuples, names=['month','source','brand']))
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)

This yields:

                            volume
month       source  brand   
2020-01-01  SA      BA      5.0
2020-02-01  SA      BA      10.0
2020-01-01  SA      BB      0.0
2020-02-01  SA      BB      5.0
2020-01-01  SB      BC      5.0
2020-02-01  SB      BC      10.0

update 2

When there indices are not unique (f.e. 2 values for month=2020-01-01, source=SB, brand=BC; As seen in your own answer), you can easily sum this afterwards using:

.groupby(level=[0,1,2]).sum()
Paul
  • 1,801
  • 1
  • 12
  • 18
  • No sorry I was not clearer will update the original as well. I want only to have the dates filled up, but not all combintations of source/brand because otherwise I'd have billions of rows. (MY current error essages when running so if I have no Brand A in Source B then I only want to see one row perd date for Brand A/Source A. – Hans Jan 04 '22 at 16:16
0

I found my solution eventually though it's more complex than what I liked. I added it in a generic function:

def index_fill_missing(df, index_cols, fill_col, fill_value=0):
    """
    Finds all the unique values of the column 'fill_col' in df and 
    returns a dataframe with an index based on index_cols + fill_col where the
    a new row is added for any rows where the value in fill_col did not 
    previously exist in the dataframe. 

    The additional values are set to the value of the parameter 'fill_value'

    Parameters: 
    df pandas.DataFrame: the dataframe 
    index_cols list(str): the list of column names to use in the index column
    fill_col (str): the column name for which all values should appear in every 
    single index. 
    fill_value (any): the value to fill the metric columns in new rows. 

    Returns: 
    pandasPdateframe: DataFrame with MultiINdex and additional rows.
    """
    # Get unique values for the fill_col.
    fill_val_list = df[fill_col].unique().tolist()
    # Create a dataframe with the reduced index and get a list of tuples 
    # with the index values. 
    df_i = df.set_index(index_cols)
    df_i_tup = df_i.index.unique().tolist()
    # Append the fill col values to each and every of these index tuples. 
    df_f_tup = []
    col_names = list(index_cols)
    col_names.append(fill_col)
    print(col_names)
    for tup in df_i_tup:
        for fill_val in fill_val_list:
            df_f_tup.append(tup + (fill_val,))
    # Create an index based on these tuples and reindex the dataframe. 
    idx_f = pd.MultiIndex.from_tuples(df_f_tup, names=col_names)
    print(idx_f)
    # We can only reindex if there are no duplicate values 
    # Hence the groupby with sum function. 
    df_g = df.groupby(by=col_names).sum()
    df_f = df_g.reindex(index=idx_f, fill_value=fill_value)
    return df_f

Creating the sample dataframe:

  '2020-01-01', '2020-02-01', 
  '2020-02-01', 
  '2020-01-01', '2020-02-01']
brands = ['BA','BA','BB','BC','BC']
sources = ['SA', 'SA', 'SA', 'SB', 'SB']
volumes1 = [5, 10, 5, 5, 10]
volumes2 = [5, 10, 5, 5, 10]
df = pd.DataFrame(
  list(zip(dates, brands, sources, volumes1, volumes2)), 
  columns=['month', 'brand', 'source', 'volume1', 'volume2']
)
df

Resulting Output:

        month brand source  volume1  volume2
0  2020-01-01    BA     SA        5        5
1  2020-02-01    BA     SA       10       10
2  2020-02-01    BB     SA        5        5
3  2020-01-01    BC     SB        5        5
4  2020-02-01    BC     SB       10       10

And applying the function:

df2 = index_fill_missing(df, ['source', 'brand'], 'month')
df2

Resulting output:

                         volume1  volume2
source brand month                       
SA     BA    2020-01-01        5        5
             2020-02-01       10       10
       BB    2020-01-01        0        0
             2020-02-01        5        5
SB     BC    2020-01-01       15       15
             2020-02-01        0        0
Hans
  • 2,800
  • 3
  • 28
  • 40
  • Why do you suddenly have 2x `2020-01-01` for BC-SB combination? This is different than your Original Question? – Paul Jan 05 '22 at 12:44
  • That was in error, outputs are not changed though. – Hans Jan 10 '22 at 10:44
  • Update #2 in my answer will complete my answer including the summation, please check if this answers your question better than your own answer :) – Paul Jan 10 '22 at 15:06