
I'm working on this Kaggle competition as the final project for the course I'm taking, and for that I was trying to replicate this notebook, but there is a function he uses to get the lagged features that uses way too much memory for me. Here is his code:

def lag_feature(df, lags, col):
    tmp = df[['date_block_num','shop_id','item_id',col]]
    for i in lags:
        shifted = tmp.copy()
        shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
        shifted['date_block_num'] += i
        df = pd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
    return df

After failing to run with his code, I made some slight modifications to try to reduce the memory usage, and I started using Google Colab because it has more memory than my laptop. Here is my code:

def lag_feature(df, lags, col):
  df = dd.from_pandas(df, chunksize=1000)
  tmp = df[['date_block_num','shop_id','item_id',col]]
  for i in lags:
    shifted = tmp[tmp.date_block_num + i <= 34].copy()
    shifted.columns = ['date_block_num','shop_id','item_id', col+'_lag_'+str(i)]
    shifted['date_block_num'] += i
    df = dd.merge(df, shifted, on=['date_block_num','shop_id','item_id'], how='left')
  return df.compute()

But it still uses way too much memory; it got to the point where my code was using the 10 GB of memory that Google Colab offers for this function call:

sales_train = lag_feature(sales_train, [1, 2, 3, 12, 20], 'item_cnt_month')

Are there ways that I can decrease my memory usage? Just to show, this is my dataframe:

Int64Index: 2829445 entries, 0 to 3134798
Data columns (total 8 columns):
date                object
date_block_num      int8
item_cnt_day        float16
item_id             int16
item_price          float16
shop_id             int8
item_cnt_month      float16
item_category_id    int8
dtypes: float16(3), int16(1), int8(3), object(1)
memory usage: 152.9+ MB
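To make the 152.9+ MB figure concrete (the "+" means the strings inside the `object` column aren't counted), here is a per-column breakdown on a toy frame with the same dtypes; the values are made up, only the measuring call matters:

```python
import pandas as pd

# Toy frame with the same dtypes as above (made-up values)
sales_train = pd.DataFrame({
    'date': ['14.09.2013', '01.09.2013'] * 3,
    'date_block_num': [8] * 6,
    'item_id': [2848] * 6,
    'item_cnt_month': [2.0] * 6,
}).astype({'date_block_num': 'int8', 'item_id': 'int16',
           'item_cnt_month': 'float16'})

# deep=True measures the Python strings inside the object column,
# which the "+" in .info()'s estimate leaves uncounted
per_column = sales_train.memory_usage(deep=True)
print(per_column)
```

On the real frame the `object`-dtype `date` column dominates by a wide margin, so parsing it to `datetime64` (or dropping it) is an easy win before any merging.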

Just to add more info: the column 'date_block_num' holds a number identifying the month in which that row occurred. What I'm trying to do is bring data from a previous month into each row. So a lag of 1 means I want to take the data from one month ago for each product in my dataframe and add it in a new column named 'feature_lag_1'. For example, with this dataframe:

         date  date_block_num  item_cnt_day  item_id  item_price  shop_id  \
0  14.09.2013               8           1.0     2848        99.0       24   
1  14.09.2013               8           1.0     2848        99.0       24   
2  14.09.2013               8           1.0     2848        99.0       24   
3  01.09.2013               8           1.0     2848        99.0       24   
4  01.09.2013               8           1.0     2848        99.0       24   

   item_cnt_month  item_category_id
0             2.0                30
1             2.0                30 
2             2.0                30 
3             2.0                30 
4             2.0                30

and this function call:

sales_train = lag_feature(sales_train, [1], 'item_cnt_month')

I want this output:

         date  date_block_num  item_cnt_day  item_id  item_price  shop_id  \
0  14.09.2013               8           1.0     2848        99.0       24   
1  14.09.2013               8           1.0     2848        99.0       24   
2  14.09.2013               8           1.0     2848        99.0       24   
3  01.09.2013               8           1.0     2848        99.0       24   
4  01.09.2013               8           1.0     2848        99.0       24   

   item_cnt_month  item_category_id  item_cnt_month_lag_1  
0             2.0                30                   3.0  
1             2.0                30                   3.0  
2             2.0                30                   3.0  
3             2.0                30                   3.0  
4             2.0                30                   3.0  
João Areias
    Can you provide a [mcve] with your data and expected output? I don't see why you need a merge here. You might be able to get away with just `.shift` or simply using the index to add the new Series as a column with automatic alignment on index. – ALollz Oct 07 '18 at 23:45
  • Right now I'm just waiting to see if MRocklin's answer will work, and as soon as that finishes processing I can post more details; if it works I'll try to describe it better. It's just that my notebook is still processing. – João Areias Oct 07 '18 at 23:59
  • I made a few changes which I hope will be helpful, as soon as I can, I'll add the example. – João Areias Oct 08 '18 at 00:06
  • @ALollz , I've added more info if that helps – João Areias Oct 08 '18 at 01:09
  • Are you using a Jupyter Notebook for this? – Trenton McKinney Oct 11 '18 at 07:15
  • @Trenton_M Yes I am – João Areias Oct 11 '18 at 09:49
  • Seems like you have a solution. **Also note that Jupyter is not very memory efficient and doesn't seem to release memory all that well.** I have much better memory efficiency using an IDE like PyCharm, which is free. My use case when doing something memory intensive is to develop in Jupyter and then export the notebook to a Python file. You can preprocess your DataFrame, save it to a csv and then reopen if for analysis. – Trenton McKinney Oct 12 '18 at 15:35

3 Answers


The memory issue you are facing may be due to keeping multiple (sub)copies of the same dataframe. That is not necessary in pandas: as everyone else pointed out, the .shift function can achieve what you need.

First, create a pandas dataframe that has two shops, i.e. 24 and 25.

df = pd.DataFrame({'shop_id':[24, 24, 24, 24, 24, 25, 25, 25, 25, 25],
                   'item_id': [2000, 2000, 2000, 3000, 3000, 1000, 1000, 1000, 1000, 1000], 
                   'date_block_num': [7, 8, 9, 7, 8, 5, 6, 7, 8, 9], 
                   'item_cnt_month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

+-------+-------+--------------+--------------+
|shop_id|item_id|date_block_num|item_cnt_month|
+-------+-------+--------------+--------------+
|     24|   2000|             7|             1|
|     24|   2000|             8|             2|
|     24|   2000|             9|             3|
|     24|   3000|             7|             4|
|     24|   3000|             8|             5|
|     25|   1000|             5|             6|
|     25|   1000|             6|             7|
|     25|   1000|             7|             8|
|     25|   1000|             8|             9|
|     25|   1000|             9|            10|
+-------+-------+--------------+--------------+

In shop 24, there are items 2000 and 3000.

There is 1 count of item 2000 in date block 7, 2 counts in date block 8, etc.

The goal is to create an item_cnt_month lag column that has the value of item_cnt_month n months ago, for that item at that shop.

To create the lag features, you can use the function below.

def lag_features(df, lags, group_cols, shift_col):
    """
    Arguments:
        df (pd.DataFrame)
        lags (list(int)): the numbers of months to lag by
        group_cols (list(str)): the columns to group by (the merge key in the join-based version)
        shift_col (str): the name of the column to shift
    """

    for lag in lags:
        new_col = '{0}_lag_{1}'.format(shift_col, lag)
        df[new_col] = df.groupby(group_cols)[shift_col].shift(lag)

    return df

By calling

lags = [1, 2]
group_cols = ['shop_id', 'item_id']
shift_col = 'item_cnt_month'
order_col = 'date_block_num' 

df = df.sort_values(by=group_cols+[order_col], ascending=True)
df = lag_features(df, lags, group_cols, shift_col)

The result is this:

+-------+-------+--------------+--------------+--------------------+--------------------+
|shop_id|item_id|date_block_num|item_cnt_month|item_cnt_month_lag_1|item_cnt_month_lag_2|
+-------+-------+--------------+--------------+--------------------+--------------------+
|     24|   2000|             7|             1|                 NaN|                 NaN|
|     24|   2000|             8|             2|                 1.0|                 NaN|
|     24|   2000|             9|             3|                 2.0|                 1.0|
|     24|   3000|             7|             4|                 NaN|                 NaN|
|     24|   3000|             8|             5|                 4.0|                 NaN|
|     25|   1000|             5|             6|                 NaN|                 NaN|
|     25|   1000|             6|             7|                 6.0|                 NaN|
|     25|   1000|             7|             8|                 7.0|                 6.0|
|     25|   1000|             8|             9|                 8.0|                 7.0|
|     25|   1000|             9|            10|                 9.0|                 8.0|
+-------+-------+--------------+--------------+--------------------+--------------------+

Note that because there is no explicit join, you need to order the dataframe correctly beforehand, using .sort_values on all key columns plus the date column.
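One caveat behind that note: .shift lags by row position within each group, so on an unsorted frame the "lag" picks up whatever row happens to come before, and if a month is missing it silently returns the previous observed month rather than NaN (unlike the merge version). A minimal sketch of the ordering issue, with made-up values:

```python
import pandas as pd

# One shop/item observed in months 7, 8, 9, but stored out of order
df = pd.DataFrame({'shop_id': [24, 24, 24],
                   'item_id': [2000, 2000, 2000],
                   'date_block_num': [9, 7, 8],
                   'item_cnt_month': [3, 1, 2]})

# Without sorting, the "lag" is just the previous *row* in the group
wrong = df.groupby(['shop_id', 'item_id'])['item_cnt_month'].shift(1)

# After sorting by the date column, the previous row is the previous month
df = df.sort_values(['shop_id', 'item_id', 'date_block_num'])
right = df.groupby(['shop_id', 'item_id'])['item_cnt_month'].shift(1)

print(wrong.tolist())  # [nan, 3.0, 1.0] -- month 7 "lags" month 9
print(right.tolist())  # [nan, 1.0, 2.0] -- months 8 and 9 lag correctly
```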

Merelda

Calling df.compute() turns your full result into a pandas DataFrame, so if your result doesn't fit into memory then Dask won't help you here.

Instead, it's more common not to call compute, and to eventually compute some sort of aggregation that fits nicely in memory, or, if you need the full dataframe, to write it to disk with something like df.to_parquet()
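The same stream-to-disk idea can be sketched in plain pandas (swapping to_parquet for a chunked CSV append purely for illustration; the file name and toy data are made up): process one month at a time and append each piece to disk, so the full result is never held in memory alongside the source frame.

```python
import os
import pandas as pd

# Toy stand-in for the big frame (made-up values)
df = pd.DataFrame({'date_block_num': [7, 7, 8, 8, 9],
                   'shop_id': [24, 25, 24, 25, 24],
                   'item_cnt_month': [1.0, 6.0, 2.0, 7.0, 3.0]})

out_path = 'lagged.csv'  # illustrative file name
if os.path.exists(out_path):
    os.remove(out_path)

# Write one month's rows at a time, appending to disk; the header is
# written only on the first append, when the file doesn't exist yet
for block, chunk in df.groupby('date_block_num'):
    chunk.to_csv(out_path, mode='a', index=False,
                 header=not os.path.exists(out_path))

result = pd.read_csv(out_path)  # reading it back loads everything again
print(len(result))              # 5: same rows as the source frame
```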

MRocklin
  • This still uses too much memory and takes too long; isn't there another alternative? The final data frame shouldn't use that much memory, it's just one more column – João Areias Oct 10 '18 at 13:16

The function can be rewritten in Python 3.6+ as follows (the dataframe must be presorted first):

df = df.sort_values(['date_block_num']).reset_index(drop=True)

def lag_feature(df, lags, col):
    key_columns = ['shop_id', 'item_id']
    for lag in lags:
        # everything except `col` becomes the index, so only `col` gets shifted
        all_but_col = list(df.columns.difference([col]))
        df[f'{col}_lag_{lag}'] = (df.set_index(all_but_col)
                                    .groupby(level=key_columns)[col]
                                    .shift(lag)
                                    .reset_index(drop=True))
    return df
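A quick self-contained check of this approach on toy data (the function is restated so the snippet runs standalone; the `[col]` selection makes the shift yield a Series rather than a one-column frame):

```python
import pandas as pd

def lag_feature(df, lags, col):
    key_columns = ['shop_id', 'item_id']
    for lag in lags:
        # everything except `col` becomes the index, so only `col` gets shifted
        all_but_col = list(df.columns.difference([col]))
        df[f'{col}_lag_{lag}'] = (df.set_index(all_but_col)
                                    .groupby(level=key_columns)[col]
                                    .shift(lag)
                                    .reset_index(drop=True))
    return df

# Toy data: two shop/item series over months 7-9 (made-up counts)
df = pd.DataFrame({'shop_id': [24, 24, 24, 25, 25, 25],
                   'item_id': [2000, 2000, 2000, 1000, 1000, 1000],
                   'date_block_num': [7, 8, 9, 7, 8, 9],
                   'item_cnt_month': [1, 2, 3, 6, 7, 8]})

# Presort, as the answer requires
df = df.sort_values(['date_block_num']).reset_index(drop=True)

df = lag_feature(df, [1], 'item_cnt_month')
print(df['item_cnt_month_lag_1'].tolist())  # [nan, nan, 1.0, 6.0, 2.0, 7.0]
```

Each month-7 row gets NaN (no prior month), and each later row picks up its own shop/item's previous-month count.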
ayorgo