3

I have a large panel dataframe that includes multiple assets for multiple dates. The issue is that some assets don't have values every day, so I am looking to just have any blank values for the day be replaced by the last available value for that asset.

For example:

                       tradeDate   assetId  ticker    Sedol      price
0        2016-01-04 00:00:00.000      1786    3900  B17N9P6        1   
1        2016-01-04 00:00:00.000      2041    1898  B1JNK84        2   
2        2016-01-04 00:00:00.000      2981    CBMG  B9F9PM1        3   
3        2016-01-04 00:00:00.000      3547     MWA  B15RZR4        4   
4        2016-01-04 00:00:00.000      3570    TPLM  B065Y40        5   
5        2016-01-05 00:00:00.000      3995    MHGC  B0YRYS2        1   
6        2016-01-05 00:00:00.000      4110     DSW  B0BV2V6        2   
7        2016-01-05 00:00:00.000      4874     NWE  B03PGL4        3   
8        2016-01-05 00:00:00.000      4982    BOFI  B0637D4        4   
9        2016-01-05 00:00:00.000      5082   ISCTR  B03MYS8        5   
10       2016-01-05 00:00:00.000      5083   KCHOL  B03MVJ8        6

so what I need to do is index on tradeDate and assetIds, and then reindex so that it creates new rows for each tradedate so that there is always the same amount of assetIds. This will create rows with NaNs. I'm confused by this step because I'm not sure the syntax when using multiple indexes and when the columns tradeDate and assetId don't exactly match up.

Finally, I'm going to have to replace the NaNs with the last available value. However, again I have to do it for both indexes for tradeDate and assetId. Appreciate the help on this

Dick Thompson
  • 599
  • 1
  • 12
  • 26

1 Answers1

2

The first part of your question was cleverly answered here by TomAugspurger.

Basically you need to use a smart trick that consists of setting the Multiindex, unstacking your data frame, filling the missing values and stacking it again. Then you should convert filled missing values to NaNs again and use the .fillna() method implemented in pandas with method='ffill' in order to replace the NaNs with the last available value.

First, let's generate the data we need:

import pandas as pd
import numpy as np

data =  [['2016-01-04 00:00:00.000', 1786, '3900', 'B17N9P6', 1],
['2016-01-04 00:00:00.000', 2041, '1898', 'B1JNK84', 2],
['2016-01-04 00:00:00.000', 2981, 'CBMG', 'B9F9PM1', 3],
['2016-01-04 00:00:00.000', 3547, 'MWA', 'B15RZR4', 4],
['2016-01-04 00:00:00.000', 3570, 'TPLM', 'B065Y40', 5],
['2016-01-05 00:00:00.000', 3995, 'MHGC', 'B0YRYS2', 1],
['2016-01-05 00:00:00.000', 4110, 'DSW', 'B0BV2V6', 2],
['2016-01-05 00:00:00.000', 4874, 'NWE', 'B03PGL4', 3],
['2016-01-05 00:00:00.000', 4982, 'BOFI', 'B0637D4', 4],
['2016-01-05 00:00:00.000', 5082, 'ISCTR', 'B03MYS8', 5],
['2016-01-05 00:00:00.000', 5083, 'KCHOL', 'B03MVJ8', 6]]

cols  = ['tradeDate','assetId','ticker','Sedol','price']

Then to tackle the problem:

df = pd.DataFrame.from_records(data, columns=cols)

# I used `-1` to fill the empty values, but use whatever value
# does not occur in the dataset
df = df.set_index(['tradeDate', 'assetId']).unstack().fillna(-1).stack()

# Once you have the data frame stacked again, you need to
# convert unknown values to np.nan again
df.replace(-1, np.nan, inplace=True)

# Fill with latest value for given assetId
df = df.groupby(level=['assetId']).fillna(method='ffill')

Hope this answer covers your question, let us know otherwise.

  • thanks so much! that's exactly what I needed, didn't know about the stack function. ffill is what I need to forward fill, the one thing is that I don't want to include assets that weren't available at all in the beginning start date (for example, an IPO or something). So this method still shows NAs for assets that couldn't be forward filled, which means there's no initial backward value to fill. That means if I don't want to see assets until they are first traded, I should delete rows with NAs- correct? Thanks again!! – Dick Thompson Nov 09 '17 at 17:23
  • Exactly. By this approach you will get a NaN for any date before first trade/IPO for a given asset. So you just need to drop these rows for example using `.dropna()` method. – Roni Chikhmous Nov 09 '17 at 19:24