
I'm trying to build my own stock screener with pandas. I fetch data with pandas-datareader's DataReader, using the Morningstar source. Here is my data structure:

  • 'Close', 'Volume', 'Low', 'High', etc. are columns;
  • 'Date' values are rows. I could be wrong, but I have chosen to index by 'Symbol' (AAPL, WMLP, etc.), with 'Date' as a sublevel of the index.

When I add columns of stock metrics on the right ('Daily Return', 'Rolling Mean 20', etc.), I have a major problem: instead of writing a NaN value where the calculation cannot be made because there is no previous data to use, pandas uses the last value of the previous stock. See below the Daily Return of WMLP on 2018-07-02:

df = df.reset_index().set_index(['Symbol', 'Date'])
df
                      Close     High       Low      Open    Volume  Daily Return
Symbol Date                                                                     
AAPL   2018-07-02  187.1800  187.300  183.4200  183.8200  17731343           NaN
       2018-07-03  183.9200  187.950  183.5400  187.7900  13954806     -0.017416
       2018-07-04  183.9200  183.920  183.9200  183.9200         0      0.000000
       2018-07-05  185.4000  186.410  184.2800  185.2600  16604247      0.008047
       2018-07-06  187.9700  188.434  185.2000  185.4200  17485245      0.013862
WMLP   2018-07-02    1.3700    1.520    1.3300    1.3700      9139     -0.992712
       2018-07-03    1.4000    1.690    1.3800    1.6900      8075      0.021898
       2018-07-04    1.4000    1.400    1.4000    1.4000         0      0.000000
       2018-07-05    1.6000    1.600    1.4124    1.4124     29383      0.142857
       2018-07-06    1.7095    1.770    1.5200    1.6000     42599      0.068437

I want the Daily Return of WMLP on 2018-07-02 to be NaN, without having to force it manually. (The -0.992712 is (1.37 - 187.97) / 187.97, i.e. a "daily return" computed from AAPL's last close to WMLP's first close.)

Is it a problem with my indexing method? Or would you suggest a smarter data structure to avoid the problem? Note that I understand that Panels are deprecated, and I have not learned xarray yet.

EDIT: Per your request, please find below the other bits of code that come at the beginning:

import pandas as pd
import pandas_datareader.data as web

start = '2018-07-02'
end = '2018-07-07'
WMLP = web.DataReader('WMLP', 'morningstar', start, end)
AAPL = web.DataReader('AAPL', 'morningstar', start, end)
frames = [AAPL, WMLP]
df = pd.concat(frames)
# pct_change runs down the whole concatenated column,
# so it crosses the boundary between AAPL and WMLP
df['Daily Return'] = df['Close'].pct_change(1)

EDIT: Following the response from Peter (which works), I tried to apply the groupby just once at the beginning, in order to make the later code simpler and more readable:

df = df.groupby('Symbol')
df['Daily Return'] = df['Close'].pct_change(1)  # raises the TypeError below
df['Rolling 20'] = df['Close'].rolling(20).mean()

However, this produces "TypeError: 'DataFrameGroupBy' object does not support item assignment".

I am grateful for the groupby solution, but would welcome other ideas.
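For reference, one pattern that seems to avoid the error (a sketch only, assuming the ('Symbol', 'Date') MultiIndex from above; grouped_close is a name I made up): a DataFrameGroupBy is not a DataFrame and does not support column assignment, so keep df as a DataFrame, group the 'Close' column once, and assign the derived columns back to df:

# Reuse one grouped Series for several per-symbol metrics
grouped_close = df.groupby(level='Symbol')['Close']

# GroupBy implements pct_change itself, so each symbol restarts with NaN
df['Daily Return'] = grouped_close.pct_change(1)

# transform keeps the original ('Symbol', 'Date') index,
# so the result aligns with df when assigned back
df['Rolling 20'] = grouped_close.transform(lambda s: s.rolling(20).mean())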

citizen007

1 Answer


You can df.groupby('Symbol'), then apply pct_change(periods=1) to the Close values of each group. Starting with your MultiIndexed DataFrame df:

df['Daily Return'] = df.groupby('Symbol')['Close'].apply(lambda x: x.pct_change(1))
df

                      Close     High       Low      Open    Volume  Daily Return
Symbol Date                                                                     
AAPL   2018-07-02  187.1800  187.300  183.4200  183.8200  17731343           NaN
       2018-07-03  183.9200  187.950  183.5400  187.7900  13954806     -0.017416
       2018-07-04  183.9200  183.920  183.9200  183.9200         0      0.000000
       2018-07-05  185.4000  186.410  184.2800  185.2600  16604247      0.008047
       2018-07-06  187.9700  188.434  185.2000  185.4200  17485245      0.013862
WMLP   2018-07-02    1.3700    1.520    1.3300    1.3700      9139           NaN
       2018-07-03    1.4000    1.690    1.3800    1.6900      8075      0.021898
       2018-07-04    1.4000    1.400    1.4000    1.4000         0      0.000000
       2018-07-05    1.6000    1.600    1.4124    1.4124     29383      0.142857
       2018-07-06    1.7095    1.770    1.5200    1.6000     42599      0.068437

As usual, there's probably a more elegant or performant solution, but this should work for reasonable data sizes.
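For what it's worth, GroupBy objects implement pct_change directly, so the apply/lambda can likely be dropped; a minimal sketch assuming the same MultiIndexed df:

# Same result without the Python-level lambda
df['Daily Return'] = df.groupby('Symbol')['Close'].pct_change(1)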

Peter Leimbigler