2

I am currently working with a panel data of financial information on pandas, therefore working with different companies across different years. I am trying to generate a column of the $ invested shifted by 2 time periods. Hence, reporting the value of time t also at t+2.

Normally, to lag a variable, I would use df.groupby('tic')['investments'].shift(2) , however unfortunately my data is a bit spotty and therefore for the same company I might have a gap in the years. Just to give you an idea, here is an example of my df:

       datadate    fyear       tic   invest   
0    31/12/1998     1997      AAPL     12.3  
1    31/12/1999     1998      AAPL     14.5
2    31/12/2002     2002      AAPL      9.7   
3    31/12/2003     2003      AAPL     21.8   
4    31/12/2004     2004      AAPL     21.4
5    31/12/2005     2005      AAPL     18.9   
6    31/05/2008     2008      TSLA     11.5    
7    31/05/2009     2009      TSLA     13.7    
8    31/05/2010     2010      TSLA     19.4 
9    31/05/2011     2011      TSLA     14.5   
10   31/05/2014     2013      TSLA     14.8 
..      ...         ..         ..      ..    

Therefore, what I am trying to achieve is the following:

       datadate    fyear       tic   invest   l2_invest
0    31/12/1998     1997      AAPL     12.3         NaN
1    31/12/1999     1998      AAPL     14.5         NaN
2    31/12/2002     2002      AAPL      9.7         NaN
3    31/12/2003     2003      AAPL     21.8         NaN
4    31/12/2004     2004      AAPL     21.4         9.7
5    31/12/2005     2005      AAPL     18.9        21.8
6    31/05/2008     2008      TSLA     11.5         NaN
7    31/05/2009     2009      TSLA     13.7         NaN
8    31/05/2010     2010      TSLA     19.4        11.5
9    31/05/2011     2011      TSLA     14.5        13.7
10   31/05/2014     2013      TSLA     14.8        19.4
..      ...         ..         ..      ..    

Thank you in advance for your help :)

tigio_33
  • 149
  • 8

1 Answers1

3

Provided that the 'datadate' column is the table's index (and of type datetime64), the following code should produce the desired additional column:

df.groupby('tic')['invest'].shift(1, freq=pd.DateOffset(years=2))

Edit: it is still necessary to append this new column to the original table. Here are some more details.

First, make sure the 'datadate' column is of type datetime64:

df['datadate'] = pd.to_datetime(df['datadate'])

Then let's create the new column and append it to our table:

new_column = (df.set_index('datadate')
                .groupby('tic')['invest']
                .shift(1, freq=pd.DateOffset(years=2)))
df.set_index(['tic', 'datadate'], inplace=True)
df['l2_invest'] = new_column
df.reset_index(inplace=True)

Starting with the original table, this produces

     tic   datadate  fyear  invest  l2_invest
0   AAPL 1998-12-31   1997    12.3        NaN
1   AAPL 1999-12-31   1998    14.5        NaN
2   AAPL 2002-12-31   2002     9.7        NaN
3   AAPL 2003-12-31   2003    21.8        NaN
4   AAPL 2004-12-31   2004    21.4        9.7
5   AAPL 2005-12-31   2005    18.9       21.8
6   TSLA 2008-05-31   2008    11.5        NaN
7   TSLA 2009-05-31   2009    13.7        NaN
8   TSLA 2010-05-31   2010    19.4       11.5
9   TSLA 2011-05-31   2011    14.5       13.7
10  TSLA 2014-05-31   2013    14.8        NaN
  • Hey Vit! Thanks for your contribution, however, the code doesn't really work :P or at least, it does not what I am looking for. – tigio_33 Apr 15 '20 at 18:32
  • Hello, I added some more details. This made the whole code less concise than I expected, so I am sorry if you are looking for a more elegant solution. Also, my output at line 10 differs from that in your example table, which I suspect is a mistake in the example table. – Vít Navrátil Apr 15 '20 at 19:08
  • Hey Vit! Thanks for your edit! I tried to run the code but I received the following error: `AttributeError: 'DataFrame' object has no attribute 'DateOffset'`. Is there a package which I need to import? – tigio_33 Apr 15 '20 at 19:26
  • Hi Tigio, I am using `pd` in `pd.DateOffset` as an alias for the pandas library (which I imported using `import pandas as pd`). Maybe you are using a different alias? – Vít Navrátil Apr 15 '20 at 19:31
  • No I'm using the exact same one... that's very strange :P I'll look into it and get back to you - meanwhile, thank you for your help! – tigio_33 Apr 15 '20 at 20:08
  • Your error message looks like you are using `pd` as a variable name of a DataFrame. – Vít Navrátil Apr 15 '20 at 20:19
  • Hey Vit! Yeah it works now for some reason :P Thank you for your help! – tigio_33 Apr 17 '20 at 08:34