Create many lagged variables

Question

I have the following Python datatable:

import datatable
import numpy as np

np.random.seed(42)

dt = datatable.Frame({"A":np.repeat(np.arange(0, 2), 5), "B":np.random.normal(0, 1, 10)})
dt

#          A        B
#0         0        −0.342855
#1         0        0.0706784
#2         0        0.0470259
#3         0        −0.0522357
#4         0        −0.610938
#5         1        −2.62617
#6         1        0.550128
#7         1        0.538717
#8         1        −0.487166
#9         1        0.996788

I want to create 4 lagged columns for the column B for each value in column A. This will result in the following datatable:

#          A        B               B_lag_1         B_lag_2         B_lag_3         B_lag_4
#0         0        −0.342855       NA              NA              NA              NA
#1         0        0.0706784       −0.342855       NA              NA              NA
#2         0        0.0470259       0.0706784       −0.342855       NA              NA
#3         0        −0.0522357      0.0470259       0.0706784       −0.342855       NA
#4         0        −0.610938       −0.0522357      0.0470259       0.0706784       −0.342855
#5         1        −2.62617        NA              NA              NA              NA
#6         1        0.550128        −2.62617        NA              NA              NA
#7         1        0.538717        0.550128        −2.62617        NA              NA
#8         1        −0.487166       0.538717        0.550128        −2.62617        NA
#9         1        0.996788        −0.487166       0.538717        0.550128        −2.62617

How can I achieve this?

with pandas you could try to use `groupby`("A")` to work with every A separatelly, and `shift()` to move values to next rows. — furas, Jun 01 '22 at 17:27

furas · Accepted Answer · 2022-06-01T17:45:15.707

I never used datatable but pandas.DataFrame has groupby() and shift() and I found similar functions in datatable.

You can use:

by("A") to group rows by value in column A and work in every group separatelly
shift(datatable.f.B, n) to move values n-rows down in column B.

import datatable as dt
import numpy as np

np.random.seed(42)

df = dt.Frame({
    "A": np.repeat(np.arange(0, 2), 5), 
    "B": np.random.normal(0, 1, 10)
})

for n in range(1, 5):
    df[f'B_lag_{n}'] = df[:, dt.shift(dt.f.B, n), dt.by('A')]['B']
    
df

Result

   |     A          B    B_lag_1    B_lag_2    B_lag_3    B_lag_4
   | int64    float64    float64    float64    float64    float64
-- + -----  ---------  ---------  ---------  ---------  ---------
 0 |     0   0.496714  NA         NA         NA         NA       
 1 |     0  -0.138264   0.496714  NA         NA         NA       
 2 |     0   0.647689  -0.138264   0.496714  NA         NA       
 3 |     0   1.52303    0.647689  -0.138264   0.496714  NA       
 4 |     0  -0.234153   1.52303    0.647689  -0.138264   0.496714
 5 |     1  -0.234137  NA         NA         NA         NA       
 6 |     1   1.57921   -0.234137  NA         NA         NA       
 7 |     1   0.767435   1.57921   -0.234137  NA         NA       
 8 |     1  -0.469474   0.767435   1.57921   -0.234137  NA       
 9 |     1   0.54256   -0.469474   0.767435   1.57921   -0.234137
[10 rows x 6 columns]

Create many lagged variables

1 Answers1

Linked