Convert SQL Lag over PartitionBy query to pandas

Question

I have a table with some customer purchase data. I would like to know what was the entry time & exit time of the customer in the shop & I have written the below SQL query for that. How can I convert this to python pandas?

SELECT MyTable.*, 
                   LAG(EventTypeID, 1, 0) 
                      OVER ( PARTITION BY ID,Name
                         ORDER BY Time) AS LastEvent , 
                   LEAD(EventTypeID, 1, 0) 
                      OVER ( PARTITION BY ID,Name
                         ORDER BY Time) AS NextEvent 
                FROM DL.dbo.DataTable MyTable

Input:

+-------------+--------+--------+-------+
| EventTypeID |   ID   |  Name  | Time  |
+-------------+--------+--------+-------+
|           1 | QWERTY | Joseph | 10.20 |
|           1 | QWERTY | Joseph | 10.25 |
+-------------+--------+--------+-------+

Desired result:

+-------------+--------+--------+-------+-----------+-----------+
| EventTypeID |   ID   |  Name  | Time  | LastEvent | NextEvent |
+-------------+--------+--------+-------+-----------+-----------+
|      1      | QWERTY | Joseph | 10.20 |         0 |         1 |
|      1      | QWERTY | Joseph | 10.25 |         1 |         0 |
+-------------+--------+--------+-------+-----------+-----------+

would reading sql directly into dataframe with `pd.read_sql` be an option? — YevKad, Jan 05 '21 at 11:56
Please take look at this answer: https://stackoverflow.com/questions/30673209/pandas-compare-next-row — Lev Gelman, Jan 05 '21 at 12:36

score 0 · Answer 1 · answered Jan 05 '21 at 17:36

df['LastEvent'] = df.sort_values(by=['Time'], ascending=True)\
                       .groupby(['ID','Name'])['EventTypeID'].shift(1)

df['NextEvent'] = df.sort_values(by=['Time'], ascending=True)\
                       .groupby(['ID','Name'])['EventTypeID'].shift(-1)

Thanks to Lev Gelman for the direction. The above code does the trick!

score 0 · Answer 2 · answered Dec 17 '22 at 22:27

I just figured this out for a more complex case. We have to use df.join() here because the shift() method shifts the index of the dataframe and .join() works on the index. For a reason I've not figured out yet, using the 'on' operator in .join() really goofs this up.

#### Experiment - create a dataframe with two groups, shift by 1 to lag, join to get lags
obsgrp1 = ['Ring','Ring','Ring','Ring','Ring','Ring','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo','Arlo']
obsgrp2 = ['Door','Door','Side','Door','Side','Side','Roof','Window','Window','Window','Roof','Roof','Roof','Window','Roof']
obsdate = ['2022-10-03','2022-10-05','2022-10-06','2022-09-30','2022-10-12','2022-10-11','2022-10-10'
    ,'2022-10-02','2022-10-06','2022-10-07','2022-10-08','2022-10-13','2022-10-14','2022-10-15','2022-10-18','2022-10-20','2022-10-21','2022-10-22','2022-10-25','2022-10-30']
obsval = [37,41,16,22,33,31,40,93,88,89,99,92,97,94,91,92,93,90,88,87]

_grp = ['obsGroup1','obsGroup2']
_dat = ['obsDate','obsVal']

df = pd.DataFrame(zip(obsgrp1,obsgrp2,obsdate,obsval), columns=['obsGroup1','obsGroup2','obsDate','obsVal'])\
    .sort_values(by=['obsGroup1','obsGroup2','obsDate'], ascending=[True,False,False])
print(df.head(15))

# Shift 
sdf = df.groupby(_grp)[_dat].shift(-1)
print(sdf.head(15))

# Check
print(df.index)
print(sdf.index)

# Join
jdf = df.join(sdf, rsuffix = '_prev', how='left')
jdf.head(15)

Convert SQL Lag over PartitionBy query to pandas

2 Answers2