2

I have 100,000 rows of data in the following format:

import pandas as pd

data = {'ID': [1, 1, 3, 3, 4, 3, 4, 4, 4],
        'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:01', '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}

df = pd.DataFrame(data)

   ID           timestamp
0   1 2014-12-23 16:53:00
1   1 2014-12-23 17:00:00
2   3 2014-12-23 17:01:00
3   3 2014-12-23 17:02:00
4   4 2014-12-23 17:00:00
5   3 2014-12-23 17:06:00
6   4 2014-12-23 17:15:00
7   4 2014-12-23 17:16:00
8   4 2014-12-23 17:20:00

The ID represents a user and the timestamp is when that user visited a website. I want to get information about sessions using pandas, where each session on this site is a maximum of 15 mins long. A new session starts once the user has been logged on for 15 mins. For the above sample data, the desired result would be:

   ID  session_start   session_duration
0  1   12/23/14 16:53. 7 min
1  3   12/23/14 17:02. 4 min
2  4   12/23/14 17:00. 15 min
3  4   12/23/14 17:16. 4 min

Let me know if there is information I should add to clarify. I can't seem to get a working solution. Any help is appreciated!

EDIT: In response to queries below, I noticed a mistake in my example. Sorry guys it was very late at night!

The problem that I am struggling with has mostly to do with user 4. They are still logged in after 15 minutes, and I want to capture in my data that a new session has started.

The reason that my problem is slightly different from this Groupby every 2 hours data of a dataframe is because I want to do this based on individual users.

ellen
  • 571
  • 7
  • 23
  • 1
    why your output doesn't contain one more entry for id = 1? id = 4 has 2 entries why id = 1 has one entry? – deadshot Aug 19 '20 at 05:45
  • why first date for group `4` is `12/23/14 17:02` and not `12/23/14 17:00` ? also why second is `12/23/14 17:02` ? – jezrael Aug 19 '20 at 05:55
  • Does this answer your question? [How to calculate time difference by group using pandas?](https://stackoverflow.com/questions/40804265/how-to-calculate-time-difference-by-group-using-pandas) – Trenton McKinney Aug 19 '20 at 05:55
  • It could be done with `g = df.groupby('ID'); duration = g.max() - g.min()` and then join with `g.first()` BUT you don't explain if there are more dates involved... what if ID 5 has `12/23/14 17:02` followed by `12/25/14 01:08` – RichieV Aug 19 '20 at 05:56
  • @jezrael maybe this one [Groupby every 2 hours data of a dataframe](https://stackoverflow.com/questions/48341009) is closer. I think you've answered all these at one time or another. – Trenton McKinney Aug 19 '20 at 06:00
  • Ya, here is main problem data from input data not match expected output (it seems missing some values, if P need new column, then length of output should be same like inpud data) – jezrael Aug 19 '20 at 06:02

1 Answers1

1

Not pretty, but here's a solution. The basic idea is to use groupby with diff to calculate differences between timestamps for each ID, but I couldn't find a nice way to only diff every 2 rows. So this approach uses diff for every row, then selects the diff result from every other within each ID.

Note that I'm assuming that the dataframe is properly ordered. Also note that your sample data had an extra entry for ID==1 that I removed.

import pandas as pd

data = {'ID': [1, 1, 3, 4, 3, 4, 4, 4],
        'timestamp': ['12/23/14 16:53', '12/23/14 17:00', '12/23/14 17:02', '12/23/14 17:00', '12/23/14 17:06', '12/23/14 17:15', '12/23/14 17:16', '12/23/14 17:20']}

df = pd.DataFrame(data)
df['timestamp']=pd.to_datetime(df['timestamp'])

# groupby to get difference between each timestamp 
df['diffs'] = df.groupby('ID')['timestamp'].diff()

# count every time ID appears 
df['counts'] = df.groupby('ID')['ID'].cumcount()+1

print("after diffs and counts:")
print(df)

# select entries for every 2nd occurence (where df['counts'] is even)
new_df = df[df['counts'] % 2 == 0][['ID','timestamp','diffs']]

# timestamp here will be the session endtime so subtract the 
# diffs to get session start time 
new_df['timestamp'] = new_df['timestamp'] - new_df['diffs']

# and a final rename
new_df = new_df.rename(columns={'timestamp':'session_start','diffs':'session_duration'})
print("\nfinal df:")
print(new_df)

Will print out

after diffs and counts:
   ID           timestamp           diffs  counts
0   1 2014-12-23 16:53:00             NaT       1
1   1 2014-12-23 17:00:00 0 days 00:07:00       2
2   3 2014-12-23 17:02:00             NaT       1
3   4 2014-12-23 17:00:00             NaT       1
4   3 2014-12-23 17:06:00 0 days 00:04:00       2
5   4 2014-12-23 17:15:00 0 days 00:15:00       2
6   4 2014-12-23 17:16:00 0 days 00:01:00       3
7   4 2014-12-23 17:20:00 0 days 00:04:00       4

final df:
   ID       session_start session_duration
1   1 2014-12-23 16:53:00  0 days 00:07:00
4   3 2014-12-23 17:02:00  0 days 00:04:00
5   4 2014-12-23 17:00:00  0 days 00:15:00
7   4 2014-12-23 17:16:00  0 days 00:04:00

Then to get session_duration column as number of minutes instead of a timedelta object, you can do:

import numpy as np
new_df['session_duration']  = new_df['session_duration'] / np.timedelta64(1,'s') / 60.
chris
  • 1,267
  • 7
  • 20
  • sorry to throw a monkey wrench in things, but I just realized that [counts]%2 won't work, since sometimes there are data points in between that should be counted as a single session. I've added another row with user ID 3 to illustrate this – ellen Aug 20 '20 at 02:53
  • 1
    well you can add up the diffs however you want. If you did a `df.groupby('ID')['diffs'].sum()` it'd accumulate all the differences by user. but honestly if you don't have any extra logging info you're not going to get an accurate session duration here since you can't differentiate between a 2 min and 14 min session unless there happens to be another timestamp between. I suggest you map out how you want to calculate the duration for a single `ID` before figuring out how to aggregate it. – chris Aug 20 '20 at 16:19