0

I am new to python and pandas. I am trying to assign new session IDs for around 2270 users, based on the time difference between the timestamps. If the time difference exceeds 4 hours, I want a new session ID. Otherwise, it would have to remain the same. In the end, I want a modified data frame with the new session ID column. Here is what I have so far:

Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
users = Eh_2016['Username'].unique() #find the number of unique Usernames
for user in users: #start of the loop
    idx = Eh_2016.index[Eh_2016.Username == user] #Slice the original df
    temp = Eh_2016[Eh_2016.Username == user] #Create a temp placeholder for the slice
    counter = 1 # Initialize counter for NewSessionIDs
    for i,t in enumerate(temp['Timestamp']): #Looping for each timestamp value
        if i != 0 : 
            temp['elapsed'].iloc[i] = (t - temp['Timestamp'].iloc[i-1]) #Calculate diff btwn timestamps
            if temp['elapsed'].iloc[i] > datetime.timedelta(hours = 4): #If time diff>4
                counter +=1 #Increase counter value
                temp['NewSessionID'].iloc[i]=counter #Assign new counter value as NewSessionID
            else:
                temp['NewSessionID'].iloc[i] = counter #Retain previous sessionID
    Eh_2016.loc[idx,:]= temp #Replace original df with the updated slice

Any help on how to make this faster would be greatly appreciated! Let me know if you need more details. Thanks in advance.

Edit: Sample DF

        Username               Timestamp  NewSessionID                  Elapsed
126842  1095513 2016-06-30 20:58:30.477             1                 00:00:00
126843  1095513 2016-07-16 07:54:47.986             2  15 days 10:56:17.509000
126844  1095513 2016-07-16 07:54:47.986             2          0 days 00:00:00
126845  1095513 2016-07-16 07:55:10.986             2          0 days 00:00:23
126846  1095513 2016-07-16 07:55:13.456             2   0 days 00:00:02.470000
        ...                     ...           ...                      ...
146920  8641894 2016-08-11 22:26:14.051            31   0 days 04:50:23.415000
146921  8641894 2016-08-11 22:26:14.488            31   0 days 00:00:00.437000
146922  8641894 2016-08-12 20:01:02.419            32   0 days 21:34:47.931000
146923  8641894 2016-08-23 10:19:05.973            33  10 days 14:18:03.554000
146924  8641894 2016-09-25 11:30:35.540            34  33 days 01:11:29.567000
  • Welcome in SO. Please provide a [mcve](/help/mcve) – rpanai Apr 03 '20 at 18:05
  • Please include a sample source DataFrame and the expected result. – Valdi_Bo Apr 03 '20 at 18:14
  • Oh yea, no. Ruff. If you have a column with the date times you can use `pd.cut` to split it into months and assign them unique values that you can use to make random IDs. Check out this post https://stackoverflow.com/questions/43500894/pandas-pd-cut-binning-datetime-column-series There is another way to do this but I can't remember rn. – seeiespi Apr 03 '20 at 19:16

1 Answers1

0

Filtering the whole dataframe for each user is O(users*sessions), and it's not needed since you need to iterate over the whole thing anyway.

A more efficient approach would be to instead iterate over the dataframe in one pass, and store the temporary variables (counter, location of previous row, etc) in a separate dataframe indexed by users.

Eh_2016["NewSessionID"] = 1 #Initialize 'NewSessionID' column in df with 1
Eh_2016['elapsed'] = datetime.time(0,0,0,0) #Create an empty elapsed to calculate Time diff later
# create new dataframe of unique users
users = pd.DataFrame({'Username': Eh_2016['Username'].unique()}).set_index('Username') 
# one column for the previous session looked at for each user
users['Previous'] = -1
# one column for the counter variable
users['Counter'] = 0

# iterate over each row
for index, row in Eh_2016.iterrows(): #start of the loop
    user = row['Username']
    previous = users[user, 'Previous']
    if previous >= 0: # if this is not the first row for this user
        Eh_2016.loc[index, 'elapsed'] = (row['Timestamp'] - Eh_2016.loc[previous, 'Timestamp']) #Calculate diff btwn timestamps
        if Eh_2016.loc[index, 'elapsed'] > datetime.timedelta(hours = 4): #If time diff>4
            users[user,'Counter'] += 1 #Increase counter value
        Eh_2016.loc[index, 'NewSessionID'] = users[user,'Counter'] # Assign new counter value as NewSessionID
        users[user, 'Previous'] = index # remember this row as the latest row for this user
Christoph Burschka
  • 4,467
  • 3
  • 16
  • 31