
I've got transaction logs which record usage for a kiosk machine, and another set of logs for machine online/offline times. The transaction logs contain a datetime field which tells you when the transaction (or session) occurred.

        event_date           raw_data1                                    session_id                            ws_id
    0   2017-11-06 12:13:06  {'description': 'Home'}                      0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
    1   2017-11-06 12:13:41  {'description': 'AreYouStillThere'}          0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
    2   2017-11-06 12:14:09  {'description': 'AttractiveAnimation'}       0604e80d-1ae6-48d0-81bf-32ca1dc58e4c  machine2
    3   2017-11-07 10:06:15  {'description': 'Home'}                      e2e7565f-60b4-4e7b-a8f0-d0a9c384b283  machine13
    4   2017-11-07 10:06:27  {'description': 'AuthenticationPanelAdmin'}  e2e7565f-60b4-4e7b-a8f0-d0a9c384b283  machine13

The goal is to see which session_ids coincide with an offline log entry. The offline dataframe looks like this:

        dtrange                                             start                end                  status  machine_id
    0   DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:46:15  2017-11-17 15:01:15  2       12
    1   DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:47:02  2017-11-17 15:02:02  2       22
    2   DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:47:23  2017-11-17 15:02:23  2       18
    3   DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:48:09  2017-11-17 15:03:09  2       17
    4   DateTimeTZRange(datetime.datetime(2017, 11, 17...  2017-11-17 14:49:18  2017-11-17 15:04:18  2       15

ws_id and machine_id refer to the same machine, which makes this a little trickier: both the session time and the machine must match across the two dataframes.

This is the code I'm using to return all session_ids that occurred while a machine was offline. It filters the offline dataframe against each row of the transaction dataframe and returns the session_id if an offline event coincided with the session time:

    def CheckSession(machinename, sessiontime, sessionid):
        # Return the session_id if any offline window for this machine
        # contains the session timestamp
        if len(offlinedf[(offlinedf.start < sessiontime)
                         & (offlinedf.end > sessiontime)
                         & (offlinedf.name == machinename)]) > 0:
            return sessionid

    sessions = df.apply(lambda row: CheckSession(row["name"], row["created_at1"], row["session_id"]), axis=1)

This builds the list of sessions, but it is very slow and the dataframes are quite large. I'm still learning how best to work with the pandas library. I was hoping to optimise it using some vectorisation, but I haven't been able to work out how to build it that way.

Leon Kyriacou

1 Answer


Consider merging df and offlinedf by name and then filtering with query according to the logic inside your function. Then convert the filtered dataframe's session_id column to a list.

    # Inner join on machine name, then keep rows whose session time
    # falls inside an offline window
    session_df = df.merge(offlinedf, on='name', suffixes=['', '_'])\
                   .query('start < created_at1 & end > created_at1')

    sessions = session_df['session_id'].tolist()

In data analysis work, blockwise handling of objects is generally much faster than iterative row processing.
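
If the merged frame is too large to fit in memory, one alternative is to compare each machine's sessions against only that machine's offline windows with a NumPy broadcast. This is a sketch, not part of the original answer, and it assumes the column names used in the question's code (name, created_at1, session_id, start, end):

    import numpy as np
    import pandas as pd

    def offline_session_ids(df, offlinedf):
        # Broadcast one machine's session times against its offline windows
        # at a time, so the boolean matrix stays per-machine sized
        hits = []
        for machine, grp in df.groupby('name'):
            off = offlinedf[offlinedf['name'] == machine]
            if off.empty:
                continue
            t = grp['created_at1'].values[:, None]           # shape (n_sessions, 1)
            inside = ((off['start'].values[None, :] < t)
                      & (off['end'].values[None, :] > t)).any(axis=1)
            hits.append(grp.loc[inside, 'session_id'])
        return pd.concat(hits).tolist() if hits else []

    sessions = offline_session_ids(df, offlinedf)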

Parfait
  • I gave this a go, couldn't help but run into memory errors unfortunately – Leon Kyriacou May 03 '18 at 11:16
  • How large are these datasets? I know logs can extend to the GBs. Do they originate in a database? If so, you can run the same merge and filter with a SQL `JOIN` and import the result into pandas. – Parfait May 03 '18 at 14:05
  • Yeah they're large, I agree - an SQL query may be the best solution to my problem. Thanks for the help Parfait :) – Leon Kyriacou May 03 '18 at 14:59
  • Curious, once again: where do these datasets come from? Text files? – Parfait May 03 '18 at 15:30
  • Stored in SQL tables on our server. So far I've been able to download them locally to analyse as it's quite small – Leon Kyriacou May 03 '18 at 15:53
  • Then consider running an `INNER JOIN` with a `WHERE` clause and importing the result into pandas with [`read_sql`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html), or python cursors with [results bound to a dataframe](https://stackoverflow.com/a/12060886/14224510). – Parfait May 03 '18 at 15:59
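
A minimal sketch of that suggestion follows. The table names (transactions, offline_log) and the connection string are assumptions for illustration, not the real schema; the join condition mirrors the merge-and-query logic above:

    import pandas as pd
    # from sqlalchemy import create_engine   # connection details are placeholders
    # engine = create_engine('postgresql://user:password@host/dbname')

    # Push the join and filter into the database so only matching
    # session_ids come back to pandas
    query = """
        SELECT t.session_id
        FROM transactions AS t
        INNER JOIN offline_log AS o
            ON o.machine_id = t.ws_id
        WHERE t.event_date > o.start
          AND t.event_date < o."end"   -- "end" is reserved in some SQL dialects
    """
    # sessions = pd.read_sql(query, engine)['session_id'].tolist()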