
Using the pandas API for Python, I want to calculate, for each timestamp, the number of unique devices seen on the account in the 48 hours leading up to that timestamp.

This is what my data looks like:

╔═════════════════════╦══════════╦═══════════╗
║      timestamp      ║  device  ║ accountid ║
╠═════════════════════╬══════════╬═══════════╣
║ 2018-10-29 18:52:30 ║ d1ed6e6  ║ DhHUXPw   ║
║ 2018-11-01 18:52:30 ║ d123ff96 ║ zgffRDY   ║
║ 2018-11-01 20:53:30 ║ e322ff96 ║ zgffRDY   ║
║ 2018-11-02 21:33:30 ║ g133gf42 ║ zgffRDY   ║
║ 2018-11-15 18:52:30 ║ d123ff96 ║ awfdsct   ║
║ 2018-11-17 08:25:30 ║ d123ff96 ║ awfdsct   ║
╚═════════════════════╩══════════╩═══════════╝
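
For reproducibility, this frame can be built like so (a minimal sketch; the timestamp strings are parsed to datetimes, which the code below assumes):

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2018-10-29 18:52:30', '2018-11-01 18:52:30', '2018-11-01 20:53:30',
        '2018-11-02 21:33:30', '2018-11-15 18:52:30', '2018-11-17 08:25:30',
    ]),
    'device': ['d1ed6e6', 'd123ff96', 'e322ff96', 'g133gf42', 'd123ff96', 'd123ff96'],
    'accountid': ['DhHUXPw', 'zgffRDY', 'zgffRDY', 'zgffRDY', 'awfdsct', 'awfdsct'],
})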

I expect the output to look like this. Essentially, for account zgffRDY at 2018-11-02 21:33:30 we have seen 3 unique devices in the last 48 hours, whereas at 2018-11-01 18:52:30 we have only seen 1 device (the current device).

╔═════════════════════╦══════════╦═══════════╦═══════════════════════════╗
║      timestamp      ║  device  ║ accountid ║ last_48hour_device_count  ║
╠═════════════════════╬══════════╬═══════════╬═══════════════════════════╣
║ 2018-10-29 18:52:30 ║ d1ed6e6  ║ DhHUXPw   ║                         1 ║
║ 2018-11-01 18:52:30 ║ d123ff96 ║ zgffRDY   ║                         1 ║
║ 2018-11-01 20:53:30 ║ e322ff96 ║ zgffRDY   ║                         2 ║
║ 2018-11-02 21:33:30 ║ g133gf42 ║ zgffRDY   ║                         3 ║
║ 2018-11-15 18:52:30 ║ d123ff96 ║ awfdsct   ║                         1 ║
║ 2018-11-17 08:25:30 ║ d123ff96 ║ awfdsct   ║                         1 ║
╚═════════════════════╩══════════╩═══════════╩═══════════════════════════╝

My current code looks like this.

import pandas as pd

count_list = []

for idx, row in df.iterrows():
    account = row['accountid']
    current_time = row['timestamp']
    earliest = current_time - pd.to_timedelta('48 hours')

    # All rows for this account in the 48-hour window ending at (and
    # including) the current row.
    filtered_data = df.query('timestamp >= @earliest and '
                             'timestamp <= @current_time and '
                             'accountid == @account')

    count_list.append(filtered_data['device'].nunique())

df['last_48hour_device_count'] = count_list

I get the correct output, but my code runs far too slowly, and my dataset has a large number of observations.

Do you know of a better way to solve this?

pandasuser
  • Possible duplicate of [Pandas count(distinct) equivalent](https://stackoverflow.com/questions/15411158/pandas-countdistinct-equivalent) – aghast Jan 25 '19 at 01:46

2 Answers


The logic you're applying isn't entirely clear from the description, but pandas' groupby method should give you what you're looking for.

The call would look something like this:

df.groupby(['timestamp','accountId']).cumcount()
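
For reference, cumcount just numbers the rows within each group starting from zero; a minimal sketch on a toy frame (not the question's data):

import pandas as pd

toy = pd.DataFrame({'accountid': ['a', 'a', 'b', 'a'],
                    'device':    ['d1', 'd2', 'd1', 'd2']})

# 0-based running count of rows within each accountid group.
print(toy.groupby('accountid').cumcount().tolist())  # [0, 1, 0, 2]
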
Chris
  • Sorry, I needed to clarify my question. I have updated it now. Hope it's more helpful! – pandasuser Jan 25 '19 at 01:34
  • same method holds. you'd just swap out 'device' for 'accountId', as I've now done above. did you try it? – Chris Jan 25 '19 at 01:37
  • Yes I did, but that's not what I exactly need. I need the cumulative count over the last 48 hours at each timestamp. – pandasuser Jan 25 '19 at 01:43

You are focused on account IDs, so my suggestion would be to group by the accountid field first.

With the device id field added in, this becomes very similar to this SO question. So I think your eventual result looks like:

cutoff = pd.Timestamp.now() - pd.to_timedelta('48 hours')
df = df[df['timestamp'] > cutoff]
df.groupby('accountid').device.nunique()
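
For reference, on the sample frame from the question (skipping the cutoff filter, since none of the sample rows fall within 48 hours of now), the final groupby produces one count per account rather than one per row:

df.groupby('accountid').device.nunique()
# accountid
# DhHUXPw    1
# awfdsct    1
# zgffRDY    3
# Name: device, dtype: int64
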
aghast
  • Yes that could have worked, but I need to calculate nunique() only over the last 48 hours of the timestamps in the 'timestamp' field. Does that help clarify my question? – pandasuser Jan 25 '19 at 04:27
  • The last 48 hours constraint is constant. Just populate your df with records that match (`df[df.timestamp > X]`) before you do anything else. – aghast Jan 25 '19 at 04:33
  • We are getting close. But I can't use `pd.Timestamp.now()` as the cutoff is different for each observation. For example, I'd have to do `cutoff = current_row.timestamp - pd.to_timedelta('48 hours')` for each observation in a for-loop, and the for-loop is what I am trying to avoid. – pandasuser Jan 25 '19 at 05:00
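
One loop-free possibility along these lines is a per-account, time-based rolling window. This is only a sketch, assuming the sample frame built above, a pandas version that supports offset-based rolling windows, and a helper column device_code that is not part of the original data:

import pandas as pd

df = df.sort_values(['accountid', 'timestamp'])

# rolling() only accepts numeric data, so map each device id to an integer code.
df['device_code'] = pd.factorize(df['device'])[0]

rolled = (
    df.set_index('timestamp')
      .groupby('accountid')['device_code']
      .rolling(pd.Timedelta('48 hours'))        # window ending at each row
      .apply(lambda w: w.nunique(), raw=False)  # unique devices inside the window
)

# Groups come back ordered by accountid and, within each group, by timestamp,
# matching the sort above, so the values can be assigned positionally.
df['last_48hour_device_count'] = rolled.to_numpy().astype(int)

Note that the rolling window is closed on the right by default, so it includes the current row (as in the expected output) and excludes anything exactly 48 hours earlier.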