
I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.

I can sort the dataframe by visit date, then iterate and count how many times each person ID has occurred so far (or count its occurrences between the first index and the current index). Is there a more efficient (ideally vectorized) way? (Note that I don't want the total counts, as from value_counts(); I want a new series that is a running count of visits per person.)

Here is a dummy dataframe where person 'a' has two visits:

In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a    2
b    1
c    1
d    1
e    1
Name: Person_ID, dtype: int64

Example 1: incrementing dictionary counter

In[3]: person_visit_count = {}
       for index, row in df.iterrows():
           if row['Person_ID'] not in person_visit_count:
               person_visit_count[row['Person_ID']] = 1
           else:
               person_visit_count[row['Person_ID']] += 1
           df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
   Visit Person_ID  Person_Visit
0      0         a           1.0
1      1         b           1.0
2      2         c           1.0
3      3         a           2.0
4      4         d           1.0
5      5         e           1.0
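
(As an aside, a more compact spelling of the same loop, using collections.Counter and a single column assignment instead of row-wise df.loc writes; this is a sketch equivalent in behavior, and it also keeps the column integer-typed:)

In[3b]: from collections import Counter

        counts = Counter()
        visit_numbers = []
        for pid in df['Person_ID']:
            counts[pid] += 1                  # running tally for this person
            visit_numbers.append(counts[pid])
        df['Person_Visit'] = visit_numbers    # one assignment at the end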

Example 2: index-based counter

In[5]: for index, row in df.iterrows():
           df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]: 
   Visit Person_ID  Person_Visit  Person_Visit2
0      0         a           1.0            1.0
1      1         b           1.0            1.0
2      2         c           1.0            1.0
3      3         a           2.0            2.0
4      4         d           1.0            1.0
5      5         e           1.0            1.0
  • IIUC you are looking for a grouped cumulative count: `df.groupby(['Person_ID'])['Visit'].cumcount() + 1`, which you can assign back as a column. – sophocles Aug 12 '22 at 17:18

2 Answers


Iterating is rarely the best option; you can group and take a cumulative count of the rows instead:

df['Person_Visit'] = df.groupby('Person_ID').transform('cumcount') + 1

    Visit   Person_ID   Person_Visit
0   0       a           1
1   1       b           1
2   2       c           1
3   3       a           2
4   4       d           1
5   5       e           1
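
If the visits can arrive out of order, sort by date before numbering; cumcount numbers the rows within each group in their current order. A minimal sketch, where the Visit_Date column name is hypothetical (the question only says the frame can be sorted by visit date):

import pandas as pd

df = pd.DataFrame({
    'Person_ID': ['a', 'b', 'a'],
    'Visit_Date': pd.to_datetime(['2022-03-01', '2022-01-15', '2022-01-02']),
})

df = df.sort_values('Visit_Date')                            # date order first
df['Person_Visit'] = df.groupby('Person_ID').cumcount() + 1  # then number within each person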

Just for comparison, here are the timings for the two solutions you gave and this one (measured with the timeit module):

groupby with cumcount

1.37 ms ± 177 µs per loop

incrementing dictionary counter

2.02 ms ± 321 µs per loop

index-based counter

3.61 ms ± 389 µs per loop
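
For reference, timings like these can be reproduced along the following lines; the frame size and repeat settings here are arbitrary choices of mine, not the ones used above:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Visit': range(10_000),
    'Person_ID': np.random.choice(list('abcdefghij'), size=10_000),
})

def grouped_cumcount():
    return df.groupby('Person_ID').transform('cumcount') + 1

# Best of 5 repeats, 100 calls each, reported as milliseconds per call.
runs = timeit.repeat(grouped_cumcount, number=100, repeat=5)
print(f"{min(runs) / 100 * 1000:.2f} ms per loop")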
– G. Anderson
  • This looks like exactly what I wanted. Thanks @g-anderson. For extra education, what's the difference between your line, using `transform`, and @sophocles's, using `['Visit'].cumcount()`? – Paul Wright Aug 15 '22 at 08:31
  • It's basically the same operation "under the hood", but using `transform` is a stylistic choice, a way of doing the "which you can assign back as a column" step they mentioned in their comment. In this use case, `df['Person_Visit']=df.groupby(['Person_ID'])['Visit'].cumcount() + 1` runs in `1.1 ms ± 140 µs per loop`, which is slightly faster but in the same range, and YMMV for larger datasets or more complex transformations. Personally, I like to use the `.transform` method for all relevant applications, both for its versatility and its readability at a glance. – G. Anderson Aug 15 '22 at 14:49
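
To make the comparison in these comments concrete, here is a small check (my addition, not from the thread) that the two spellings produce the same running count on the dummy frame:

import pandas as pd

df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5],
                   'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})

via_transform = df.groupby('Person_ID').transform('cumcount') + 1
via_cumcount = df.groupby(['Person_ID'])['Visit'].cumcount() + 1

assert (via_transform == via_cumcount).all()  # same values, same index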

Here's one idea. I thought of it while looking at the value_counts() output you created:

d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n + 1)))  # one counting iterator per person
df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))       # draw the next count in row order

In this solution, d acts as a key-value data structure where the key is the person ID and the value is an iterator; each call to next() yields that person's next visit number.
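
One caveat worth adding: each value in d is a single-use iterator, so the column assignment can only run once per construction of d. A small demonstration of the failure mode, reusing the question's df:

d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n + 1)))
df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))  # works

# A second pass without rebuilding d would raise StopIteration on the first
# row: person 'a' has a two-element iterator, and both elements were consumed
# by the pass above.
# df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))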

– pwasoutside