
I have a dataframe of visits that includes a person ID column, and a given person may have more than one visit. I want to number the visits for a given person.

I can sort the dataframe by visit date, then iterate and count how many times each person ID has occurred so far (or count its occurrences between the first index and the current index). Is there a more efficient (ideally vectorized) way? (Note that I don't want the total counts, as from value_counts(); I want a new series that is a running count of visits per person.)

Here is a dummy dataframe where person 'a' has two visits:

In[1]: df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5], 'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})
In[2]: df['Person_ID'].value_counts()
Out[2]:
a    2
b    1
c    1
d    1
e    1
Name: Person_ID, dtype: int64

Example 1: incrementing dictionary counter

In[3]: person_visit_count = {}
       for index, row in df.iterrows():
           if row['Person_ID'] not in person_visit_count:
               person_visit_count[row['Person_ID']] = 1
           else:
               person_visit_count[row['Person_ID']] += 1
           df.loc[index, 'Person_Visit'] = person_visit_count[row['Person_ID']]
In[4]: df
Out[4]:
   Visit Person_ID  Person_Visit
0      0         a           1.0
1      1         b           1.0
2      2         c           1.0
3      3         a           2.0
4      4         d           1.0
5      5         e           1.0
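
(As an aside, a more compact spelling of the same loop, using collections.Counter and a single column assignment instead of row-wise df.loc writes; this is a sketch equivalent in behavior, and it also keeps the column integer-typed:)

In[3b]: from collections import Counter

        counts = Counter()
        visit_numbers = []
        for pid in df['Person_ID']:
            counts[pid] += 1                  # running tally for this person
            visit_numbers.append(counts[pid])
        df['Person_Visit'] = visit_numbers    # one assignment at the end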

Example 2: index-based counter

In[5]: for index, row in df.iterrows():
           df.loc[index, 'Person_Visit2'] = (df.loc[0:index, 'Person_ID'] == row['Person_ID']).sum()
In[6]: df
Out[6]: 
   Visit Person_ID  Person_Visit  Person_Visit2
0      0         a           1.0            1.0
1      1         b           1.0            1.0
2      2         c           1.0            1.0
3      3         a           2.0            2.0
4      4         d           1.0            1.0
5      5         e           1.0            1.0
  • IIUC you are looking for a grouped cumulative count: `df.groupby(['Person_ID'])['Visit'].cumcount() + 1`, which you can assign back as a column. – sophocles Aug 12 '22 at 17:18

2 Answers


Iterating is rarely the best option; you can group and take a cumulative count of the rows instead:

df['Person_Visit'] = df.groupby('Person_ID').transform('cumcount') + 1

    Visit   Person_ID   Person_Visit
0   0       a           1
1   1       b           1
2   2       c           1
3   3       a           2
4   4       d           1
5   5       e           1
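
If the visits can arrive out of order, sort by date before numbering; cumcount numbers the rows within each group in their current order. A minimal sketch, where the Visit_Date column name is hypothetical (the question only says the frame can be sorted by visit date):

import pandas as pd

df = pd.DataFrame({
    'Person_ID': ['a', 'b', 'a'],
    'Visit_Date': pd.to_datetime(['2022-03-01', '2022-01-15', '2022-01-02']),
})

df = df.sort_values('Visit_Date')                            # date order first
df['Person_Visit'] = df.groupby('Person_ID').cumcount() + 1  # then number within each person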

Just for comparison, here are the timings for the two solutions you gave and this one (measured with the timeit module):

groupby with cumcount

1.37 ms ± 177 µs per loop

incrementing dictionary counter

2.02 ms ± 321 µs per loop

index-based counter

3.61 ms ± 389 µs per loop
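
For reference, timings like these can be reproduced along the following lines; the frame size and repeat settings here are arbitrary choices of mine, not the ones used above:

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Visit': range(10_000),
    'Person_ID': np.random.choice(list('abcdefghij'), size=10_000),
})

def grouped_cumcount():
    return df.groupby('Person_ID').transform('cumcount') + 1

# Best of 5 repeats, 100 calls each, reported as milliseconds per call.
runs = timeit.repeat(grouped_cumcount, number=100, repeat=5)
print(f"{min(runs) / 100 * 1000:.2f} ms per loop")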
– G. Anderson
  • This looks like exactly what I wanted. Thanks @g-anderson. For extra education, what's the difference between your line, using `transform`, and @sophocles's, using `['Visit'].cumcount()`? – Paul Wright Aug 15 '22 at 08:31
  • It's basically the same operation "under the hood", but using `transform` is a stylistic choice, a way of doing the "which you can assign back as a column" step they mentioned in their comment. In this use case, `df['Person_Visit']=df.groupby(['Person_ID'])['Visit'].cumcount() + 1` runs in `1.1 ms ± 140 µs per loop`, which is slightly faster but in the same range, and YMMV for larger datasets or more complex transformations. Personally, I like to use the `.transform` method for all relevant applications, both for its versatility and its readability at a glance. – G. Anderson Aug 15 '22 at 14:49
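
To make the comparison in these comments concrete, here is a small check (my addition, not from the thread) that the two spellings produce the same running count on the dummy frame:

import pandas as pd

df = pd.DataFrame({'Visit': [0, 1, 2, 3, 4, 5],
                   'Person_ID': ['a', 'b', 'c', 'a', 'd', 'e']})

via_transform = df.groupby('Person_ID').transform('cumcount') + 1
via_cumcount = df.groupby(['Person_ID'])['Visit'].cumcount() + 1

assert (via_transform == via_cumcount).all()  # same values, same index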

Here's one idea. I thought of it while looking at the value_counts() output you created:

d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n + 1)))  # one counting iterator per person
df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))       # draw the next count in row order

In this solution, d acts as a key-value data structure where the key is the person ID and the value is an iterator; each call to next() yields that person's next visit number.
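
One caveat worth adding: each value in d is a single-use iterator, so the column assignment can only run once per construction of d. A small demonstration of the failure mode, reusing the question's df:

d = df['Person_ID'].value_counts().apply(lambda n: iter(range(1, n + 1)))
df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))  # works

# A second pass without rebuilding d would raise StopIteration on the first
# row: person 'a' has a two-element iterator, and both elements were consumed
# by the pass above.
# df['Person_Visit'] = df['Person_ID'].apply(lambda pid: next(d[pid]))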

– pwasoutside