
I work with pandas dataframes in Python, and I have a large dataframe containing users and their data. Each user can have multiple rows, and I want to sample one row per user. My current solution seems inefficient:

import pandas as pd

df1 = pd.DataFrame({'User': ['user1', 'user1', 'user2', 'user3', 'user2', 'user3'],
                    'B': ['B', 'B1', 'B2', 'B3', 'B4', 'B5'],
                    'C': ['C', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D', 'D1', 'D2', 'D3', 'D4', 'D5'],
                    'E': ['E', 'E1', 'E2', 'E3', 'E4', 'E5']},
                   index=[0, 1, 2, 3, 4, 5])

df1
>>  B   C   D   E   User
0   B   C   D   E   user1
1   B1  C1  D1  E1  user1
2   B2  C2  D2  E2  user2
3   B3  C3  D3  E3  user3
4   B4  C4  D4  E4  user2
5   B5  C5  D5  E5  user3

userList = list(df1.User.unique())
userList
> ['user1', 'user2', 'user3']

Then I loop over the unique users list and sample one row per user, saving the samples to a different dataframe:

usersSample = pd.DataFrame()  # empty dataframe to collect the samples
for i in userList:
    usersSample = usersSample.append(df1[df1.User == i].sample(1))

> usersSample   
B   C   D   E   User
0   B   C   D   E   user1
4   B4  C4  D4  E4  user2
3   B3  C3  D3  E3  user3

Is there a more efficient way of achieving this? I'd really like to: 1) avoid appending to the dataframe usersSample, since this gradually growing object seriously hurts performance, and 2) avoid looping over users one at a time. Is there a way to sample one row per user more efficiently?

piRSquared
Ruslan

4 Answers


This is what you want:

df1.groupby('User').apply(lambda df: df.sample(1))

(output: one random row per user, with 'User' prepended as an extra index level)

Without the extra index level:

df1.groupby('User', group_keys=False).apply(lambda df: df.sample(1))

(output: one random row per user, keeping the original index)
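
A note beyond the original answer: on newer pandas (1.1+), GroupBy.sample does this in one call, with no apply needed; random_state is optional and just makes the draw reproducible.

df1.groupby('User').sample(n=1)                  # one random row per group
df1.groupby('User').sample(n=1, random_state=0)  # reproducible draw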

piRSquared

Depending on the number of rows per user, this might be faster:

df1.sample(frac=1).drop_duplicates(['User'])
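
Why this works (a note not part of the original answer): sample(frac=1) shuffles the whole frame, so the first row that drop_duplicates keeps for each user is a uniformly random one. Pass random_state to make the shuffle reproducible:

df1.sample(frac=1, random_state=42).drop_duplicates(['User'])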
ayhan
  • This is quicker on `df1`. Slower on `pd.concat([df for _ in range(100000)])` – piRSquared Jul 15 '16 at 07:56
  • @piRSquared I tried it on a larger dataset but User column's type was integer (117ms vs 3.96s) but it seems when the type is object it is always slower. – ayhan Jul 15 '16 at 08:06
  • @piRSquared Ah sorry, the difference comes from the number of users. The number of users is quite small in this example. – ayhan Jul 15 '16 at 08:09
  • `df= pd.DataFrame('user' + pd.Series(np.random.choice(range(1000), (1000000))).astype(str), columns=['User'])` 358 ms for you, 4.22 s for me. – piRSquared Jul 15 '16 at 08:17
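
For reference, a rough sketch (mine, not from the thread) of how one might reproduce the comparison on the kind of frame piRSquared describes; exact timings will differ by machine and pandas version:

import numpy as np
import pandas as pd
from timeit import timeit

# ~1,000 distinct users spread over 1,000,000 rows
df = pd.DataFrame({'User': 'user' + pd.Series(np.random.choice(1000, 1_000_000)).astype(str)})

# average seconds per run over 3 runs for each approach
t_groupby = timeit(lambda: df.groupby('User', group_keys=False).apply(lambda g: g.sample(1)), number=3) / 3
t_shuffle = timeit(lambda: df.sample(frac=1).drop_duplicates(['User']), number=3) / 3
print(f'groupby+apply: {t_groupby:.3f}s, shuffle+dedup: {t_shuffle:.3f}s')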
df1_user_sample_one = df1.groupby('User').apply(lambda x: x.sample(1))

This uses DataFrame.groupby followed by apply with a lambda to sample one row per group.

TED Zhao

.drop_duplicates should work just fine:

df1.drop_duplicates(subset='User')

This keeps the first occurrence of each value in the 'User' column and returns the corresponding rows.
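
Worth noting (my addition, not part of the original answer): this is deterministic rather than a random sample; the keep parameter only switches which duplicate survives:

df1.drop_duplicates(subset='User', keep='last')  # last row per user instead of the first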

11lll
  • _This will keep each first occurrence of a value_ Isn't the question about a sample? – AMC Jun 25 '20 at 21:21