Filtering dataframe based on column value_counts (pandas)

Question

I'm trying out pandas for the first time. I have a dataframe with two columns: user_id and string. Each user_id may have several strings, so it can appear in the dataframe multiple times. I want to derive another dataframe from this one, containing only the user_ids that have at least 2 strings associated with them.

I tried df[df['user_id'].value_counts()> 1], which I thought was the standard way to do this, but it raises IndexingError: Unalignable boolean Series key provided. Can someone clear up my misunderstanding and provide the correct alternative?

Related and probable dupe: https://stackoverflow.com/questions/30485151/python-pandas-exclude-rows-below-a-certain-frequency-count — EdChum, Jun 02 '17 at 13:08

score 9 · Accepted Answer · answered Jun 02 '17 at 13:07

I think you need transform, because the mask needs the same index as df. If you use value_counts, the index changes (it becomes the unique user_id values instead of the row labels), and that raises the error.

df[df.groupby('user_id')['user_id'].transform('size') > 1]
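
To see why the index matters, here is a minimal sketch on made-up data (the user ids and strings are assumptions for illustration, not taken from the question). It compares the index of `value_counts()` with the index of `transform('size')` and then applies the mask.

```python
import pandas as pd

# Illustrative data only; these values are assumptions for the example.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

# value_counts() is indexed by the unique user_id values (1, 2, 3),
# not by the row labels of df (0..5), so it does not line up with df's rows.
counts = df["user_id"].value_counts()

# transform('size') broadcasts each group's size back to every row,
# so the result shares df's index and can be used as a boolean mask.
sizes = df.groupby("user_id")["user_id"].transform("size")
mask = sizes > 1

filtered = df[mask]
print(filtered)
```

Here `mask` has one entry per row of `df`, which is what boolean indexing expects.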

jezrael

Can you explain what you mean by mask? – Hassan Baig Jun 02 '17 at 13:11
A mask is a condition like `df['user_id'].value_counts()> 1` – jezrael Jun 02 '17 at 13:14
A "mask" is basically a list of true or false values for a certain condition. Masks are normally used to subset data. Say you had a dataframe of dogs' names and ages, and you wanted to look only at dogs older than 5 years. A mask tests whether each row (each dog) is older than five years and returns a series of true/false values. – blacksite Jun 02 '17 at 13:16
@not_a_robot - Thank you very much for the comment. – jezrael Jun 02 '17 at 13:17
@ZakS - You need to specify the column after groupby: `test_filtered1 = test_filtered[test_filtered.groupby('disposition')['disposition'].transform('size')>100]` – jezrael Jul 08 '18 at 11:41
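
The "mask" idea described in the comments above can be sketched as follows. The dog names and ages are hypothetical and only follow the example given in the comment.

```python
import pandas as pd

# Hypothetical data for illustration only.
dogs = pd.DataFrame({
    "name": ["Rex", "Bella", "Max", "Luna"],
    "age": [7, 3, 10, 5],
})

# The mask is a boolean Series with one True/False entry per row of dogs.
mask = dogs["age"] > 5

# Indexing with the mask keeps only the rows where it is True.
older_dogs = dogs[mask]
print(mask.tolist())
print(older_dogs)
```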

score 0 · Answer 2 · answered Nov 28 '19 at 11:14

You can simply do the following:

col = 'column_name'   # name of the column whose values are counted
n = 10                # minimum number of occurrences required

df = df[df.groupby(col)[col].transform('count').ge(n)]

This should filter the dataframe as you need.

score 0 · Answer 3 · edited Apr 22 '20 at 20:28

I had the same challenge and used:

df['user_id'].value_counts()[df['user_id'].value_counts() > 1]

Credits: blog.softhints
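
Note that the expression in this answer returns the counts per user_id, not the matching rows of df. One possible way to get from those counts back to the rows, sketched here on made-up data (the values are assumptions for illustration), is to select the frequent ids and filter with `isin`:

```python
import pandas as pd

# Illustrative data only; these values are assumptions for the example.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "string": ["a", "b", "c", "d", "e", "f"],
})

# Compute the counts once and keep the ids that occur more than once.
counts = df["user_id"].value_counts()
frequent_ids = counts[counts > 1].index

# isin() builds a row-aligned boolean mask from the list of frequent ids.
filtered = df[df["user_id"].isin(frequent_ids)]
print(filtered)
```

Storing the counts in a variable also avoids calling `value_counts()` twice.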

edited Apr 22 '20 at 20:28

David Buck

answered Apr 22 '20 at 20:23

afrologicinsect

score -1 · Answer 4 · answered Aug 09 '19 at 12:25

l2 = (
    df.val1.loc[df.val == 'Best'].value_counts().sort_index()
    / df.val1.loc[df.val.isin(l11)].value_counts().sort_index()
).loc[lambda x: x > 0.5].index.tolist()

Aaka sh
