Filter rows with more than 1 value in a set and count their occurrence pandas python

Question

Let's assume, I have the following data frame.

Id   Combinations
1      (A,B)
2      (C,)
3      (A,D)
4      (D,E,F)
5      (F)

I would like to filter out Combination column values with more than value in a set. Something like below. AND I would like count the number of occurrence as whole in Combination column. For example, ID number 2 and 5 should be removed since their value in a set is only 1.

The result I am looking for is:

ID     Combination     Frequency
1        A                2               
1        B                1
3        A                2
3        D                2
4        D                2
4        E                1
4        F                2

Can anyone help to get the above result in Python pandas?

What did you try until now ? Did you take a look at https://stackoverflow.com/questions/27263805/pandas-column-of-lists-create-a-row-for-each-list-element ? — SpaceBurger, Mar 03 '21 at 08:51

jezrael · Accepted Answer · 2021-03-03T10:05:15.640

2

First if necessary convert values to lists:

df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')

If need count after filtering only one values by Series.str.len in boolean indexing, then use DataFrame.explode and count values by Series.map with Series.value_counts:

df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print (df1)
   Id Combinations  Frequency
0   1            A          2
0   1            B          1
2   3            A          2
2   3            D          2
3   4            D          2
3   4            E          1
3   4            F          1

Or if need count before removing them filter them by Series.duplicated in last step:

df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())

df2 = df2[df2['Id'].duplicated(keep=False)]

Alternative:

df2 = df2[df2.groupby('Id').Id.transform('size') > 1]

Or:

df2 = df2[df2['Id'].map(df2['Id'].value_counts() > 1]

print (df2)
   Id Combinations  Frequency
0   1            A          2
0   1            B          1
2   3            A          2
2   3            D          2
3   4            D          2
3   4            E          1
3   4            F          2

edited Mar 03 '21 at 10:05

answered Mar 03 '21 at 08:52

jezrael

822,522
95
1,334
1,252

the solution that I am looking for is exactly your second solution. But when I tried it, I am getting empty data frame. Can you check it once again, please? – Hiwot Mar 03 '21 at 09:24
@Hiwot - What is `print (df['Combinations'].head().tolist())` of your input data? – jezrael Mar 03 '21 at 09:31
Yes, I can see the result--> ```['(A,B)', '(C,)', '(A,D)', '(D,E,F)', '(F,)']``` – Hiwot Mar 03 '21 at 09:35
@Hiwot - So you use `df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')` in first step? – jezrael Mar 03 '21 at 09:36
I used ```df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')``` then put the second solution after that. I got this error: ```ValueError: keep must be either "first", "last" or False``` To solve this, I added ```df.drop_duplicates(['Id'],keep='first')``` but same. – Hiwot Mar 03 '21 at 09:43
I solved it by replacing ```df = df[df['Id'].duplicated(keep=False)]``` with ```df = df[df_sample.groupby('Id').Id.transform(len) > 1]``` – Hiwot Mar 03 '21 at 09:56
@Hiwot - super, or fatser alternative, very similar is `df = df[df.groupby('Id').Id.transform('size') > 1] ` – jezrael Mar 03 '21 at 09:58
@Hiwot - I rename solutions for `df1` and `df2`. – jezrael Mar 03 '21 at 10:05

Filter rows with more than 1 value in a set and count their occurrence pandas python

1 Answers1