0

Let's assume, I have the following data frame.

Id   Combinations
1      (A,B)
2      (C,)
3      (A,D)
4      (D,E,F)
5      (F)

I would like to filter out Combination column values with more than value in a set. Something like below. AND I would like count the number of occurrence as whole in Combination column. For example, ID number 2 and 5 should be removed since their value in a set is only 1.

The result I am looking for is:

ID     Combination     Frequency
1        A                2               
1        B                1
3        A                2
3        D                2
4        D                2
4        E                1
4        F                2

Can anyone help to get the above result in Python pandas?

Hiwot
  • 568
  • 5
  • 18
  • 1
    What did you try until now ? Did you take a look at https://stackoverflow.com/questions/27263805/pandas-column-of-lists-create-a-row-for-each-list-element ? – SpaceBurger Mar 03 '21 at 08:51

1 Answers1

2

First if necessary convert values to lists:

df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')

If need count after filtering only one values by Series.str.len in boolean indexing, then use DataFrame.explode and count values by Series.map with Series.value_counts:

df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print (df1)
   Id Combinations  Frequency
0   1            A          2
0   1            B          1
2   3            A          2
2   3            D          2
3   4            D          2
3   4            E          1
3   4            F          1

Or if need count before removing them filter them by Series.duplicated in last step:

df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())

df2 = df2[df2['Id'].duplicated(keep=False)]

Alternative:

df2 = df2[df2.groupby('Id').Id.transform('size') > 1] 

Or:

df2 = df2[df2['Id'].map(df2['Id'].value_counts() > 1]

print (df2)
   Id Combinations  Frequency
0   1            A          2
0   1            B          1
2   3            A          2
2   3            D          2
3   4            D          2
3   4            E          1
3   4            F          2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • the solution that I am looking for is exactly your second solution. But when I tried it, I am getting empty data frame. Can you check it once again, please? – Hiwot Mar 03 '21 at 09:24
  • @Hiwot - What is `print (df['Combinations'].head().tolist())` of your input data? – jezrael Mar 03 '21 at 09:31
  • Yes, I can see the result--> ```['(A,B)', '(C,)', '(A,D)', '(D,E,F)', '(F,)']``` – Hiwot Mar 03 '21 at 09:35
  • @Hiwot - So you use `df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')` in first step? – jezrael Mar 03 '21 at 09:36
  • I used ```df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')``` then put the second solution after that. I got this error: ```ValueError: keep must be either "first", "last" or False``` To solve this, I added ```df.drop_duplicates(['Id'],keep='first')``` but same. – Hiwot Mar 03 '21 at 09:43
  • I solved it by replacing ```df = df[df['Id'].duplicated(keep=False)]``` with ```df = df[df_sample.groupby('Id').Id.transform(len) > 1]``` – Hiwot Mar 03 '21 at 09:56
  • @Hiwot - super, or fatser alternative, very similar is `df = df[df.groupby('Id').Id.transform('size') > 1] ` – jezrael Mar 03 '21 at 09:58
  • @Hiwot - I rename solutions for `df1` and `df2`. – jezrael Mar 03 '21 at 10:05