Looking for a way to remove subsets from columns in dataframe

Question

I have a dataframe that is formatted the following way -

'''
ids                        size
[A, B, C, D, E, F]         100
[C, D, E]                  50
[C, D, E, F, G]            200
[D, E, F, G, H]            190
[E, F, G, H]               100
[K, L, M, N]               200
'''

This dataframe has thousands of rows and numerous ID combinations. Dealing with lists is a bit of a pain. I am able to remove the [C, D, E] entry using issubset.
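
For reference, the subset check mentioned above might look roughly like the sketch below. It rebuilds the sample frame, so `df`, `sets`, and `is_subset` are illustrative names rather than code from the original post. It uses `<` (the strict form of `set.issubset`) so that identical rows do not eliminate each other, and it compares every pair of rows, which may be slow for many thousands of rows.

```python
import pandas as pd

# Sample data from the table above (rebuilt here for illustration)
df = pd.DataFrame({
    "ids": [
        ["A", "B", "C", "D", "E", "F"],
        ["C", "D", "E"],
        ["C", "D", "E", "F", "G"],
        ["D", "E", "F", "G", "H"],
        ["E", "F", "G", "H"],
        ["K", "L", "M", "N"],
    ],
    "size": [100, 50, 200, 190, 100, 200],
})

# Convert each list to a set once so subset tests are cheap
sets = [set(ids) for ids in df["ids"]]

# Flag a row when its id set is a proper subset of any other row's id set
is_subset = [
    any(s < other for j, other in enumerate(sets) if j != i)
    for i, s in enumerate(sets)
]

# Keep only the rows that are not contained in another row
mask = pd.Series(is_subset, index=df.index)
filtered = df[~mask]
```

On the sample data this drops [C, D, E] and also [E, F, G, H], which is contained in [D, E, F, G, H].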

What I would like to do is keep the unique id groupings that have the largest size (in this case, C, D, E, F, G). Because the other entries share ids with the largest one, I am not interested in those. The only ones that should survive are C, D, E, F, G and K, L, M, N. Is there a way to handle this in Pandas?

score 0 · Answer 1 · answered Jun 12 '20 at 19:52

I'm not sure exactly what you want, but you can filter by some minimum size:

    minimumVal = 195
    # Keep only the rows whose size exceeds the threshold
    df = df[df['size'] > minimumVal]

greenPlant


Thanks... it isn't so much that I want to threshold on the value. I am looking for the maximum value for each combination of ids while removing the overlapping clusters. The key for me is to keep the maximum value for each unique set of ids that don't overlap each other. – Mike D Jun 13 '20 at 03:48
Maybe take a look at https://stackoverflow.com/questions/12497402/python-pandas-remove-duplicates-by-columns-a-keeping-the-row-with-the-highest I'm not sure what you mean by overlapping clusters. Could you add a clear example to the original post of what's going on with that? – greenPlant Jun 14 '20 at 08:34
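
One possible reading of the requirement in the comments is a greedy pass: visit rows from largest size to smallest and keep a row only if it shares no id with any row already kept. The sketch below assumes that reading. `keep_largest_disjoint` is a hypothetical helper, not a pandas function, and the frame is rebuilt from the sample in the question.

```python
import pandas as pd

# Sample data from the question (rebuilt here for illustration)
df = pd.DataFrame({
    "ids": [
        ["A", "B", "C", "D", "E", "F"],
        ["C", "D", "E"],
        ["C", "D", "E", "F", "G"],
        ["D", "E", "F", "G", "H"],
        ["E", "F", "G", "H"],
        ["K", "L", "M", "N"],
    ],
    "size": [100, 50, 200, 190, 100, 200],
})


def keep_largest_disjoint(frame):
    """Greedy selection: largest size first, skipping rows that overlap kept ones."""
    seen = set()   # every id that belongs to a row kept so far
    kept = []      # index labels of the rows to keep
    ordered = frame.sort_values("size", ascending=False, kind="stable")
    for idx, row in ordered.iterrows():
        ids = set(row["ids"])
        if seen.isdisjoint(ids):   # no id in common with any kept row
            kept.append(idx)
            seen |= ids
    # Return the kept rows in their original order
    return frame.loc[sorted(kept)]


result = keep_largest_disjoint(df)
```

On the sample data this keeps [C, D, E, F, G] and [K, L, M, N], the two groups the question expects. A greedy pass does not necessarily maximize the total size kept, so it may not suit every dataset.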
