
I have a pandas dataframe:

df2 = pd.DataFrame({'c':[1,1,1,2,2,2,2,3],
                    'type':['m','n','o','m','m','n','n', 'p']})

And I would like to find which values of c have more than one unique type, and for those return the c value, the number of unique types, and all the unique types concatenated into one string.

I have used these two questions to get this far:

  - pandas add column to groupby dataframe
  - Python Pandas: concatenate rows with unique values

df2['Unique counts'] = df2.groupby('c')['type'].transform('nunique')

df2[df2['Unique counts'] > 1].groupby(['c', 'Unique counts']).\
                                  agg(lambda x: '-'.join(x))

Out[226]: 
                    type
c Unique counts         
1 3                m-n-o
2 2              m-m-n-n

This works, but I cannot get the unique values (for example, in the second row I would like to have only one m and one n). My questions are the following:

  1. Can I skip the in-between step of creating the 'Unique counts' column and use something temporary instead?
  2. How can I filter for only unique values in the second step?
skillsmuggler
User2321

3 Answers


A solution that removes the single-type rows first and then aggregates: create a helper Series s, and use a set to deduplicate the strings:

s = df2.groupby('c')['type'].transform('nunique').rename('Unique counts')
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(set(x)))
print(a)

                  type
c Unique counts       
1 3              o-m-n
2 2                m-n
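Note that `set` iteration order is arbitrary, so the joined string may come out in a different order on each run (here `o-m-n` rather than `m-n-o`). A minimal sketch of the same approach with `sorted()` added for deterministic output:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

s = df2.groupby('c')['type'].transform('nunique').rename('Unique counts')
# sorted() fixes the join order; a bare set would not guarantee it
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(sorted(set(x))))
print(a)
```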

Another idea is to remove duplicates first with DataFrame.duplicated:

df3 = df2[df2.duplicated(['c'], keep=False) & ~df2.duplicated(['c', 'type'])]
print(df3)

   c type
0  1    m
1  1    n
2  1    o
3  2    m
5  2    n

Then aggregate the counts and join the values:

a = df3.groupby('c')['type'].agg([('Unique Counts', 'size'), ('Type', '-'.join)])
print(a)
   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
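The two steps above can also be chained, skipping the intermediate df3; a sketch of the same logic as a single expression:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# keep rows whose c appears more than once, drop repeated (c, type) pairs,
# then count the remaining rows per group and join the types in row order
a = (df2[df2.duplicated(['c'], keep=False) & ~df2.duplicated(['c', 'type'])]
        .groupby('c')['type']
        .agg([('Unique Counts', 'size'), ('Type', '-'.join)]))
print(a)
```

Because duplicates are removed before joining, the row order (and thus `m-n-o`) is preserved without needing a set.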

Or, if you need all values, aggregate first:

df4 = df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'),
                                    ('Type', lambda x: '-'.join(set(x)))])
print(df4)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
3              1      p

And finally, remove the single-type rows with boolean indexing:

df5 = df4[df4['Unique Counts'] > 1]
print(df5)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
jezrael
  • Thank you very much, this works! I am new to Python and come from R, so I have the following questions: is it possible to skip the in-between assignment to s, and how does set work? – User2321 May 24 '19 at 09:20
  • @User2321 - I think yes, if you use one of the other solutions. – jezrael May 24 '19 at 09:23
  • @User2321 - set is an unordered collection of items. For more information you can check [this](https://www.programiz.com/python-programming/set) – jezrael May 24 '19 at 09:38
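As the comment says, a `set` deduplicates but does not keep order; a quick standalone illustration:

```python
# set() removes duplicates; its iteration order is arbitrary
types = ['m', 'm', 'n', 'n']
print(set(types))          # {'m', 'n'} in some order
print(sorted(set(types)))  # ['m', 'n'], deterministic
```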

Use DataFrame.groupby.agg and pass tuples of (column name, function):

df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'),
                              ('Type', lambda x: '-'.join(x.unique()))])

[out]

   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
3              1      p
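Unlike a set, `Series.unique` preserves first-appearance order, so `m-n-o` is stable here. A sketch that also drops the single-type groups in the same chain, using `query` (the backticks handle the space in the column name):

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

out = (df2.groupby('c')['type']
          .agg([('Unique Counts', 'nunique'),
                ('Type', lambda x: '-'.join(x.unique()))])
          .query('`Unique Counts` > 1'))  # backticks: column name has a space
print(out)
```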
Chris Adams

Use groupby.agg and then filter on the Unique counts column as you want:

import numpy as np

df2 = (df2.groupby('c', as_index=False)
          .agg({'type': ['nunique', lambda x: '-'.join(np.unique(x))]}))
df2.columns = ['c', 'Unique counts', 'type']

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
2  3              1      p

Filtering on Unique counts:

df2 = df2.loc[df2['Unique counts'] > 1, :]

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
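Note that `np.unique` returns the values sorted, which happens to coincide with appearance order in this data. With pandas >= 0.25, named aggregation can replace the manual column rename; a sketch of the same idea:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# named aggregation; ** unpacking allows the space in 'Unique counts'
out = (df2.groupby('c', as_index=False)
          .agg(**{'Unique counts': ('type', 'nunique'),
                  'type': ('type', lambda x: '-'.join(np.unique(x)))}))
out = out[out['Unique counts'] > 1]  # keep only multi-type groups
print(out)
```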
Space Impact