
I have a pandas dataframe:

df2 = pd.DataFrame({'c':[1,1,1,2,2,2,2,3],
                    'type':['m','n','o','m','m','n','n', 'p']})

And I would like to find which values of c have more than one unique type, and for those return the c value, the number of unique types, and all the unique types concatenated into one string.

I have used these two questions to get this far:

  - pandas add column to groupby dataframe
  - Python Pandas: concatenate rows with unique values

df2['Unique counts'] = df2.groupby('c')['type'].transform('nunique')

df2[df2['Unique counts'] > 1].groupby(['c', 'Unique counts']).\
                                  agg(lambda x: '-'.join(x))

Out[226]: 
                    type
c Unique counts         
1 3                m-n-o
2 2              m-m-n-n

This works, but I cannot get the unique values (for example, in the second row I would like to have only one m and one n). My questions are the following:

  1. Can I skip the in-between step of creating the 'Unique counts' column and use something temporary instead?
  2. How can I filter for only unique values in the second step?
skillsmuggler
User2321

3 Answers


A solution that removes the single-type rows first and then aggregates: create a helper Series s, and use a set to deduplicate the strings:

s = df2.groupby('c')['type'].transform('nunique').rename('Unique counts')
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(set(x)))
print(a)

                  type
c Unique counts       
1 3              o-m-n
2 2                m-n
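Note that `set` iteration order is arbitrary, so the joined string may come out in a different order on each run (here `o-m-n` rather than `m-n-o`). A minimal sketch of the same approach with `sorted()` added for deterministic output:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

s = df2.groupby('c')['type'].transform('nunique').rename('Unique counts')
# sorted() fixes the join order; a bare set would not guarantee it
a = df2[s > 1].groupby(['c', s]).agg(lambda x: '-'.join(sorted(set(x))))
print(a)
```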

Another idea is to remove duplicates first with DataFrame.duplicated:

df3 = df2[df2.duplicated(['c'], keep=False) & ~df2.duplicated(['c', 'type'])]
print(df3)

   c type
0  1    m
1  1    n
2  1    o
3  2    m
5  2    n

Then aggregate the counts and join the values:

a = df3.groupby('c')['type'].agg([('Unique Counts', 'size'), ('Type', '-'.join)])
print(a)
   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
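The two steps above can also be chained, skipping the intermediate df3; a sketch of the same logic as a single expression:

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# keep rows whose c appears more than once, drop repeated (c, type) pairs,
# then count the remaining rows per group and join the types in row order
a = (df2[df2.duplicated(['c'], keep=False) & ~df2.duplicated(['c', 'type'])]
        .groupby('c')['type']
        .agg([('Unique Counts', 'size'), ('Type', '-'.join)]))
print(a)
```

Because duplicates are removed before joining, the row order (and thus `m-n-o`) is preserved without needing a set.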

Or, if you need all values, aggregate first:

df4 = df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'),
                                    ('Type', lambda x: '-'.join(set(x)))])
print(df4)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
3              1      p

And finally, remove the single-type rows with boolean indexing:

df5 = df4[df4['Unique Counts'] > 1]
print(df5)
   Unique Counts   Type
c                      
1              3  o-m-n
2              2    m-n
jezrael
  • Thank you very much, this works! I am new to Python and come from R, so I have the following questions: is it possible to skip the in-between assignment to s, and how does set work? – User2321 May 24 '19 at 09:20
  • @User2321 - I think yes, if you use one of the other solutions. – jezrael May 24 '19 at 09:23
  • @User2321 - set is an unordered collection of items. For more information you can check [this](https://www.programiz.com/python-programming/set) – jezrael May 24 '19 at 09:38
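As the comment says, a `set` deduplicates but does not keep order; a quick standalone illustration:

```python
# set() removes duplicates; its iteration order is arbitrary
types = ['m', 'm', 'n', 'n']
print(set(types))          # {'m', 'n'} in some order
print(sorted(set(types)))  # ['m', 'n'], deterministic
```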

Use DataFrame.groupby.agg and pass tuples of (column name, function):

df2.groupby('c')['type'].agg([('Unique Counts', 'nunique'),
                              ('Type', lambda x: '-'.join(x.unique()))])

[out]

   Unique Counts   Type
c                      
1              3  m-n-o
2              2    m-n
3              1      p
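Unlike a set, `Series.unique` preserves first-appearance order, so `m-n-o` is stable here. A sketch that also drops the single-type groups in the same chain, using `query` (the backticks handle the space in the column name):

```python
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

out = (df2.groupby('c')['type']
          .agg([('Unique Counts', 'nunique'),
                ('Type', lambda x: '-'.join(x.unique()))])
          .query('`Unique Counts` > 1'))  # backticks: column name has a space
print(out)
```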
Chris Adams

Use groupby.agg and then filter on the Unique counts column as you want:

import numpy as np

df2 = (df2.groupby('c', as_index=False)
          .agg({'type': ['nunique', lambda x: '-'.join(np.unique(x))]}))
df2.columns = ['c', 'Unique counts', 'type']

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
2  3              1      p

Filtering on Unique counts:

df2 = df2.loc[df2['Unique counts'] > 1, :]

print(df2)
   c  Unique counts   type
0  1              3  m-n-o
1  2              2    m-n
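Note that `np.unique` returns the values sorted, which happens to coincide with appearance order in this data. With pandas >= 0.25, named aggregation can replace the manual column rename; a sketch of the same idea:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2, 3],
                    'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n', 'p']})

# named aggregation; ** unpacking allows the space in 'Unique counts'
out = (df2.groupby('c', as_index=False)
          .agg(**{'Unique counts': ('type', 'nunique'),
                  'type': ('type', lambda x: '-'.join(np.unique(x)))}))
out = out[out['Unique counts'] > 1]  # keep only multi-type groups
print(out)
```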
Space Impact