0

I have a dataset

Name    Subset    Type    System
A00     9-IU00-A  OP      A
A00     IT00      PP      A
B01     IT-01A    PP      B
B01     IU        OP      B
B03     IM-09-B   LP      A
B03     IM03A     OP      B
B03     IT-09     OP      A
D09     IT-A09    OP      B
D09     07IM-09A  LP      B
D09     IM        OP      A

So here I need to group the Name column such that Subset and Type are similar. We have to only consider the first alphabetical part of the subset column and ignore rest. for eg IM-09-B, IM03A can be considered as IM. Subset clusters are first alphabetical string withput any digit, hyphen, etc. extracted using

df['Subset'].str.extractall(r'[^a-zA-Z]*([a-zA-Z]+)[^,]*').groupby(level=0)[0].agg(','.join)})

Output needed

Subset Cluster    Type Cluster    Name        System
IU,IT             OP,PP           A00,B01     A,A,B,B
IM,IM,IT          LP, OP, OP      B03, D09    A,B,A,B,B,A

Here the first cluster instance is formed because IU is OP and IT is PP in both cases, similar for the second instance. How can a sequential Pattern mining be used here.

spd
  • 334
  • 1
  • 12
  • 1
    Looks like you always ask a variation of the [`same`](https://stackoverflow.com/questions/71795377/data-grouping-with-pandas) [`question`](https://stackoverflow.com/questions/71577325/data-clustering-using-pandas), [again](https://stackoverflow.com/questions/71719509/grouping-dataframe-wrt-three-columns-in-pandas) and [again](https://stackoverflow.com/questions/71724160/getting-data-from-different-dataframe) you should probably be able to perform the grouping logic now on you own, no? – mozway Apr 17 '22 at 06:35
  • Was just looking for Sequential Pattern Mining method. – spd Apr 18 '22 at 08:44
  • Sure but that doesn't explain why you ask again the question instead of finding how you could combine the existing logic with your new requirements. – mozway Apr 18 '22 at 09:10
  • Okay sure. Thanks a lot. – spd Apr 18 '22 at 09:19

1 Answers1

1

We could sort_values first, then use a chained groupby, once to aggregate by name, then again by subset and type clusters:

out = (df.assign(Subset=df['Subset'].str.extractall(r'[^a-zA-Z]*([a-zA-Z]+)[^,]*')
                 .groupby(level=0)[0].agg(','.join))
       .sort_values(df.columns.tolist())
       .groupby('Name').agg(','.join).add_suffix(' Cluster')
       .reset_index()
       .groupby(['Subset Cluster', 'Type Cluster'], as_index=False).agg(','.join)
      )

Output:

  Subset Cluster Type Cluster     Name System Cluster
0       IM,IM,IT     LP,OP,OP  B03,D09    A,B,A,B,A,B
1          IT,IU        PP,OP  A00,B01        A,A,B,B