I have a dataset:
Name Subset Type System
A00 9-IU00-A OP A
A00 IT00 PP A
B01 IT-01A PP B
B01 IU OP B
B03 IM-09-B LP A
B03 IM03A OP B
B03 IT-09 OP A
D09 IT-A09 OP B
D09 07IM-09A LP B
D09 IM OP A
I need to group the Name values whose Subset and Type combinations are the same.
Only the first alphabetical part of the Subset column should be considered and the rest ignored; e.g. IM-09-B and IM03A can both be treated as IM.
Subset clusters are the first alphabetical string, without any digits, hyphens, etc., extracted using:
df['Subset'].str.extractall(r'[^a-zA-Z]*([a-zA-Z]+)[^,]*').groupby(level=0)[0].agg(','.join)
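As a quick check of the extraction rule, here is a minimal sketch on a few Subset values from the dataset. It uses `Series.str.extract` with a simpler pattern than the `extractall` expression above; this assumes each Subset cell holds a single value rather than a comma-separated list. The variable names `subset` and `cluster` are only illustrative.

```python
import pandas as pd

# A few Subset values taken from the dataset above
subset = pd.Series(["9-IU00-A", "IT00", "IM-09-B", "IM03A", "07IM-09A", "IM"])

# str.extract returns the first match only, so this keeps the first
# run of ASCII letters and ignores digits, hyphens and anything after it.
cluster = subset.str.extract(r"([A-Za-z]+)", expand=False)

print(cluster.tolist())  # ['IU', 'IT', 'IM', 'IM', 'IM', 'IM']
```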
Output needed:
Subset Cluster   Type Cluster   Name      System
IU,IT            OP,PP          A00,B01   A,A,B,B
IM,IM,IT         LP,OP,OP       B03,D09   A,B,A,B,B,A
Here the first cluster is formed because IU is OP and IT is PP for both names (A00 and B01); the second cluster is formed in the same way. How can sequential pattern mining be used here?
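To make the intended grouping concrete, here is a rough sketch that reproduces the output with a plain groupby. It is not sequential pattern mining, only a baseline. It assumes two names belong to the same cluster when they have exactly the same (Subset cluster, Type) pairs, ignoring row order. The `SubsetCluster` column and the `clusters` dictionary are names made up for this example.

```python
import pandas as pd

rows = [
    ("A00", "9-IU00-A", "OP", "A"),
    ("A00", "IT00", "PP", "A"),
    ("B01", "IT-01A", "PP", "B"),
    ("B01", "IU", "OP", "B"),
    ("B03", "IM-09-B", "LP", "A"),
    ("B03", "IM03A", "OP", "B"),
    ("B03", "IT-09", "OP", "A"),
    ("D09", "IT-A09", "OP", "B"),
    ("D09", "07IM-09A", "LP", "B"),
    ("D09", "IM", "OP", "A"),
]
df = pd.DataFrame(rows, columns=["Name", "Subset", "Type", "System"])

# First alphabetical run of each Subset value (illustrative column name)
df["SubsetCluster"] = df["Subset"].str.extract(r"([A-Za-z]+)", expand=False)

# Assumption: a Name is described by its sorted (SubsetCluster, Type) pairs,
# and Names with identical pairs fall into the same cluster.
clusters = {}
for name, group in df.groupby("Name"):
    signature = tuple(sorted(zip(group["SubsetCluster"], group["Type"])))
    clusters.setdefault(signature, []).append(name)

for signature, names in clusters.items():
    systems = df.loc[df["Name"].isin(names), "System"].tolist()
    subsets = ",".join(s for s, _ in signature)
    types = ",".join(t for _, t in signature)
    print(subsets, types, ",".join(names), ",".join(systems))
```

Because the pairs are sorted, the first cluster prints as IT,IU / PP,OP rather than IU,IT / OP,PP, but the Name and System columns match the output needed.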