I am trying to build a network where each edge connects two tuples. I want to group all related elements, but based only on a single element of each tuple.
Similar to: Grouping all connected nodes of a dataset
Note: pandas 0.23.4
Given the following dataframe:
col1 col2 col1Name col2Name
'A' 'B' '12345' '78911'
'C' 'B' '12345' '78911'
'J' 'K' '12345' '12345'
'E' 'D' '12345' '12345'
'X' 'B' '99999' '99999'
I am combining col1 and col1Name into a tuple, and doing the same with col2/col2Name.
col1 col2
('A','12345') ('B','78911')
('C','12345') ('B','78911')
('J','12345') ('K','12345')
('E','12345') ('D','12345')
('X','99999') ('B','99999')
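For reference, a minimal sketch of how the tuple columns above could be built. It assumes the frame is called df and that the source data contains a fifth row ('X', 'B', '99999', '99999'), which is inferred from the tuple table; the name pairs is only illustrative.

```python
import pandas as pd

# Sample data; the last row is assumed from the tuple table above
df = pd.DataFrame({
    "col1": ["A", "C", "J", "E", "X"],
    "col2": ["B", "B", "K", "D", "B"],
    "col1Name": ["12345", "12345", "12345", "12345", "99999"],
    "col2Name": ["78911", "78911", "12345", "12345", "99999"],
})

# Pair each element with its name so that equal letters with
# different names become distinct values
pairs = pd.DataFrame({
    "col1": list(zip(df["col1"], df["col1Name"])),
    "col2": list(zip(df["col2"], df["col2Name"])),
})
```

Each cell of pairs is then a single hashable tuple, which can later be used as a graph node.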
From here I want to find all 'related' rows, where relatedness is determined only by the first element of each tuple; the second element should not create connections by itself.
So if I were to group the information it would look like the following:
col1 col2
('A','12345') ('B','78911')
('C','12345') ('B','78911')
col1 col2
('J','12345') ('K','12345')
col1 col2
('E','12345') ('D','12345')
col1 col2
('X','99999') ('B','99999')
Notice the groupings don't take col1Name/col2Name into account whatsoever. That information only exists to give the elements in col1/col2 more 'uniqueness'. Also worth mentioning: it is possible to have multiple A, B, C, etc. In my example, ('B','78911') is not the same as ('B','99999').
My thinking (from reference link):
import networkx as nx

G = nx.Graph()
G.add_edges_from(df.values.tolist())  # one edge per row of df
cc = list(nx.connected_components(G))
component = next(i for i in cc if 'A' in i)  # the component that contains 'A'
test = df[df.isin(component).all(axis=1)]
This returns all of the groupings related to 'A', but also the groupings related to '12345' and '78911'. I only want to group on the first element.
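To make the target concrete, here is a rough sketch of the grouping I am after. It is not a confirmed solution: it assumes the whole tuples are used as graph nodes (so '12345' never becomes a node in its own right), and the names pairs, node_to_group and groups are made up for illustration.

```python
import networkx as nx
import pandas as pd

# Tuple columns as shown earlier (sample data from the question)
pairs = pd.DataFrame({
    "col1": [("A", "12345"), ("C", "12345"), ("J", "12345"),
             ("E", "12345"), ("X", "99999")],
    "col2": [("B", "78911"), ("B", "78911"), ("K", "12345"),
             ("D", "12345"), ("B", "99999")],
})

G = nx.Graph()
# zip keeps each tuple whole, so a node is ('A', '12345'),
# never the bare strings 'A' or '12345'
G.add_edges_from(zip(pairs["col1"], pairs["col2"]))

# Map every node to the index of its connected component
node_to_group = {
    node: idx
    for idx, comp in enumerate(nx.connected_components(G))
    for node in comp
}

# Both endpoints of a row share a component, so col1 is enough to label the row
pairs["group"] = [node_to_group[node] for node in pairs["col1"]]

# One sub-frame per connected component
groups = [g.drop(columns="group") for _, g in pairs.groupby("group")]
```

With the sample data this keeps ('B','78911') and ('B','99999') in separate groups, because they are different nodes.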