
I am trying to build a network where my edges consist of tuples. I am trying to group all related elements, but only based off of a single element in the tuple.

Similar to: Grouping all connected nodes of a dataset

Note: pandas 0.23.4

Given the following dataframe:

  col1     col2     col1Name       col2Name
  'A'       'B'      '12345'        '78911'
  'C'       'B'      '12345'        '78911'
  'J'       'K'      '12345'        '12345'
  'E'       'D'      '12345'        '12345'

I am combining col1 and col1Name into a tuple, and doing the same with col2/col2Name.

      col1                col2    
  ('A','12345')       ('B','78911')   
  ('C','12345')       ('B','78911') 
  ('J','12345')       ('K','12345')
  ('E','12345')       ('D','12345')
  ('X','99999')       ('B','99999') 
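For reference, a minimal sketch of how the tuple columns could be built from the four-column frame above. The frame name `df` comes from the code further down; the name `pairs` is made up for illustration. The `('X','99999')` row is not in the original frame, so it does not appear here.

```python
import pandas as pd

# The four-column frame from the question.
df = pd.DataFrame({
    "col1": ["A", "C", "J", "E"],
    "col2": ["B", "B", "K", "D"],
    "col1Name": ["12345", "12345", "12345", "12345"],
    "col2Name": ["78911", "78911", "12345", "12345"],
})

# Pair each element with its name, so that equal letters with
# different names become distinct values.
# `pairs` is a hypothetical name, not from the original post.
pairs = pd.DataFrame({
    "col1": list(zip(df["col1"], df["col1Name"])),
    "col2": list(zip(df["col2"], df["col2Name"])),
})
print(pairs)
```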

From here I am trying to find all 'related' information, based only on the first element of each tuple, not the second.

So if I were to group the information it would look like the following:

      col1                col2    
  ('A','12345')       ('B','78911')   
  ('C','12345')       ('B','78911') 
      col1                col2
  ('J','12345')       ('K','12345')
      col1                col2
  ('E','12345')       ('D','12345')
      col1                col2
  ('X','99999')       ('B','99999') 

Notice the groupings don't take into account col1Name/col2Name what-so-ever. That information only exists to give the elements in col1/col2 more 'uniqueness'. Also worth mentioning: the same letter (A, B, C, etc.) can appear more than once. In my example, ('B','78911') is not the same as ('B','99999').

My thinking (from reference link):

import networkx as nx

G = nx.Graph()
G.add_edges_from(df.values.tolist())
cc = list(nx.connected_components(G))
component = next(i for i in cc if ('A') in i)
test = df[df.isin(component).all(1)]

This returns all of the groupings related to 'A' but also the groupings related to '12345', '78911'. I am only attempting to group on the first element.

MaxB
  • I don't understand how you group your tuples. Why isn't the last row ('X','99999') ('B','99999') grouped together? – vlemaistre Jun 24 '19 at 12:29
  • @vlemaistre `('X','99999')` and `('B','99999')` are grouped together. However, `('B','99999')` is not in the same group as `('B','78911')` – MaxB Jun 24 '19 at 12:33

1 Answer


You wrote:

Notice the groupings don't take into account col1Name/col2Name what-so-ever.

and:

However, ('B','99999') is not in the same group as ('B','78911')

This is contradictory. How are they different if you "don't take into account col1Name/col2Name what-so-ever"? Also, the 'X' row is missing from your dataframe.

So what are you grouping together? You wrote:

('X','99999') and ('B','99999') are grouped together

But those are just two tuples in the same row. In your initial post you wrote that you are grouping by equal values in col1 or equal values in col2 across all rows. So which is it? And what does the data in df look like? I can't reproduce your example code. Try to explain more precisely what you want to do.

Given the contradictory and missing information, I guess you are trying to: "Group rows together that have equal values in col1 or col2."

If you keep your data only as tuples (as you wrote), you lose the row information, so I don't think that is what you meant.

Since you are describing a network with edges and (as you wrote) your col1Name and col2Name columns are to be ignored for the grouping, you have to pass the right tuples to connected_components(). It looks like this:

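import networkx as nx

# Edges built from the first tuple elements only (col1, col2)
l = [('A', 'B'), ('C', 'B'), ('J', 'K'), ('E', 'D'), ('X', 'B')]

G = nx.Graph()
G.add_edges_from(l)
cc = list(nx.connected_components(G))

# Pick the component that contains node 'A'
component = next(i for i in cc if 'A' in i)
# {'B', 'X', 'C', 'A'}

# Print every edge with at least one endpoint in that component
for x in l:
    if x[0] in component or x[1] in component:
        print(x)

# Output:
# ('A', 'B')
# ('C', 'B')
# ('X', 'B')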

The NetworkX connected_components() function puts all nodes that are linked through a chain of edges into one group, so any two tuples (edges) that share a value end up in the same group. If you want to use it, you have to give it the right data.
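If instead ('B','78911') and ('B','99999') should count as different nodes, one possible reading of the question is to use the whole tuple as the node. A rough sketch under that assumption: the frame `pairs` and the names `group_of` and `group` are made up for illustration, and the data is reconstructed from the tuple table in the question.

```python
import networkx as nx
import pandas as pd

# Tuple columns as shown in the question (reconstructed, an assumption).
pairs = pd.DataFrame({
    "col1": [("A", "12345"), ("C", "12345"), ("J", "12345"),
             ("E", "12345"), ("X", "99999")],
    "col2": [("B", "78911"), ("B", "78911"), ("K", "12345"),
             ("D", "12345"), ("B", "99999")],
})

# Each row is one edge and each node is a whole tuple, so
# ("B", "78911") and ("B", "99999") are different nodes.
G = nx.Graph()
G.add_edges_from(zip(pairs["col1"], pairs["col2"]))

# Map every node to the index of its connected component.
group_of = {}
for group_id, nodes in enumerate(nx.connected_components(G)):
    for node in nodes:
        group_of[node] = group_id

# Both endpoints of a row lie in the same component,
# so looking up col1 is enough to label the row.
pairs["group"] = [group_of[node] for node in pairs["col1"]]

for _, rows in pairs.groupby("group"):
    print(rows[["col1", "col2"]])
```

With this input the first two rows share a group and the remaining three rows each form their own, which matches the grouping sketched in the question.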

Jim Panse