0

Here is a simple program that transitively maps sets of columns:

import pandas as pd

df1vals = [{'c1': '1', 'c2': "2"}]
df1 = pd.DataFrame(df1vals, columns = ['c1' , 'c2'])

df2vals = [{'c2': '2', 'c3': "100"}]
df2 = pd.DataFrame(df2vals, columns = ['c2' , 'c3'])


df3vals = [{'c3': '100', 'c4': "x"}]
df3 = pd.DataFrame(df3vals, columns = ['c3' , 'c4'])

df4vals = [{'c1': '1', 'c4': "m"}]
df4 = pd.DataFrame(df4vals, columns = ['c1' , 'c4'])

df5vals = [{'c2': '2', 'c4': "k"}]
df5 = pd.DataFrame(df5vals, columns = ['c2' , 'c4'])

dfs = [df1,df2, df3, df4, df5]

merged_df = dfs[0]
for df in dfs[1:]:
    common_cols = list(set(merged_df.columns) & set(df.columns))
    merged_df = pd.merge(merged_df, df, on=common_cols, how='outer')  

display(merged_df)

This works to produce this output:

    c1  c2  c3  c4
0   1   2   100 x
1   1   NaN NaN m
2   NaN 2   NaN k

This data is all good, but it's missing the fact that c1 1 is associated with c4 k through c2 2. So, I would also want this row:

1 2 NaN k 

I would think that the same logic that maps 1 -> 2 -> 100 -> x would also work on this, but it does not. Why?

Jeremy
  • 5,365
  • 14
  • 51
  • 80
  • 1
    What is your ultimate goal? To identify the linked cells? – mozway Jul 05 '23 at 18:37
  • Why not a row for 1/2/100/NaN or 1/2/NaN/m? – mozway Jul 05 '23 at 18:44
  • @mozway My ultimate goal is to always be able to answer the question "what other values of what other columns is c1 associated with?" but the even better version would be able to ask for any value of any column rather than c1. – Jeremy Jul 05 '23 at 20:10

2 Answers2

0

I'm not completely sure, but I think my problem was that I was merging on None (e.g. all matching columns) when in fact I only wanted to merge if the left-most column matched. Therefore the right line is:

merged_df = pd.merge(merged_df, df, on=df.columns[0], how='outer')
Jeremy
  • 5,365
  • 14
  • 51
  • 80
0

It looks to me like you need to compute a graph of the connected cells to find the connected components.

Let's use networkx for that. Here all cells are connected together:

from itertools import combinations
import networkx as nx

tmp = pd.concat(dfs)

G = nx.from_edgelist([((a, c), (b, d))
 for a, b in combinations(tmp, 2)
 for c, d in zip(tmp[a], tmp[b])
 if pd.notna(c) and pd.notna(d)
 ])

out = list(nx.connected_components(G))

Output:

[{('c4', 'k'), ('c4', 'x'), ('c1', '1'), ('c3', '100'), ('c4', 'm'), ('c2', '2')}]

Graph:

enter image description here

Now for the demo, let's change the value of c2 to 7 in the first dataframe. The output now shows two groups:

[{('c4', 'm'), ('c2', '7'), ('c1', '1')},
 {('c4', 'x'), ('c4', 'k'), ('c2', '2'), ('c3', '100')}]

Graph:

enter image description here

mozway
  • 194,879
  • 13
  • 39
  • 75