Complete recursive transitive dependencies per row in pandas dataframe

Question

Here is a simple program that transitively maps sets of columns:

import pandas as pd

df1vals = [{'c1': '1', 'c2': "2"}]
df1 = pd.DataFrame(df1vals, columns = ['c1' , 'c2'])

df2vals = [{'c2': '2', 'c3': "100"}]
df2 = pd.DataFrame(df2vals, columns = ['c2' , 'c3'])


df3vals = [{'c3': '100', 'c4': "x"}]
df3 = pd.DataFrame(df3vals, columns = ['c3' , 'c4'])

df4vals = [{'c1': '1', 'c4': "m"}]
df4 = pd.DataFrame(df4vals, columns = ['c1' , 'c4'])

df5vals = [{'c2': '2', 'c4': "k"}]
df5 = pd.DataFrame(df5vals, columns = ['c2' , 'c4'])

dfs = [df1,df2, df3, df4, df5]

merged_df = dfs[0]
for df in dfs[1:]:
    common_cols = list(set(merged_df.columns) & set(df.columns))
    merged_df = pd.merge(merged_df, df, on=common_cols, how='outer')  

display(merged_df)

This works to produce this output:

    c1  c2  c3  c4
0   1   2   100 x
1   1   NaN NaN m
2   NaN 2   NaN k

This data is all good, but it's missing the fact that c1 1 is associated with c4 k through c2 2. So, I would also want this row:

1 2 NaN k

I would think that the same logic that maps 1 -> 2 -> 100 -> x would also work on this, but it does not. Why?

@mozway My ultimate goal is to always be able to answer the question "what other values of what other columns is c1 associated with?" but the even better version would be able to ask for any value of any column rather than c1. — Jeremy, Jul 05 '23 at 20:10

score 0 · Answer 1 · answered Jul 05 '23 at 21:51

I'm not completely sure, but I think my problem was that I was merging on None (e.g. all matching columns) when in fact I only wanted to merge if the left-most column matched. Therefore the right line is:

merged_df = pd.merge(merged_df, df, on=df.columns[0], how='outer')

mozway · Answer 2 · 2023-07-05T23:33:25.917

It looks to me like you need to compute a graph of the connected cells to find the connected components.

Let's use networkx for that. Here all cells are connected together:

from itertools import combinations
import networkx as nx

tmp = pd.concat(dfs)

G = nx.from_edgelist([((a, c), (b, d))
 for a, b in combinations(tmp, 2)
 for c, d in zip(tmp[a], tmp[b])
 if pd.notna(c) and pd.notna(d)
 ])

out = list(nx.connected_components(G))

Output:

[{('c4', 'k'), ('c4', 'x'), ('c1', '1'), ('c3', '100'), ('c4', 'm'), ('c2', '2')}]

Graph:

Now for the demo, let's change the value of c2 to 7 in the first dataframe. The output now shows two groups:

[{('c4', 'm'), ('c2', '7'), ('c1', '1')},
 {('c4', 'x'), ('c4', 'k'), ('c2', '2'), ('c3', '100')}]

Graph:

Complete recursive transitive dependencies per row in pandas dataframe

2 Answers2