2

I am trying to do a pairwise comparison for 100+ tables to find the rows in common by "ID". Is there a simple way to do such a pairwise comparison without going through a lot of loops?

The expected output is a matrix/heatmap showing the similarity between tables, with number of rows in common between the tables.

A sample of the tables I have is attached. They are all saved in the same directory at my machine.

Thank you,

Table 1 Table 2

Learner123
  • 87
  • 5

2 Answers2

1

If I understand correctly, you want the pairwise similarity of the ID columns only, regardless of whether the Value column has different values for that ID. Assuming your tables are represented as a list of dataframes called 'dfs':

m = np.zeros(shape=(len(dataframes), len(dataframes)))
for i, d in enumerate(dataframes):
    for j, d_ in enumerate(dataframes):
        commons = len(set(d.ID).intersection(set(d_.ID)))
        m[i][j] = commons
df = pd.DataFrame(m).applymap(int)
print(df)
SimonR
  • 1,774
  • 1
  • 4
  • 10
  • Thank you for your answer @SimonR. This worked like a charm for 2 dfs, but once I tried it with 3 dfs I got a traceback error on the liner inside the nested for loops " Traceback (most recent call last): File "", line 3, in IndexError: index 2 is out of bounds for axis 0 with size 2" – Learner123 May 19 '20 at 18:24
  • 1
    That's curious : I tested this with three dataframes and it worked fine. Shaping the numpy array using len(dfs) as in my answer should make sure you never get an IndexError. – SimonR May 19 '20 at 18:29
0

you can using numpy on this case I assume all data is dataframe like this:

id1 = ['A','B','C','D','E','F','G','H','I']
Value1 = np.random.random((len(id1),))
data1 = pd.DataFrame({'ID':id1,'Value':Value1})
id2 = ['O','P','A','C','R','T','U','L','M','B','E']
Value2 = np.random.random((len(id2),))
data2 = pd.DataFrame({'ID':id2,'Value':Value2})

and we will take value from data1, using numpy.in1d like this:

result1 = data1[np.in1d(data1['ID'],data2['ID'])]

or we will take value from data2, like this:

result2 = data2[np.in1d(data2['ID'],data1['ID'])]

and the result:

>>> result1
  ID     Value
0  A  0.213569
1  B  0.483985
2  C  0.888146
4  E  0.812130
>>> result2
   ID     Value
2   A  0.875728
3   C  0.690587
9   B  0.793642
10  E  0.679874
>>> 
Wahyu Hadinoto
  • 198
  • 1
  • 10