Python: Compare two Pandas DataFrames and get indices of differences

Question

I would like to compare two Pandas DataFrames and get the indices of the differences.

import numpy as np
import pandas as pd

rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']

df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)

df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

df1.equals(df2)  # OK, good to know, but where is the difference?
df1 == df2  # Nice, too. But I'm interested in the indices!

# I need a list containing [(2,2), (3,1)]. Even more intuitive would be something like [('2019-03-06', 'C'), ('2019-03-07', 'B')]

EDIT: I don't necessarily need a list, but something to identify the indices. That is, if there is a simple and intuitive way to solve that issue without a list, that's fine. However, a list will also be OK.

score 2 · Answer 1 · answered Sep 24 '19 at 18:41

Does this work?

np.array(np.nonzero(df1.ne(df2).values)).transpose()

Output:

array([[2, 2],
       [3, 1]], dtype=int64)

Another way:

df1.mask(df1.eq(df2)).stack().index.values

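Output (as posted; with the question's date index and named columns, the tuples hold the index and column labels rather than integer positions):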

array([(2, 2), (3, 1)], dtype=object)
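For reference, here is a minimal sketch of a label-based variant on the question's frames. It stacks the boolean mask from `df1.ne(df2)` instead of the masked values, which is my own variation and not part of the answer above; the name `labels` is only illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']

df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = df1.copy()
df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

# Boolean frame: True wherever the two frames differ
mask = df1.ne(df2)

# Stack into a Series indexed by (row label, column label),
# then keep only the entries that are True
stacked = mask.stack()
labels = stacked[stacked].index.tolist()

print(labels)
```

On these frames, `labels` should contain the pairs for 2019-03-06 / 'C' and 2019-03-07 / 'B', with the dates as `Timestamp` objects.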

answered Sep 24 '19 at 18:41

Quang Hoang

BTW, could you please explain the downvote? This answer does solve the question. – BENY Sep 24 '19 at 19:05
You could also use `np.transpose(np.nonzero(df1.values != df2.values))` to get the same result. – Andi Sep 24 '19 at 20:48
Actually, the second method answers your question better, but is slower. – Quang Hoang Sep 24 '19 at 20:50

stahamtan · Accepted Answer · 2019-09-24T19:05:28.333

2

I think you can just use np.where, like below:

r, c = np.where(df1 != df2)
list(zip(r,c))

which returns

[(2, 2), (3, 1)]

Edit

The above will not work if the DataFrames have different index types; in that case, compare the underlying NumPy arrays instead:

r, c = np.where(df1.values != df2.values)
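To illustrate the edit, here is a small sketch with two frames whose indexes differ. The string index `'v'..'z'` and the name `positions` are made up for the example, and the shape check is my own addition. As far as I know, `df1 != df2` raises a `ValueError` for frames that are not identically labelled, whereas comparing the raw arrays is purely positional.

```python
import numpy as np
import pandas as pd

cols = ['A', 'B', 'C', 'D']

# df1 keeps the default integer index; df2 gets a hypothetical string index
df1 = pd.DataFrame(np.arange(20).reshape(5, 4), columns=cols)
df2 = pd.DataFrame(np.arange(20).reshape(5, 4), index=list('vwxyz'), columns=cols)
df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

# A positional comparison only makes sense for frames of equal shape
if df1.shape != df2.shape:
    raise ValueError("frames must have the same shape")

# Compare the underlying arrays, ignoring index and column labels
r, c = np.where(df1.to_numpy() != df2.to_numpy())
positions = [(int(i), int(j)) for i, j in zip(r, c)]

print(positions)  # [(2, 2), (3, 1)]
```

`to_numpy()` is used here in place of `.values`; both return the underlying array for this kind of frame.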

edited Sep 24 '19 at 19:05

answered Sep 24 '19 at 18:44

stahamtan


You are assuming the index runs from 0 to n; sometimes we have a different type of index, like ['a','b'.......'x'] – BENY Sep 24 '19 at 18:55
This already looks like a reasonable solution. Can we transform the result into a list of labeled indices (see my edited question)? Basically moving from `iloc` to `loc`. – Andi Sep 24 '19 at 19:28
You can retrieve index and column names `list(zip(df1.index[r], df1.columns[c]))` which returns `[(Timestamp('2019-03-06 00:00:00'), 'C'), (Timestamp('2019-03-07 00:00:00'), 'B')]` – stahamtan Sep 24 '19 at 19:52
You can also use `r, c = np.nonzero(df1.values != df2.values)`. According to the docs (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html), using np.nonzero is the preferred way. Anyway, I like that solution because from here I can easily retrieve the index and column names as @stahamtan pointed out correctly. – Andi Sep 24 '19 at 20:51
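
Putting the comments above together, a rough sketch of a helper that returns labelled positions could look like this. The function name `diff_labels` is hypothetical, and it assumes both frames have the same shape with rows and columns lined up by position.

```python
import numpy as np
import pandas as pd


def diff_labels(left, right):
    """Return (index label, column label) pairs where the values differ.

    Hypothetical helper: assumes both frames have the same shape and
    that rows and columns correspond by position.
    """
    r, c = np.nonzero(left.to_numpy() != right.to_numpy())
    # Translate integer positions (iloc-style) into labels (loc-style)
    return list(zip(left.index[r], left.columns[c]))


rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']

df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = df1.copy()
df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

result = diff_labels(df1, df2)
print(result)
```

Each returned pair can be passed straight to `df1.loc[row, col]` or `df2.loc[row, col]` to look up the differing values.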
