
I would like to compare two Pandas DataFrames and get the indices of the differences.

import numpy as np
import pandas as pd

rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']

df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)

df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

df1.equals(df2)  # OK, good to know, but where is the difference?
df1 == df2  # Nice, too. But I'm interested in the indices!

# I need a list containing [(2,2), (3,1)]. Even more intuitive would be something like [('2019-03-06', 'C'), ('2019-03-07', 'B')]

EDIT: I don't necessarily need a list, but something to identify the indices. That is, if there is a simple and intuitive way to solve that issue without a list, that's fine. However, a list will also be OK.

Andi

2 Answers


Does this work?

np.array(np.nonzero(df1.ne(df2).values)).transpose()

Output:

array([[2, 2],
       [3, 1]], dtype=int64)

Another way:

df1.mask(df1.eq(df2)).stack().index.values

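Output (with a default integer index and integer column labels; with the labeled frames from the question, the tuples contain the index and column labels instead):

array([(2, 2), (3, 1)], dtype=object)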
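
As a rough sketch of how these ideas behave on the frames from the question: the first part repeats the `np.nonzero` expression above, and the second part is a variant of the stacking idea (not the exact expression above) that stacks the boolean mask itself, so it does not depend on how `stack()` treats NaN cells. The names `diff`, `positions`, `stacked`, and `labels` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frames from the question.
rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = df1.copy()
df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

# Boolean frame: True wherever the two frames differ.
diff = df1.ne(df2)

# Positional (row, column) pairs, as in the first expression.
positions = np.array(np.nonzero(diff.values)).transpose()
print(positions.tolist())  # [[2, 2], [3, 1]]

# Stack the mask into a Series with a (row label, column label) MultiIndex
# and keep only the entries that are True.
stacked = diff.stack()
labels = stacked[stacked].index.tolist()
print(labels)
```

With the date index from the question, `labels` holds `(Timestamp, column name)` tuples rather than integer positions.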
Quang Hoang

I think you can just use np.where, as shown below:

r, c = np.where(df1 != df2)
list(zip(r,c))

Which returns

[(2, 2), (3, 1)]

Edit

The above will not work if the DataFrames have different types of index. In that case, compare the underlying NumPy arrays instead:

r, c = np.where(df1.values != df2.values)
stahamtan
  • You are assuming the index runs from 0 to n. Sometimes we have a different type of index, like ['a', 'b', ..., 'x']. – BENY Sep 24 '19 at 18:55
  • This already looks like a reasonable solution. Can we transform the result into a list of labeled indices (see my edited question)? Basically moving from `iloc` to `loc`. – Andi Sep 24 '19 at 19:28
  • You can retrieve the index and column names with `list(zip(df1.index[r], df1.columns[c]))`, which returns `[(Timestamp('2019-03-06 00:00:00'), 'C'), (Timestamp('2019-03-07 00:00:00'), 'B')]` – stahamtan Sep 24 '19 at 19:52
  • You can also use `r, c = np.nonzero(df1.values != df2.values)`. According to the docs (https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html), using np.nonzero is the preferred way. Anyway, I like that solution because from here I can easily retrieve the index and column names, as @stahamtan correctly pointed out. – Andi Sep 24 '19 at 20:51
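
Putting the comments above together, a minimal sketch might look like the following. It assumes both frames have the same shape; the `strftime` call is an addition here (not from the answer) to get plain date strings like those asked for in the question, and the names `positional` and `labeled` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frames from the question.
rng = pd.date_range('2019-03-04', periods=5)
cols = ['A', 'B', 'C', 'D']
df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=rng, columns=cols)
df2 = df1.copy()
df2.iloc[2, 2] = 100
df2.iloc[3, 1] = 50

# Compare the raw arrays so the result does not depend on index alignment.
r, c = np.nonzero(df1.values != df2.values)

# iloc-style positions.
positional = list(zip(r.tolist(), c.tolist()))
print(positional)  # [(2, 2), (3, 1)]

# loc-style labels, with the dates formatted as strings (assumption: the
# caller wants 'YYYY-MM-DD' strings rather than Timestamp objects).
labeled = list(zip(df1.index[r].strftime('%Y-%m-%d'), df1.columns[c]))
print(labeled)  # [('2019-03-06', 'C'), ('2019-03-07', 'B')]
```

Dropping the `strftime` call keeps the row labels as `Timestamp` objects, which can be passed straight to `df1.loc`.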