Comparing two DataFrames in Pandas

Question

hello people (I am new to python) Question: how can I compare two DataFrames that have the same index but different formats.

Here are two examples of the DataFrames:

datum_von   datum_bis   NEG_00_04                     
2017-10-12  2017-10-12      21.69
2017-10-13  2017-10-13      11.85
2017-10-14  2017-10-14       7.83
2017-10-15  2017-10-15      14.64
2017-10-16  2017-10-16       5.11
2017-10-17  2017-10-17      12.09
2017-10-18  2017-10-18       8.47
2017-10-19  2017-10-19       6.34
2017-10-20  2017-10-20       7.68
2017-10-21  2017-10-21      13.40
2017-10-22  2017-10-22      25.53


datum_von   NEG_00_04  NEG_04_08  NEG_08_12  NEG_12_16  NEG_16_20  NEG_20_24                                                                 
2017-10-12      21.69      15.36       0.87       1.42       0.76       0.65   
2017-10-13      11.85       8.08       1.39       2.86       1.02       0.55   
2017-10-14       7.83       5.88       1.87       2.04       2.29       2.18   
2017-10-15      14.64      11.28       2.62       3.35       2.13       1.25   
2017-10-16       5.11       5.82       0.00       0.00       0.00       0.00   
2017-10-17      12.09       9.61       0.20       1.09       0.39       0.57   
2017-10-18       8.47       7.39       0.24       3.51       0.48       0.61

Note that they have the same index but not the same name for columns (I tried to use d.columns = list_of_names also did not worked it )

they look different but they are the same (contain the same information) but I did not copy all the data.I would like to know which values are different and how much is the difference

desired output: I would like to have the difference of the two date frame values cell by cell I mean cell[0][0] from d1 - cell[0][0] from df2 all the results in a new date frame

ok: I organize and now they have somehow the same organization

dfres = res0.pivot(index='datum_von', columns='produktname')['max']

Sorry, I am lost - do you want comapre one column which all columns? Arr all same columns from one dataframe with all same columns with another? — jezrael, Oct 27 '17 at 11:22
@jezrael oki i tried again so sorry for my bad question style — May Emanuel, Oct 27 '17 at 11:33
OK, so each dataframe have different index values and different columns? There is multiple columns in `df1`? And want compare only same rows in both dataframes and same columns ? — jezrael, Oct 27 '17 at 11:45
yes, I guess I am close to find the way. I subtracted and now I would like to identify which datum_von and product the difference has a negative value. I am not sure how to do that though — May Emanuel, Oct 27 '17 at 11:49

score 1 · Accepted Answer · edited Oct 30 '17 at 08:29

1

Solution used

Better is use:

dfres = res0.pivot(index='datum_von', columns='produktname', values='max')
dfdiff=dfres.sub(finaldf)

And then:

mask = dfdiff < 0
df = dfdiff.loc[mask.any(axis=1), mask.any()]]

Also is possible filter original DataFrames, if same columns and index values as dfdiff:

dfa = dfres.loc[mask.any(axis=1), mask.any()]]
dfb = finaldf.loc[mask.any(axis=1), mask.any()]]

edited Oct 30 '17 at 08:29

May Emanuel

133
8

answered Oct 27 '17 at 10:56

jezrael

822,522
95
1,334
1,252

GREAT solution! I guess this is the function I was looking for – May Emanuel Oct 27 '17 at 12:00
Not sure if understand - what is `product` ? some column? level of Multiindex ? – jezrael Oct 27 '17 at 12:02
I get values positive also not only the negatives with the mask. How do I know which ones are higher than zero then? – May Emanuel Oct 27 '17 at 12:32
I dont see your data, so hard answer. Do you think why dont get all data in ouput only negative? If yes, it is because working with arrays. But not sure if understan you, sorry. Maybe the best is add some sample data and desired output or explain problem with some data. – jezrael Oct 27 '17 at 12:41
you are so right! you solution worked perfectly!!! thank you so much learning about mask right now – May Emanuel Oct 27 '17 at 13:08
You are welcome! I have one question - I am really curious about [this](https://stackoverflow.com/q/46970671/2901002) question - why my solution is not good? It not working for you? Because I think my solution is best here - the fastes, the easiest, the more readable. So i think there have to be some reason why you use `applymap` solution and I dont know it. Can you explain it? thanks. – jezrael Oct 27 '17 at 13:12
first of all, sorry for my late reply.Yes, your solution is the best as is the fastest and the easiest. I only wrote the other in my solution as I was learning about applymap so to get used. But I will update also posting your solution as reasonable. **Thank you so much** – May Emanuel Oct 30 '17 at 08:24

score 0 · Answer 2 · answered Oct 27 '17 at 10:59

I agree that you should be more specific about your desired output, however it most likely go along these lines

 cols_overlap = [ c for c in df1.columns if c in df2.columns]
 dfn = df1.merge(df2, how='inner', left_index=True, right_index=True, suffixes=('_df1','_df2'))

 for c in cols_overlap:
      print('%s' % c)
      print((dfm[c+'_df1']==dfm[c+'_df2'] ).all())

score 0 · Answer 3 · answered Oct 27 '17 at 11:47

0

I used this to organize my data frame to be in the same format

dfres = res0.pivot(index='datum_von', columns='produktname')['max']

Then I used this to compute the difference

dfdiff=dfres.subtract(finaldf)

answered Oct 27 '17 at 11:47

May Emanuel

133
8

Comparing two DataFrames in Pandas

3 Answers3