
I am working in Microsoft Azure Databricks with two DataFrames.

I already have a DataFrame which contains my "master data". I also receive a full data package daily with "all" records, but within this DataFrame records can be changed, and records can also be deleted or added.

What is the best and maybe easiest way to get this delta or changeset between the two DataFrames?

UPDATE: DataFrame 1 -> which I receive every day

customer  score
MERCEDES  1.1
CHRYSLER  3.0

DataFrame 2 -> my master

customer score
BMW       1.1
MERCEDES  1.3

So what I need to get is:

customer score
BMW       1.1    -> because it was deleted in the received data
MERCEDES  1.3    -> because the value was changed
CHRYSLER  3.0    -> because it was newly added
STORM
  • Have you tried the merge function? – MEdwin Nov 07 '18 at 10:21
  • Thanks for your comment, but I don't want to merge both DataFrames; instead I want to compare them and extract the differences. – STORM Nov 07 '18 at 10:24
  • Okay, see if you can put together some sample DataFrames to reflect the type of data you are getting. It will give us an idea of what you are trying to achieve. – MEdwin Nov 07 '18 at 10:26

3 Answers


Here is the merge function. See if it works for you.

import pandas as pd

df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})  # daily feed
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})       # master

# outer merge keeps every customer; indicator=True adds a _merge column
# that tells you whether a row came from df1 only, df2 only, or both
df = pd.merge(df1, df2, on=['customer'], how='outer', indicator=True)
df

see the result:

   customer  score_x  score_y      _merge
0  MERCEDES      1.1      1.3        both
1  CHRYSLER      3.0      NaN   left_only
2       BMW      NaN      1.1  right_only
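As a follow-up sketch (the names `added`, `deleted`, `changed`, and `result` are mine, not from the answer): the `_merge` indicator column can be filtered to split the outer-merge result into the three separate DataFrames the question asks for, and those pieces can be reassembled into the desired changeset table.

```python
import pandas as pd

df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})  # daily feed
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})       # master

df = pd.merge(df1, df2, on=['customer'], how='outer', indicator=True)

# left_only  -> only in the daily feed (newly added record)
# right_only -> only in the master (deleted from the feed)
# both       -> present in both; compare the scores to find changes
added = df[df['_merge'] == 'left_only']
deleted = df[df['_merge'] == 'right_only']
changed = df[(df['_merge'] == 'both') & (df['score_x'] != df['score_y'])]

# rebuild the single changeset table from the question: master values
# (score_y) for deleted/changed rows, feed values (score_x) for added rows
result = pd.concat([
    deleted[['customer', 'score_y']].rename(columns={'score_y': 'score'}),
    changed[['customer', 'score_y']].rename(columns={'score_y': 'score'}),
    added[['customer', 'score_x']].rename(columns={'score_x': 'score'}),
], ignore_index=True)
```

Since the scores are floats, you may prefer a tolerance-based comparison (e.g. `numpy.isclose`) instead of `!=` when deciding what counts as "changed".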

MEdwin

A solution is to append a received-date column 'RDate' to each of the DataFrames, then concatenate the two. Using sort_values you can then pick the latest value per customer.

import pandas as pd

df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})
df2['RDate'] = '2018-11-01'  # master: older received date
df1['RDate'] = '2018-11-07'  # daily feed: newer received date

mdf = pd.concat([df1, df2])
mdfs = mdf.sort_values(by='RDate')
# keep only the most recent row per customer
mdfs.drop_duplicates('customer', keep='last')

original concatenated DataFrames:

   customer  score       RDate
0  MERCEDES    1.1  2018-11-07
1  CHRYSLER    3.0  2018-11-07
0       BMW    1.1  2018-11-01
1  MERCEDES    1.3  2018-11-01

final (duplicates dropped, latest RDate wins):

   customer  score       RDate
0       BMW    1.1  2018-11-01
0  MERCEDES    1.1  2018-11-07
1  CHRYSLER    3.0  2018-11-07
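A variation on the same concat idea (a sketch; the name `changeset` is mine): dropping every (customer, score) pair that occurs in both DataFrames leaves only the rows that differ. Note that this keeps both the old and the new value for a changed customer, and in this toy data no pair is identical in both frames, so all four rows survive; a customer whose score is unchanged would be dropped.

```python
import pandas as pd

df1 = pd.DataFrame({'customer': ['MERCEDES', 'CHRYSLER'], 'score': [1.1, 3.0]})  # daily feed
df2 = pd.DataFrame({'customer': ['BMW', 'MERCEDES'], 'score': [1.1, 1.3]})       # master

mdf = pd.concat([df1, df2])

# a (customer, score) pair present in both frames is unchanged -> drop it;
# keep=False removes every member of a duplicate group, not just the repeats
changeset = mdf.drop_duplicates(['customer', 'score'], keep=False)
```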

MEdwin
  • MEdwin thanks for the work! But this is the final result of the data. What I need is to really get the changed, added and deleted records as separate DataFrames so that I can work with them. Is this also possible? – STORM Nov 07 '18 at 12:34
  • Okay, I think what you need is just the merge function. It will let you know if the score is contained in both the new and the old DataFrame, and left_only means it is a new entry in the daily feed. – MEdwin Nov 07 '18 at 12:50

You can also try it like this:

>>> df1
   customer score
0  MERCEDES   1.1
1  CHRYSLER   3.0

>>> df2
   customer score
0       BMW   1.1
1  MERCEDES   1.3

Concatenate df1 & df2 along the rows (axis=0):

>>> df = pd.concat([df1,df2], axis=0)

Now you can use groupby with customer as the key (since that's the key for the update), together with the last() method, which takes the last entry for each customer:

>>> df.groupby(['customer'], as_index=False).last()
   customer score
0       BMW   1.1
1  CHRYSLER   3.0
2  MERCEDES   1.3

as_index=False effectively gives "SQL-style" grouped output when used with groupby.

Karn Kumar