5

I've two dataframes in pandas as shown below. EmpID is a primary key in both dataframes.

df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.NaN],[3,np.NaN,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.NaN],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.NaN],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])

I want to join these two dataframes with EmpID so that

  1. Missing data in one dataframe can be filled with value from another table if exists and key matches
  2. If there are observations with new keys then they should be appended in the resulting dataframe

I've used below code for achieving this.

merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])

But this code gives me duplicate columns which I don't want so I only used unique columns from both tables for merging.

ColNames = list(df_second.columns.difference(df_first.columns))
ColNames.append('EmpID')
merged_df = pd.merge(df_first,df_second,how='outer',on=['EmpID'])

Now I don't get duplicate columns but don't get value either in observations where key matches.

I'll really appreciate if someone can help me with this.

Regards, Kailash Negi

Kailash
  • 167
  • 1
  • 2
  • 12

1 Answers1

7

It seems you need combine_first with set_index for match by indices created by columns EmpID:

df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
   EmpID   Department  Location Name  Salary
0      1           HR     Delhi    A  1000.0
1      2          NaN       NaN    B     NaN
2      3      Finance       NaN    C  3000.0
3      4          NaN       NaN    D  8000.0
4      5  Programming       NaN    E  6000.0
5      8        Admin    Mumbai    B     NaN
6      9          Ops  Banglore    D     NaN
7     10    Analytics    Mumbai    K     NaN

EDIT:

For some order of columns need reindex:

#concatenate all columns names togetehr and remove dupes
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')

df = (df_first.set_index('EmpID')
      .combine_first(df_second.set_index('EmpID'))
      .reset_index()
      .reindex(columns=ColNames))
print (df)
   EmpID Name   Department  Location  Salary
0      1    A           HR     Delhi  1000.0
1      2    B          NaN       NaN     NaN
2      3    C      Finance       NaN  3000.0
3      4    D          NaN       NaN  8000.0
4      5    E  Programming       NaN  6000.0
5      8    B        Admin    Mumbai     NaN
6      9    D          Ops  Banglore     NaN
7     10    K    Analytics    Mumbai     NaN
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This solution serves the purpose but just wanted to check if we can preserve the order of columns. So first we should have columns from first dataframe and then second dataframe. Also am still thinking if it can be done using outer join. – Kailash Nov 27 '17 at 07:01
  • Not sure if understand, why? Can you explain more? – jezrael Nov 27 '17 at 07:02
  • My real dataset has around 200 columns and they are set in a particular order. The dataset has been in this order only for long time now and performing this operation is resulting in sorted column order which people working on this dataset are not used to. So I was just thinking if there is a way we can preserve the columns order. – Kailash Nov 27 '17 at 07:07
  • I add solution for it. Need som way create list of all columns in order what you want, but avoid some functions like [`intersection`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.intersection.html) or [`union`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.union.html), because they sort values. And last reindex. – jezrael Nov 27 '17 at 07:24