
I am trying to merge two datasets, X and Y. Dataset X has a joining-key column that contains duplicate values. Dataset Y has the same joining-key column plus one additional column. The dataset images are attached below. The problem is that I want to avoid the Cartesian-product-like fan-out caused by the duplicates in dataset X; I have also attached an image of the desired resultant dataset. This could be done by merging manually with a for loop (a rough sketch is below), but that is time-consuming. Can anyone suggest a better method?

[Image: All DataSets]
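A rough sketch of the loop I mean (the frame names df_x/df_y and the column names EMM_ID/ID_home are assumptions based on the images and the comments below, with toy stand-in data):

import pandas as pd
import numpy as np

# Toy stand-ins for the datasets in the image (names/values assumed).
df_x = pd.DataFrame({'EMM_ID': [610462, 610462, 61000],
                     'ID_home': [np.nan, np.nan, np.nan]})
df_y = pd.DataFrame({'EMM_ID': [610462, 61000], 'ID_home': [80100, 18]})

# For each row of Y, fill only the first not-yet-filled matching row in X,
# so the duplicates in X do not all receive the same value.
for _, y_row in df_y.iterrows():
    matches = df_x.index[(df_x['EMM_ID'] == y_row['EMM_ID'])
                         & df_x['ID_home'].isna()]
    if len(matches):
        df_x.loc[matches[0], 'ID_home'] = y_row['ID_home']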

Reblochon Masque
  • The more I read your question the less I understand what you are asking. You want to merge two `DataFrames`? Why should it use a Cartesian product to do that? See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) how merging works. No product is involved by default. – Shintlor Sep 11 '18 at 14:34
  • By Cartesian product I mean, **all** the EMM_ID rows with value 610462 will get the ID_Home value of 80100. Instead, I want **only the first** EMM_ID 610462 to get the value 80100, and the rest of the EMM_ID 610462 rows should be NaN. I hope I am making sense now; it's tough to explain. Thank you – Anunay Sanganal Sep 11 '18 at 17:58

2 Answers


Using @ALollz's setup:

import pandas as pd
import numpy as np

df_x = pd.DataFrame({'EMM_ID': [610462, 610462, 610462, 610462, 61000, 61000],
                     'ID_home': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df_y = pd.DataFrame({'EMM_ID': [610462, 61000], 'ID_home': [81000, 18]})

You could create a new key column to join on with `cumcount`:

# Number the occurrences of each EMM_ID (0, 1, 2, ...) so duplicate
# keys become distinct (EMM_ID, colkey) pairs.
df_x = df_x.assign(colkey=df_x.groupby('EMM_ID').cumcount())
df_y = df_y.assign(colkey=df_y.groupby('EMM_ID').cumcount())

df_x[['EMM_ID','colkey']].merge(df_y, on=['EMM_ID','colkey'], how='left')

Output:

   EMM_ID  colkey  ID_home
0  610462       0  81000.0
1  610462       1      NaN
2  610462       2      NaN
3  610462       3      NaN
4   61000       0     18.0
5   61000       1      NaN
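As a side note (not part of the original answer): if you want the looked-up values written back into df_x itself, you can assign the merged column, since `how='left'` preserves df_x's row order. A sketch, assuming df_x/df_y from above with the colkey column already added:

# Assign the merged column back; .values avoids index alignment issues.
df_x['ID_home'] = df_x[['EMM_ID', 'colkey']].merge(
    df_y, on=['EMM_ID', 'colkey'], how='left')['ID_home'].values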
Scott Boston

In this case, since you only need to bring over one column, `.map` is probably more suitable. Take the first row within each `EMM_ID` group and map only those values; alignment on the index ensures the remaining rows become `NaN`.

Sample Data

import pandas as pd
import numpy as np
df_x = pd.DataFrame({'EMM_ID': [610462, 610462, 610462, 610462, 61000, 61000],
                     'ID_home': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df_y = pd.DataFrame({'EMM_ID': [610462, 61000], 'ID_home': [81000, 18]})

Code

# Map ID_home from df_y onto only the first row of each EMM_ID group;
# index alignment leaves every other row as NaN.
df_x['ID_home'] = df_x.groupby('EMM_ID').head(1).EMM_ID.map(df_y.set_index('EMM_ID').ID_home)

Output: df_x

   EMM_ID  ID_home
0  610462  81000.0
1  610462      NaN
2  610462      NaN
3  610462      NaN
4   61000     18.0
5   61000      NaN
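An equivalent way to express "first row per group" (my own variation, not part of the answer) is a `duplicated` mask, which avoids the groupby entirely and produces the same result on the fresh df_x:

first = ~df_x['EMM_ID'].duplicated()  # True only on each key's first row
df_x.loc[first, 'ID_home'] = df_x.loc[first, 'EMM_ID'].map(
    df_y.set_index('EMM_ID')['ID_home'])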

If you need to bring over multiple columns, you could split your DataFrame, merge the subset, and then concatenate back into one DataFrame.

# Rebuild df_x (the step above filled it in), and give df_y a second column.
df_x = pd.DataFrame({'EMM_ID': [610462, 610462, 610462, 610462, 61000, 61000],
                     'ID_home': [np.NaN, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN]})
df_y = pd.DataFrame({'EMM_ID': [610462, 61000], 'ID_home': [81000, 18], 'Val_2': ['A', 'F']})

# First row of each EMM_ID group gets merged; the rest are kept as-is.
to_merge = df_x.groupby('EMM_ID').head(1)
keep = df_x[~df_x.index.isin(to_merge.index)]

merged = to_merge[['EMM_ID']].merge(df_y, on='EMM_ID', how='left')
merged.index = to_merge.index  # restore the original row positions

pd.concat([keep, merged], sort=False).sort_index()

Output:

   EMM_ID  ID_home Val_2
0  610462  81000.0     A
1  610462      NaN   NaN
2  610462      NaN   NaN
3  610462      NaN   NaN
4   61000     18.0     F
5   61000      NaN   NaN
ALollz