
I have two data frames with different column names, with 10 rows each. What I'm trying to do is compare the column values and, if they match, copy the email address from df2 to df1. I've looked at How to join (merge) data frames (inner, outer, left, right)?, but my column names are different. I've also seen an example of np.where with more than one condition, but when I do that it gives me the following error:

ValueError: Wrong number of items passed 2, placement implies 1

What I want to do:

I want to compare the first row's two columns (first, last_huge) of df1 with all rows of df2's columns (first_small, last_small); if a match is found, get the email address from that particular row of df2 and assign it to a new column in df1. Can anyone please help me with this? I've only copied the relevant code below; it's not fully working, only adding 5 new records to new_email.

Initially, what I did was compare df1['first'] with df2['first_small']:

data1 = {"first":["alice", "bob", "carol"],
         "last_huge":["foo", "bar", "baz"],
         "street_huge": ["Jaifo Road", "Wetib Ridge", "Ucagi View"],
         "city_huge": ["Egviniw", "Manbaali", "Ismazdan"],
         "age_huge": ["23", "30", "36"],
         "state_huge": ["MA", "LA", "CA"],
         "zip_huge": ["89899", "78788", "58999"]}

df1 = pd.DataFrame(data1)

data2 = {"first_small":["alice", "bob", "carol"],
         "last_small":["foo", "bar", "baz"],
         "street_small": ["Jsdffo Road", "sdf Ridge", "sdfff View"],
         "city_huge": ["paris", "london", "rome"],
         "age_huge": ["28", "40", "56"],
         "state_huge": ["GA", "EA", "BA"],
         "zip_huge": ["89859", "78728", "56999"],
         "email_small":["alice@xyz.com", "bob@abc.com", "carol@jkl.com"],
         "dob": ["31051989", "31051980", "31051981"],
         "country": ["UK", "US", "IT"],
         "company": ["microsoft", "apple", "google"],
         "source": ["bing", "yahoo", "google"]}

df2 = pd.DataFrame(data2)

df1['new_email'] = np.where((df1[['first']] == df2[['first_small']]), df2[['email_small']], np.nan)

Now it is only adding 5 records to new_email, the rest are NaN, and it shows me this error:

ValueError: Can only compare identically-labeled Series objects
2 Answers


Try merge:

(df1.merge(df2[["first_small", "last_small", "email_small"]], 
           how="left", 
           left_on=["first", "last_huge"], 
           right_on=["first_small", "last_small"])
    .drop(['first_small', 'last_small'], axis=1))

Example:

data1 = {"first":["alice", "bob", "carol"], 
         "last_huge":["foo", "bar", "baz"]}
df1 = pd.DataFrame(data1)

data2 = {"first_small":["alice", "bob", "carol"], 
         "last_small":["foo", "bar", "baz"],
         "email_small":["alice@xyz.com", "bob@abc.com", "carol@jkl.com"]}
df2 = pd.DataFrame(data2)

(df1.merge(df2[["first_small", "last_small", "email_small"]], 
           how="left", 
           left_on=["first", "last_huge"], 
           right_on=["first_small", "last_small"])
    .drop(['first_small', 'last_small'], axis=1))

Output:

   first last_huge    email_small
0  alice       foo  alice@xyz.com
1    bob       bar    bob@abc.com
2  carol       baz  carol@jkl.com
andrew_reece
  • You're welcome! If this solution resolved your issue, please mark it accepted by clicking the checkmark next to the answer. – andrew_reece Oct 15 '17 at 21:13
  • One question: if the number of rows in df1 is different from df2, it won't be a problem, right? – Wcan Oct 15 '17 at 22:26
  • As long as each first-last name pair only has one email address in `df2`, then it shouldn't matter if `df1` and `df2` are of different lengths. (I think that's what you mean by chunk size.) If some rows in `df1` don't have a matching email in `df2`, the `email_small` field will show up as `NaN`. – andrew_reece Oct 15 '17 at 22:33
  • There is a small problem: in the actual file, the column name of data1 "first" is "first_name", and the column name of data2 "first_small" is also "first_name". The drop statement is dropping the column from both data1 and data2. How do I keep the column from data1 and drop it only from data2? – Wcan Oct 24 '17 at 22:02
  • If you want to keep `first_name`, just include `first_name` in the `on=[]` list, and remove it from the `drop` list. The column will not be duplicated. – andrew_reece Oct 24 '17 at 22:09
  • OK, it's working, thank you so much! But what's the logic behind it? I removed "first_name" from the drop statement and didn't delete it from left_on and right_on, and it only keeps the column from left_on. Why not right_on? – Wcan Oct 24 '17 at 22:37
  • You're welcome. When you do a left merge (`how='left'`), you're basically keeping everything in `left` for `left.merge(right, ...)` and adding data from `right` where there's a match for the `on` variables. If there are `first_name` values in `right` that don't match `first_name` in `left`, they will be dropped in the merged output. – andrew_reece Oct 24 '17 at 22:45
  • Awesome, thank you, really appreciate it. – Wcan Oct 24 '17 at 22:49
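To illustrate the points in the comments above, here is a small sketch (the `first_name` and `email` columns are hypothetical stand-ins): when both frames share the key column name, `on=` keeps a single key column in the output, and with `how="left"` any left row without a match gets NaN:

```python
import pandas as pd

left = pd.DataFrame({"first_name": ["alice", "bob", "dora"]})
right = pd.DataFrame({"first_name": ["alice", "bob"],
                      "email": ["alice@xyz.com", "bob@abc.com"]})

# on= (rather than left_on/right_on) avoids a duplicated key column,
# so there is nothing left over to drop.
merged = left.merge(right, how="left", on="first_name")
print(merged)
# dora has no match in `right`, so her email is NaN.
```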

Using andrew_reece's example data :-) with pd.concat:

pd.concat([df1.set_index(["first", "last_huge"]),
           df2.set_index(["first_small", "last_small"])['email_small']],
          axis=1).reset_index().dropna()
Out[23]: 
   first last_huge    email_small
0  alice       foo  alice@xyz.com
1    bob       bar    bob@abc.com
2  carol       baz  carol@jkl.com

Using your data:

pd.concat([df1.set_index(["first", "last_huge"]),
           df2.set_index(["first_small", "last_small"])['email_small']],
          axis=1).reset_index()
Out[97]: 
   first last_huge age_huge city_huge state_huge  street_huge zip_huge  \
0  alice       foo       23   Egviniw         MA   Jaifo Road    89899   
1    bob       bar       30  Manbaali         LA  Wetib Ridge    78788   
2  carol       baz       36  Ismazdan         CA   Ucagi View    58999   
     email_small  
0  alice@xyz.com  
1    bob@abc.com  
2  carol@jkl.com  
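
A note on how this works (my gloss, not part of the original answer): `pd.concat(..., axis=1)` aligns on the index, performing an outer join, so a key present in only one frame yields NaN for the other frame's columns:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["alice", "bob"])
b = pd.Series([10], index=["alice"], name="y")

# axis=1 concat joins on index labels; "bob" exists only in `a`,
# so his `y` value comes out as NaN.
out = pd.concat([a, b], axis=1)
print(out)
```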

Update: using map

df1['email_small'] = (df1['first'] + df1['last_huge']).map(
    df2.set_index(df2['first_small'] + df2['last_small'])['email_small'])
df1
Out[115]: 
  age_huge city_huge  first last_huge state_huge  street_huge zip_huge  \
0       23   Egviniw  alice       foo         MA   Jaifo Road    89899   
1       30  Manbaali    bob       bar         LA  Wetib Ridge    78788   
2       36  Ismazdan  carol       baz         CA   Ucagi View    58999   
     email_small  
0  alice@xyz.com  
1    bob@abc.com  
2  carol@jkl.com  
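One caveat with the key-concatenation trick above (my addition, not part of the original answer): gluing two name columns together directly can make distinct pairs collide. Inserting a separator that cannot appear in the names keeps the keys distinct:

```python
import pandas as pd

df_a = pd.DataFrame({"first": ["ab", "a"], "last": ["c", "bc"]})

# Without a separator, ("ab", "c") and ("a", "bc") both become "abc".
plain = df_a["first"] + df_a["last"]
print(plain.tolist())  # ['abc', 'abc']

# With a separator, the two keys stay distinct.
safe = df_a["first"] + "|" + df_a["last"]
print(safe.tolist())   # ['ab|c', 'a|bc']
```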
BENY