
I have two data frames with different column names, with 10 rows each. What I'm trying to do is compare the column values and, if they match, copy the email address from df2 to df1. I've looked at How to join (merge) data frames (inner, outer, left, right)?, but my column names are different. I've also seen an example of np.where with more than one condition, but when I do that it gives me the following error:

ValueError: Wrong number of items passed 2, placement implies 1

What I want to do:

I want to compare the first row's two columns (first, last_huge) of df1 with all rows of df2's columns (first_small, last_small); if a match is found, get the email address from that particular row of df2 and assign it to a new column in df1. Can anyone please help me with this? I've only copied the relevant code below; it's not fully working, only adding 5 new records to new_email.

Initially, what I did was compare df1['first'] with df2['first_small']:

data1 = {"first":["alice", "bob", "carol"],
         "last_huge":["foo", "bar", "baz"],
         "street_huge": ["Jaifo Road", "Wetib Ridge", "Ucagi View"],
         "city_huge": ["Egviniw", "Manbaali", "Ismazdan"],
         "age_huge": ["23", "30", "36"],
         "state_huge": ["MA", "LA", "CA"],
         "zip_huge": ["89899", "78788", "58999"]}

df1 = pd.DataFrame(data1)

data2 = {"first_small":["alice", "bob", "carol"],
         "last_small":["foo", "bar", "baz"],
         "street_small": ["Jsdffo Road", "sdf Ridge", "sdfff View"],
         "city_huge": ["paris", "london", "rome"],
         "age_huge": ["28", "40", "56"],
         "state_huge": ["GA", "EA", "BA"],
         "zip_huge": ["89859", "78728", "56999"],
         "email_small":["alice@xyz.com", "bob@abc.com", "carol@jkl.com"],
         "dob": ["31051989", "31051980", "31051981"],
         "country": ["UK", "US", "IT"],
         "company": ["microsoft", "apple", "google"],
         "source": ["bing", "yahoo", "google"]}

df2 = pd.DataFrame(data2)

df1['new_email'] = np.where((df1[['first']] == df2[['first_small']]), df2[['email_small']], np.nan)

Now it is only adding 5 records to new_email, the rest are NaN, and it shows me this error:

ValueError: Can only compare identically-labeled Series objects
2 Answers


Try merge:

(df1.merge(df2[["first_small", "last_small", "email_small"]], 
           how="left", 
           left_on=["first", "last_huge"], 
           right_on=["first_small", "last_small"])
    .drop(['first_small', 'last_small'], axis=1))

Example:

data1 = {"first":["alice", "bob", "carol"], 
         "last_huge":["foo", "bar", "baz"]}
df1 = pd.DataFrame(data1)

data2 = {"first_small":["alice", "bob", "carol"], 
         "last_small":["foo", "bar", "baz"],
         "email_small":["alice@xyz.com", "bob@abc.com", "carol@jkl.com"]}
df2 = pd.DataFrame(data2)

(df1.merge(df2[["first_small", "last_small", "email_small"]], 
           how="left", 
           left_on=["first", "last_huge"], 
           right_on=["first_small", "last_small"])
    .drop(['first_small', 'last_small'], axis=1))

Output:

   first last_huge    email_small
0  alice       foo  alice@xyz.com
1    bob       bar    bob@abc.com
2  carol       baz  carol@jkl.com
andrew_reece
  • You're welcome! If this solution resolved your issue, please mark it accepted by clicking the checkmark next to the answer. – andrew_reece Oct 15 '17 at 21:13
  • One question: if the number of rows in df1 is different from df2, it won't be a problem, right? – Wcan Oct 15 '17 at 22:26
  • As long as each first-last name pair only has one email address in `df2`, then it shouldn't matter if `df1` and `df2` are of different lengths. (I think that's what you mean by chunk size.) If some rows in `df1` don't have a matching email in `df2`, the `email_small` field will show up as `NaN`. – andrew_reece Oct 15 '17 at 22:33
  • There is a small problem: in the actual file, the column name of data1 "first" is "first_name", and the column name of data2 "first_small" is also "first_name". The drop statement is dropping the column from both data1 and data2. How do I keep the column from data1 and drop it only from data2? – Wcan Oct 24 '17 at 22:02
  • If you want to keep `first_name`, just include `first_name` in the `on=[]` list, and remove it from the `drop` list. The column will not be duplicated. – andrew_reece Oct 24 '17 at 22:09
  • OK, it's working, thank you so much! But what's the logic behind it? I removed "first_name" from the drop statement and didn't delete it from left_on and right_on, and it only keeps the column from left_on. Why not right_on? – Wcan Oct 24 '17 at 22:37
  • You're welcome. When you do a left merge (`how='left'`), you're basically keeping everything in `left` for `left.merge(right, ...)` and adding data from `right` where there's a match for the `on` variables. If there are `first_name` values in `right` that don't match `first_name` in `left`, they will be dropped in the merged output. – andrew_reece Oct 24 '17 at 22:45
  • Awesome, thank you, really appreciate it. – Wcan Oct 24 '17 at 22:49
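To illustrate the points in the comments above, here is a small sketch (the `first_name` and `email` columns are hypothetical stand-ins): when both frames share the key column name, `on=` keeps a single key column in the output, and with `how="left"` any left row without a match gets NaN:

```python
import pandas as pd

left = pd.DataFrame({"first_name": ["alice", "bob", "dora"]})
right = pd.DataFrame({"first_name": ["alice", "bob"],
                      "email": ["alice@xyz.com", "bob@abc.com"]})

# on= (rather than left_on/right_on) avoids a duplicated key column,
# so there is nothing left over to drop.
merged = left.merge(right, how="left", on="first_name")
print(merged)
# dora has no match in `right`, so her email is NaN.
```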

Using andrew_reece's example data :-) with pd.concat:

pd.concat([df1.set_index(["first", "last_huge"]),
           df2.set_index(["first_small", "last_small"])['email_small']],
          axis=1).reset_index().dropna()
Out[23]: 
   first last_huge    email_small
0  alice       foo  alice@xyz.com
1    bob       bar    bob@abc.com
2  carol       baz  carol@jkl.com

Using your data:

pd.concat([df1.set_index(["first", "last_huge"]),
           df2.set_index(["first_small", "last_small"])['email_small']],
          axis=1).reset_index()
Out[97]: 
   first last_huge age_huge city_huge state_huge  street_huge zip_huge  \
0  alice       foo       23   Egviniw         MA   Jaifo Road    89899   
1    bob       bar       30  Manbaali         LA  Wetib Ridge    78788   
2  carol       baz       36  Ismazdan         CA   Ucagi View    58999   
     email_small  
0  alice@xyz.com  
1    bob@abc.com  
2  carol@jkl.com  
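
A note on how this works (my gloss, not part of the original answer): `pd.concat(..., axis=1)` aligns on the index, performing an outer join, so a key present in only one frame yields NaN for the other frame's columns:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=["alice", "bob"])
b = pd.Series([10], index=["alice"], name="y")

# axis=1 concat joins on index labels; "bob" exists only in `a`,
# so his `y` value comes out as NaN.
out = pd.concat([a, b], axis=1)
print(out)
```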

Update: using map

df1['email_small'] = (df1['first'] + df1['last_huge']).map(
    df2.set_index(df2['first_small'] + df2['last_small'])['email_small'])
df1
Out[115]: 
  age_huge city_huge  first last_huge state_huge  street_huge zip_huge  \
0       23   Egviniw  alice       foo         MA   Jaifo Road    89899   
1       30  Manbaali    bob       bar         LA  Wetib Ridge    78788   
2       36  Ismazdan  carol       baz         CA   Ucagi View    58999   
     email_small  
0  alice@xyz.com  
1    bob@abc.com  
2  carol@jkl.com  
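One caveat with the key-concatenation trick above (my addition, not part of the original answer): gluing two name columns together directly can make distinct pairs collide. Inserting a separator that cannot appear in the names keeps the keys distinct:

```python
import pandas as pd

df_a = pd.DataFrame({"first": ["ab", "a"], "last": ["c", "bc"]})

# Without a separator, ("ab", "c") and ("a", "bc") both become "abc".
plain = df_a["first"] + df_a["last"]
print(plain.tolist())  # ['abc', 'abc']

# With a separator, the two keys stay distinct.
safe = df_a["first"] + "|" + df_a["last"]
print(safe.tolist())   # ['ab|c', 'a|bc']
```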
BENY