0

Assuming I've 2 data frames, df_a and df_b as below:

import pandas as pd

df_a

df_a = pd.DataFrame({"Letters": ['E', 'H', 'O', 'N', 'M', 'K', 'T', 'X'],
                     "Greek":['epsilon', 'eta', 'omicron', 'nu', 'mu', 
                              'kappa', 'tau', 'chi']})

     Greek Letters
0  epsilon       E
1      eta       H
2  omicron       O
3       nu       N
4       mu       M
5    kappa       K
6      tau       T
7      chi       X

df_b

df_b = pd.DataFrame({"Letters": ['Y', 'E', 'N', 'X', 'B']})
df_b["Greek"] = ""

  Letters Greek
0       Y      
1       E      
2       N      
3       X      
4       B      

I want to fill in column 'Greek' in df_b using corresponding values from df_a and I can do that using nested loops as follows:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]

df_b

Letters    Greek
0       Y         
1       E  epsilon
2       N       nu
3       X      chi
4       B     

First, I'm wondering if I can achieve the same result more efficiently using merge, zip, join or the other concatenation functions. Second, if I provide a matching else to the if statement, as follows:

else: df_b.loc[k, "Greek"] = float('nan')

I expected this would fill in the empty cells with NaN and not modify the other cells. Instead, I get something like:

df_b

  Letters Greek
0       Y   NaN
1       E   NaN
2       N   NaN
3       X   chi
4       B   NaN

I appreciate your thoughts on the 2 points. Any additional feedback is welcome.

sedeh
  • 7,083
  • 6
  • 48
  • 65

1 Answers1

2

The fastest method would be to set 'Letters' as the index for df_a and then call map:

In [11]:

df_a = df_a.set_index('Letters')
df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])
df_b
Out[11]:
  Letters    Greek
0       Y      NaN
1       E  epsilon
2       N       nu
3       X      chi
4       B      NaN

Just to demonstrate the flaw in your loop it is worth printing out what happens:

In [17]:

for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            print("match i:", i, "k:", k, "letter:", temp2)
        else:
            print("no match i:", i, "k:", k, "letter:", temp2)
no match i: 0 k: 0 letter: Y
match i: 0 k: 1 letter: E
no match i: 0 k: 2 letter: N
no match i: 0 k: 3 letter: X
no match i: 0 k: 4 letter: B
no match i: 1 k: 0 letter: Y
no match i: 1 k: 1 letter: E
no match i: 1 k: 2 letter: N
no match i: 1 k: 3 letter: X
no match i: 1 k: 4 letter: B
no match i: 2 k: 0 letter: Y
no match i: 2 k: 1 letter: E
no match i: 2 k: 2 letter: N
no match i: 2 k: 3 letter: X
no match i: 2 k: 4 letter: B
no match i: 3 k: 0 letter: Y
no match i: 3 k: 1 letter: E
match i: 3 k: 2 letter: N
no match i: 3 k: 3 letter: X
no match i: 3 k: 4 letter: B
no match i: 4 k: 0 letter: Y
no match i: 4 k: 1 letter: E
no match i: 4 k: 2 letter: N
no match i: 4 k: 3 letter: X
no match i: 4 k: 4 letter: B
no match i: 5 k: 0 letter: Y
no match i: 5 k: 1 letter: E
no match i: 5 k: 2 letter: N
no match i: 5 k: 3 letter: X
no match i: 5 k: 4 letter: B
no match i: 6 k: 0 letter: Y
no match i: 6 k: 1 letter: E
no match i: 6 k: 2 letter: N
no match i: 6 k: 3 letter: X
no match i: 6 k: 4 letter: B
no match i: 7 k: 0 letter: Y
no match i: 7 k: 1 letter: E
no match i: 7 k: 2 letter: N
match i: 7 k: 3 letter: X
no match i: 7 k: 4 letter: B

So although you make the initial match, you loop over those rows again and blitz them with NaN rather than skipping them.

Timings

In [22]:

df_a = df_a.set_index('Letters')
%timeit df_b['Greek'] = df_b['Letters'].map(df_a['Greek'])

1000 loops, best of 3: 710 µs per loop
In [24]:

%%timeit 
for i, row in df_a.iterrows():
    temp1 = row['Letters']
    for k, row in df_b.iterrows():
        temp2 = row['Letters']
        if temp1 == temp2:
            df_b.loc[k, "Greek"]=df_a.loc[i, "Greek"]
100 loops, best of 3: 12.7 ms per loop

Calling map here is nearly 18 times faster, this is a vectorised function and will scale much better.

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Thanks greatly. I see clearly now, esp. with the printouts. In terms of efficiency in this given example, how does the nested loops and map stack? Perhaps you can talk in terms of big O. – sedeh Dec 25 '14 at 17:13
  • Well your solution is O2 as you loop over 2 loops, this I suspect will be between On and nlogn, `map` is a vectorised function so will operate on the whole array rather than a row value at a time – EdChum Dec 25 '14 at 17:39
  • Thanks for the time analysis! I applied map in my actual dataset and for some reason my target column is not being filled in. I've something like this `df_1 = df_1.set_index("Visit"); df_2['Result'] = df_2['Visit'].map(df_1['Result'])`. Not that you can glean much from that. Any random thought why the Result column in df_2 might come in as blank? Before applying `map`, I ensured 'Visit' columns in both data frames contain [unique values](http://stackoverflow.com/questions/13352369/pandas-reindexing-only-valid-with-uniquely-valued-index-objects) using `groupby`. – sedeh Dec 25 '14 at 19:14
  • Try printing each line to see what is being returned, also does it work when running your loop code? One possibility is you have no matches or you may have done something prior to the operation but I can't guess what – EdChum Dec 25 '14 at 19:32
  • Oops, an extra whitespace somehow maybe into one of the rows. Once I corrected that, the values came roaring in as supposed. – sedeh Dec 25 '14 at 22:12