Identifying similar values by iterating between two Pandas dataframes.

Question

I have 2 Pandas dataframes which are of unequal length. I have quoted an example below.My code should run through the value of apples in the 1st data frame and locate if it exists in the 2nd one(there will always be a value existing in the 2nd dataframe). If it finds a same value then it should store the difference of oranges in the 2 data frames into the 1st data frame. I have performed this task using 2 for loops which is also given below. The code below does the task but my actual data has 2 million entries and the second data frame has 800 entries. To use 2 for loops slows my program a lot. Is there a more efficient way of doing this task?

trial={'apples': [2,4,1,5,3,2,1,1,4,5],'oranges': [8,5,9,4,2,6,7,5,1,3]}
trial1={'apples': [1,2,3,4,5],'oranges': [2,5,6,3,1]}
df=pd.DataFrame.from_dict(trial)
df1=pd.DataFrame.from_dict(trial1)
F=[]
for i in df.apples.index:
    for j in df1.apples.index:
        if  df.apples.ix[i]== df1.apples.ix[j]:
           F.append(df.oranges.ix[i]-df1.oranges.ix[j])                 
df['difference']=F

EdChum · Accepted Answer · 2017-02-02T11:54:36.720

You can perform a left type merge on the apples column, you can then call diff on the clashing columns orange_x and orange_y, and then invert the sign using - and cast to int using astype:

In [159]:
df['difference'] = -df.merge(df1, on='apples', how='left').ix[:, 'oranges_x':].diff(axis=1)['oranges_y'].astype(int)
df

Out[159]:
   apples  oranges  difference
0       2        8           3
1       4        5           2
2       1        9           7
3       5        4           3
4       3        2          -4
5       2        6           1
6       1        7           5
7       1        5           3
8       4        1          -2
9       5        3           2

breaking the above down:

In [162]:
df.merge(df1, on='apples', how='left')

Out[162]:
   apples  oranges_x  oranges_y
0       2          8          5
1       4          5          3
2       1          9          2
3       5          4          1
4       3          2          6
5       2          6          5
6       1          7          2
7       1          5          2
8       4          1          3
9       5          3          1

In [163]:
df.merge(df1, on='apples', how='left').ix[:, 'oranges_x':].diff(axis=1)

Out[163]:
   oranges_x  oranges_y
0        NaN       -3.0
1        NaN       -2.0
2        NaN       -7.0
3        NaN       -3.0
4        NaN        4.0
5        NaN       -1.0
6        NaN       -5.0
7        NaN       -3.0
8        NaN        2.0
9        NaN       -2.0

In [164]:
-df.merge(df1, on='apples', how='left').ix[:, 'oranges_x':].diff(axis=1)['oranges_y'].astype(int)

Out[164]:
0    3
1    2
2    7
3    3
4   -4
5    1
6    5
7    3
8   -2
9    2
Name: oranges_y, dtype: int32

If your real data has more columns you can do this in separate steps if the column order is different from your example:

In [170]:
merged = df.merge(df1, on='apples', how='left')
merged['difference'] = merged['oranges_x'] - merged['oranges_y']
merged

Out[170]:
   apples  oranges_x  oranges_y  difference
0       2          8          5           3
1       4          5          3           2
2       1          9          2           7
3       5          4          1           3
4       3          2          6          -4
5       2          6          5           1
6       1          7          2           5
7       1          5          2           3
8       4          1          3          -2
9       5          3          1           2

So after merging, drop the extraneous columns and rename back:

In [171]:    
merged = merged.drop('oranges_y', axis=1).rename(columns={'oranges_x':'oranges'})
merged

Out[171]:
   apples  oranges  difference
0       2        8           3
1       4        5           2
2       1        9           7
3       5        4           3
4       3        2          -4
5       2        6           1
6       1        7           5
7       1        5           3
8       4        1          -2
9       5        3           2

Hi, It does work. But for my data the difference for all columns is nan. I can , however, see that there are matching values in the 2 dataframes. The 2 for loop method works and gives results. — Arun Krishnan, Feb 02 '17 at 13:30
I can't comment on your real data unless you post it, we can only answer what is put in front of us and what we can reproduce. — EdChum, Feb 02 '17 at 13:36
Sorry, I definitely understand that. But the original data is too big and I have already carried out a number of filtering and analysis to arrive at the two data frames. I will see if I can find the fault or repost the question. Thanks — Arun Krishnan, Feb 02 '17 at 13:53
Check the `dtypes` of the dfs, also you should do this on a small sample of data to see if there is something fundamentally wrong — EdChum, Feb 02 '17 at 13:56
I found the problem. When I filtered the data from the raw_data and created the dataframe 'df', I did not reindex from 0. The error was due to that.Thank you — Arun Krishnan, Feb 02 '17 at 14:47

Identifying similar values by iterating between two Pandas dataframes.

1 Answers1