Pandas getting all rows listed in one dataframe but not the other UNORDERED

Question

I cannot find an easy way to get all the rows that are found in one dataframe but not in a second dataframe when the data is unordered.

These two answers are solutions for ordered data:

Get rows that are present in one dataframe, but not the other

pandas get rows which are NOT in other dataframe

So just to make it clear I'm trying to get this:

In one of the related questions mentioned above I found a multi-indexing solution that supposedly works with unordered data, but I was unable to implement it. I am hoping there's an easier way.

Let me give you an example of the data I'm working with:

DF1
col_a   col_b
1325    foo
1397    foo        #<---matching value, but not matching index in DF2
1645    foo
...     ...

DF2
col_1   col_2
1397    foo        #<---matching value, but not matching index in DF1
1500    foo
1621    foo
...     ...

Now if that were all the data in both dataframes the result for processing this specifically for DF1 would look like this:

DF1_UNIQUE
col_a   col_b
1325    foo
1645    foo

(So I really only care about col_a, or col_1 for DF2.) Notice it's missing the 1397 row. That's because it is found in DF2, so I don't want it returned in my new DF. But it's not found at the same index, and therein lies my problem. I already created a solution for when all the matching indexes line up, but I don't know where to start when they don't. Can I use the merge function? Or is that the wrong tool for this job?

This code isn't entirely relevant, but it's the solution I came up with for when all the indexes line up correctly:

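import sys

import pandas as pd


def getUniqueEntries(df1, df2):
    """Takes two dataframes and returns a dataframe comprised of all the rows unique to the first dataframe."""
    d1columns = df1.columns
    d2columns = df2.columns
    # Inner merge on the first column of each frame finds the shared values.
    df3 = pd.merge(df1, df2, left_on=d1columns[0], right_on=d2columns[0])
    print(df3)
    # Keep only the rows of df1 whose first-column value is not among the shared values.
    return df1[~df1[d1columns[0]].isin(df3[d1columns[0]])]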

def main(fileread1, fileread2, writeprefix):
    df1 = pd.read_csv(fileread1)
    df2 = pd.read_csv(fileread2)

    df3 = getUniqueEntries(df1, df2)
    df4 = getUniqueEntries(df2, df1)
    
    print(df3)
    print(df4)

    df3.to_csv(writeprefix+fileread1, index=False)
    df4.to_csv(writeprefix+fileread2, index=False)
    
if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2], sys.argv[3])

score 5 · Accepted Answer · answered Aug 30 '16 at 22:05

Yes, you can use merge with the indicator parameter:

I renamed the columns to avoid duplicated columns. You can also pass left_on and right_on instead.

merged = DF1.merge(DF2.rename(columns={'col_1': 'col_a', 'col_2': 'col_b'}), how='left', indicator=True)
merged
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
1   1397   foo       both
2   1645   foo  left_only

Now, you can filter merged using the indicator column:

merged[merged['_merge']=='left_only']
Out: 
   col_a col_b     _merge
0   1325   foo  left_only
2   1645   foo  left_only
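The `left_on`/`right_on` variant mentioned above might look something like the following. This is only a minimal sketch, assuming the key columns are `col_a` in DF1 and `col_1` in DF2 as in the question; the helper name `rows_only_in_left` is made up for illustration and is not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({"col_a": [1325, 1397, 1645], "col_b": ["foo", "foo", "foo"]})
df2 = pd.DataFrame({"col_1": [1397, 1500, 1621], "col_2": ["foo", "foo", "foo"]})


def rows_only_in_left(left, right, left_key, right_key):
    """Hypothetical helper: rows of `left` whose key value never appears in `right`."""
    merged = left.merge(
        # Only the key column is needed from the right frame; dropping
        # duplicate keys keeps matched left rows from being repeated.
        right[[right_key]].drop_duplicates(),
        how="left",
        left_on=left_key,
        right_on=right_key,
        indicator=True,
    )
    # "left_only" marks rows that found no partner in the right frame.
    only_left = merged.loc[merged["_merge"] == "left_only", left.columns]
    return only_left.reset_index(drop=True)


df1_unique = rows_only_in_left(df1, df2, "col_a", "col_1")
df2_unique = rows_only_in_left(df2, df1, "col_1", "col_a")
print(df1_unique)
print(df2_unique)
```

The merge matches on the values in the key columns, so the row order and index of either frame should not matter.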

answered Aug 30 '16 at 22:05

ayhan

Nice, I didn't know about `indicator` parameter – MaxU - stand with Ukraine Aug 30 '16 at 22:18
thanks everyone who answered! This answer is the easiest for me to understand, I'm 1-day new to pandas, thanks again! – MetaStack Aug 30 '16 at 22:37

Alexander · Answer 2 · 2016-08-30T22:12:19.563

3

This uses boolean indexing to locate all of the rows in df1 where the values in col_a are NOT (~) in col_a of df2. It uses isin() to locate matching rows, and the negation operator (~) to find the opposite of those (i.e. the ones that don't match).

df1[~df1.col_a.isin(df2.col_a)]

You mentioned an index, but your sample data does not have one. The matching is thus done only on the values in col_a per your example.
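To illustrate that `Series.isin` compares values and does not align on the index, here is a small sketch. The index labels and the row order of `df2` are invented for the demonstration, and the second frame uses `col_1` as its key column, as in the question's sample data.

```python
import pandas as pd

# Same values as the question, but with deliberately mismatched
# index labels and a shuffled row order in df2 (assumed for this demo).
df1 = pd.DataFrame(
    {"col_a": [1325, 1397, 1645], "col_b": ["foo"] * 3}, index=[10, 11, 12]
)
df2 = pd.DataFrame(
    {"col_1": [1500, 1621, 1397], "col_2": ["foo"] * 3}, index=[7, 3, 99]
)

# True where the value in col_a occurs anywhere in df2's col_1.
mask = df1["col_a"].isin(df2["col_1"])
df1_unique = df1[~mask]
print(df1_unique)

# If one side holds strings and the other integers, nothing matches.
mixed = df1["col_a"].isin(df2["col_1"].astype(str))
print(mixed.any())
```

If this approach returns every row, one possible cause (an assumption, since the real data is not shown) is that the two key columns have different dtypes, for example strings on one side and integers on the other; comparing `df1.dtypes` with `df2.dtypes` would show that.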

edited Aug 30 '16 at 22:12

answered Aug 30 '16 at 21:55

Alexander

I'll give it a try, but isn't the `isin` matching the same indexes between the two columns? I was under that impression. – MetaStack Aug 30 '16 at 22:00
I may have implemented it wrong, but this solution didn't work. it returned the entire list, and didn't detect any matches (I believe this is due to the fact that all the matches are in different indexes, as I have mentioned). – MetaStack Aug 30 '16 at 22:05
Could you provide a better sample of your dataframe? e.g. `df1.to_dict('list')` You mention an index, but your sample does not have one. Also, is it possible to have the same value in `col_a` but different values in `col_b`? If so, how do you want to handle that? – Alexander Aug 30 '16 at 22:06

MaxU - stand with Ukraine · Answer 3 · 2016-08-30T22:36:13.143

Here is a pandas equivalent of SQL's (Oracle's) MINUS operation:

select col1, col2 from tab1
minus
select col1, col2 from tab2

In pandas:

In [59]: df1[~df1.isin(pd.DataFrame(df2.values, columns=df1.columns).to_dict('list')).all(1)]
Out[59]:
   col_a col_b
0   1325   foo
2   1645   foo

Explanation:

In [60]: pd.DataFrame(df2.values, columns=df1.columns)
Out[60]:
  col_a col_b
0  1397   foo
1  1500   foo
2  1621   foo

In [61]: pd.DataFrame(df2.values, columns=df1.columns).to_dict('list')
Out[61]: {'col_a': [1397, 1500, 1621], 'col_b': ['foo', 'foo', 'foo']}

In [62]: df1.isin(pd.DataFrame(df2.values, columns=df1.columns).to_dict('list'))
Out[62]:
   col_a col_b
0  False  True
1   True  True
2  False  True

In [63]: df1.isin(pd.DataFrame(df2.values, columns=df1.columns).to_dict('list')).all(1)
Out[63]:
0    False
1     True
2    False
dtype: bool
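One thing to keep in mind with the dict form of `isin` is that each column is checked against its own list independently, so a row can be flagged as present even when no single row of the other frame matches it in full. The sketch below uses made-up sample data (not from the question) to show the difference from a row-wise comparison done with `merge` and `indicator=True`.

```python
import pandas as pd

# Made-up data: (1397, "bar") is not a row of df2, but 1397 and "bar"
# each occur somewhere in the corresponding df2 column.
df1 = pd.DataFrame({"col_a": [1325, 1397], "col_b": ["foo", "bar"]})
df2 = pd.DataFrame({"col_1": [1397, 1500], "col_2": ["foo", "bar"]})

# Give df2 the same column names as df1 so the two frames line up.
df2_aligned = df2.rename(columns=dict(zip(df2.columns, df1.columns)))

# Column-wise membership: each column is tested on its own.
colwise_mask = df1.isin(df2_aligned.to_dict("list")).all(axis=1)
colwise_result = df1[~colwise_mask]

# Row-wise comparison: merge on all shared columns and keep left-only rows.
merged = df1.merge(df2_aligned.drop_duplicates(), how="left", indicator=True)
rowwise_result = merged.loc[merged["_merge"] == "left_only", df1.columns]

print(colwise_result)  # drops (1397, "bar")
print(rowwise_result)  # keeps (1397, "bar")
```

For data like the question's, where only `col_a` matters and `col_b` is constant, both approaches give the same rows.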
