0

I'd like to extract those rows from df1 which are not existent in df2 (identity is the index). For the below example, I would expect the first row in df1 to be returned. Unfortunately, the result is empty.

import pandas as pd

df1 = pd.DataFrame({
    'level-0': ['a', 'a', 'a', 'a', 'a', 'a'],
    'level-1': ['s2', 's2', 's2', 's2', 's2', 's2'],
    'level-2': ['1', '1', '1', '1', '1', '1'],
    'level-3': ['19', '20', '21', '22', '23', '24'],
    'level-4': ['HRB', 'HRB', 'HRB', 'HRB', 'HRB', 'HRB'],
    'name': ['a', 'b', 'c', 'd', 'e', 'f']
})

df1 = df1.set_index(['level-0', 'level-1', 'level-2', 'level-3', 'level-4'], drop=False)

df2 = pd.DataFrame({
    'level-0': ['a', 'a', 'a', 'a', 'a', 'b'],
    'level-1': ['s2', 's2', 's2', 's2', 's2', 's2'],
    'level-2': ['1', '1', '1', '1', '1', '1'],
    'level-3': ['19', '20', '21', '22', '23', '24'],
    'level-4': ['HRB', 'HRB', 'HRB', 'HRB', 'HRB', 'HRB']
})
df2 = df2.set_index(['level-0', 'level-1', 'level-2', 'level-3', 'level-4'], drop=False)

# all indices that are in df1 but not in df2
df_unknown = df1[~df1.index.isin(df2.index)]
print df_unknown

What's wrong with the selection?

Update

I figured out what went wrong. The dataframes were read from an Excel file and some Series were interpreted as int, while the dataframe to compare with had its columns already converted to str. This resulted in different indices.

orange
  • 7,755
  • 14
  • 75
  • 139

1 Answers1

1

set_index is not in place by default, so df1 and df2 still have their numeric index after the call. Do either

df2.set_index(..., inplace=True)

or

df2 = df2.set_index(...)

You will see that by far the most methods in pandas work that way.

filmor
  • 30,840
  • 6
  • 50
  • 48
  • Strange. I have `df = df.set_index()` in my actual code and got it wrong in the minimal example. Not sure why the actual code doesn't work... – orange Sep 20 '14 at 09:28