I have a pandas dataframe with 10 columns. If I try to access a column that is not present, it still returns a column of NaN for it. I was expecting a KeyError. Why does pandas not flag the missing column?

In the example below, vendor_id is a valid column in the dataframe. The other column is absent from the dataset.

final_feature.ix[:,['vendor_id','this column is absent']]
Out[1017]: 
  vendor_id  this column is absent
0    434236                    NaN

type(final_feature)
Out[1016]: pandas.core.frame.DataFrame

EDIT 1: I validated that there are no null values in the data:

print (final_feature1.isnull().values.any())
ForeverLearner

2 Answers

For me, selecting a subset with [] works and raises the expected error:

final_feature[['vendor_id','this column is absent']]

KeyError: "['this column is absent'] not in index"

Also, ix is deprecated in the latest version of pandas (0.20.1); check here.

jezrael

This is expected behaviour and is due to the "setting with enlargement" feature.

In [15]:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df.ix[:,['a','d']]

Out[15]:
          a   d
0 -1.164349 NaN
1  0.400116 NaN
2 -0.599496 NaN
3  0.186837 NaN
4  0.385656 NaN

If you try df['d'] or df[['a','d']], you will get a KeyError.

Effectively, what you're doing is reindexing. The fact that the column doesn't exist doesn't matter when using ix; you'll just get a column of NaNs.
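
To make the reindexing point concrete, here is a minimal sketch that asks for the same result explicitly with DataFrame.reindex. The frame mirrors the one above, and the name reindexed is only for illustration. As far as I know, ix was removed in pandas 1.0 and loc raises a KeyError for missing labels from that release on, so reindex is the way to get the NaN column on purpose:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))

# reindex conforms the frame to the requested labels; a label that does
# not exist ('d' here) becomes a new column filled with NaN, no error.
reindexed = df.reindex(columns=["a", "d"])

print(reindexed.columns.tolist())   # ['a', 'd']
print(reindexed["d"].isna().all())  # True
```

The original df is left untouched; reindex returns a new frame.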

Same behaviour is observed using loc:

In [24]:
df.loc[:,['a','d']]

Out[24]:
          a   d
0 -1.164349 NaN
1  0.400116 NaN
2 -0.599496 NaN
3  0.186837 NaN
4  0.385656 NaN

When you don't use ix or loc and instead do df['d'], you're indexing a specific column or list of columns. There is no expectation of enlargement here unless you are assigning to a new column, e.g. df['d'] = some_new_vals
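
A small sketch of that contrast, assuming the same kind of frame as above (the flag raised and the values assigned to 'd' are made up for the example): reading a missing column with [] fails, while assigning to it enlarges the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list("abc"))

# Reading with [] is a strict lookup: a missing label raises KeyError.
try:
    df[["a", "d"]]
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Assigning with [] is different: the frame is enlarged with a new column.
df["d"] = np.arange(len(df))
print(df.columns.tolist())  # ['a', 'b', 'c', 'd']
```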

To guard against silently getting NaN columns, you can validate your list using isin with the columns:

In [26]:
valid_cols = df.columns.isin(['a','d'])
df.ix[:, valid_cols]

Out[26]:
          a
0 -1.164349
1  0.400116
2 -0.599496
3  0.186837
4  0.385656

Now you will only see columns that exist. It also guards against any column names you have misspelt.
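
The same check can be wrapped in a small helper that also reports which names were missing, so a typo does not pass unnoticed. This is only a sketch: select_existing is a hypothetical name, not a pandas function, and the column other is invented for the example.

```python
import pandas as pd

def select_existing(df, wanted):
    """Return the requested columns that exist, plus the names that do not."""
    wanted = list(wanted)
    missing = [c for c in wanted if c not in df.columns]
    mask = df.columns.isin(wanted)  # boolean mask aligned with df.columns
    return df.loc[:, mask], missing

# 'other' is a made-up column used only to show that it gets filtered out.
df = pd.DataFrame({"vendor_id": [434236], "other": [1]})
subset, missing = select_existing(df, ["vendor_id", "this column is absent"])

print(subset.columns.tolist())  # ['vendor_id']
print(missing)                  # ['this column is absent']
```

Note that the result keeps the column order of df rather than the order of the requested list, because the mask is built against df.columns.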

EdChum
  • Thank you so much. Do you suggest removing all instances of .ix from the code? A spelling mistake is how I ran into this issue – ForeverLearner May 11 '17 at 09:50
  • It will keep working until some future version; from version 0.20.1 it has been marked for deprecation, but it still works. The [docs](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0200-api-breaking-deprecate-ix) show how to achieve the same behaviour. The NaN behaviour will still happen with `loc`, as I've demonstrated, but using `isin` against your existing columns will protect against it. – EdChum May 11 '17 at 09:53
  • Thanks. I will keep that check in place. – ForeverLearner May 11 '17 at 09:57