
In some cases, accessing a pandas DataFrame does not raise an exception even when the requested column labels do not exist.

How should I check for these cases, so that I don't read wrong results?

import numpy as np
import pandas as pd

a = pd.DataFrame(np.zeros((5, 2)), columns=['la', 'lb'])

a
Out[349]: 
    la   lb
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0

a.loc[:, 'lc']  # Raises KeyError, as expected.

a.loc[:, ['la', 'lb', 'lc']]  # Not expected.
Out[353]: 
    la   lb  lc
0  0.0  0.0 NaN
1  0.0  0.0 NaN
2  0.0  0.0 NaN
3  0.0  0.0 NaN
4  0.0  0.0 NaN

a.loc[:, ['la', 'wrong_lb', 'lc']]  # Not expected.
Out[354]: 
    la  wrong_lb  lc
0  0.0       NaN NaN
1  0.0       NaN NaN
2  0.0       NaN NaN
3  0.0       NaN NaN
4  0.0       NaN NaN

Update: There is a suggested duplicate question (Safe label-based selection in DataFrame), but it is about row selection; my question is about column selection.

THN
  • Possible duplicate of [Safe label-based selection in DataFrame](http://stackoverflow.com/questions/40204834/safe-label-based-selection-in-dataframe) – juanpa.arrivillaga Mar 08 '17 at 10:39
  • I didn't see that question, but it's about row selection, my question is about column selection. – THN Mar 08 '17 at 10:47
  • It's about label-based selection using `loc`, the principles are the exact same. – juanpa.arrivillaga Mar 08 '17 at 16:56
  • I didn't know whether they are the same or not; shouldn't this information go into the explanation in an answer? Moreover, the accepted answer here is different from the one there. Actually, I hoped that there would be a pandas feature solving my issue, but manually filtering the columns is the way to go, as pointed out in the answer. – THN Mar 09 '17 at 08:07

1 Answer


It looks like, because at least one of the columns exists, `loc` returns an enlarged DataFrame, as if it were a reindex operation.

You could define a helper function that validates the columns and copes with labels that do not exist. Here I construct a pandas Index object from the passed-in iterable and call intersection to keep only the labels that appear both in the existing DataFrame's columns and in the passed-in iterable:
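To make the reindex comparison concrete, here is a minimal sketch that asks for the same result explicitly with `DataFrame.reindex`. It assumes the DataFrame `a` from the question. As far as I know, more recent pandas releases (1.0 and later) raise a `KeyError` when a list passed to `loc` contains missing labels, so `reindex` is the explicit way to get the NaN-filled columns there.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the question
a = pd.DataFrame(np.zeros((5, 2)), columns=['la', 'lb'])

# reindex keeps the existing columns and creates any missing ones,
# filling them with NaN
enlarged = a.reindex(columns=['la', 'lb', 'lc'])
print(enlarged)
```

The point is only that the NaN column is a consequence of reindexing, not a sign that `lc` ever held data.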

In [80]:
def val_cols(cols):
    return pd.Index(cols).intersection(a.columns)

a.loc[:, val_cols(['la', 'lb', 'lc'])]

Out[80]:
    la   lb
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0
3  0.0  0.0
4  0.0  0.0

This also handles completely missing columns:

In [81]:
a.loc[:, val_cols(['x', 'y'])] 

Out[81]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

This also handles your latter case:

In [83]:
a.loc[:, val_cols(['la', 'wrong_lb', 'lc'])]

Out[83]:
    la
0  0.0
1  0.0
2  0.0
3  0.0
4  0.0
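If you would rather not depend on the global `a`, a variant that takes the DataFrame as a parameter might look like the sketch below. The name `existing_cols` is hypothetical, and it swaps `Index.intersection` for a plain list comprehension so that the labels come back in the order they were requested.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.zeros((5, 2)), columns=['la', 'lb'])

def existing_cols(df, cols):
    """Return the labels from cols that are columns of df, in request order.

    Hypothetical helper; not part of pandas.
    """
    return [c for c in cols if c in df.columns]

# Missing labels ('lc') are silently dropped, existing ones keep their order
subset = a.loc[:, existing_cols(a, ['lc', 'lb', 'la'])]
print(subset.columns.tolist())
```

Like `val_cols`, this drops unknown labels silently, so it will not help you spot a typo.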

Update

If you just want to test whether all the labels are valid, you can iterate over each column in the list and collect the ones that raise a KeyError:

In [93]:
def val_cols(cols):
    # Collect every requested label that is not a column of `a`
    duff = []
    for col in cols:
        try:
            a[col]
        except KeyError:
            duff.append(col)
    return duff

invalid = val_cols(['la', 'x', 'y'])
print(invalid)

['x', 'y']
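To catch a typo such as 'wrong_lb' at selection time, the check and the selection can be combined into one strict helper. This is only a sketch: `select_strict` is a hypothetical name, and raising a `KeyError` that lists the missing labels is my assumption about what is most useful here.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.zeros((5, 2)), columns=['la', 'lb'])

def select_strict(df, cols):
    """Select cols from df, raising KeyError if any label is missing.

    Hypothetical helper; not part of pandas.
    """
    cols = list(cols)
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df.loc[:, cols]

try:
    select_strict(a, ['la', 'wrong_lb', 'lc'])
except KeyError as err:
    # The message names every label that could not be found
    print(err)
```

Failing loudly keeps a mistyped label from turning into a column of NaN further down the pipeline.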
EdChum
  • Thanks. What if I want to catch when the column names do not exist? For example, 'wrong_lb' is a typo and I want to catch and fix that? – THN Mar 08 '17 at 10:42
  • You can just iterate over each column and then catch the `KeyError` exception, see updated answer – EdChum Mar 08 '17 at 10:47
  • Thanks, so I will try to do something in the helper method val_cols(). – THN Mar 08 '17 at 10:49
  • Yes, this behaviour is by design, but crafting your own func to do something different is not too difficult, as I've shown – EdChum Mar 08 '17 at 10:49