Alternatives to awkward Pandas/Python Dataframe Indexing: df_REPEATED[df_REPEATED['var']>0]?

Question

In Pandas/Python, I have to write the dataframe name twice when conditioning on its own variable:

df_REPEATED[df_REPEATED['var']>0]

This happens so many times it seems unreasonable. 90-99% of users would be happy 95% of the time with something like:

df_REPEATED[['var']>0]

This syntax is also necessary using .loc[]. Is there any alternative or shortcut to writing this?

On the other hand, is there some use case I don't understand, and has my education in Python simply been woefully insufficient?

Pietro Battiston · Accepted Answer · 2018-03-22T14:40:41.570

Not an official answer... but it already made my life simpler recently:

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don't need to download the entire repo: saving the file and doing

from where import Where as W

should suffice. Then you use it like this:

import pandas as pd

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

EDIT: this answer mentions an analogous approach not requiring external components, resulting in:

data = (pd.read_csv('ugly_db.csv')
          .loc[lambda df : ~(df == '$null$').any(axis=1)])

and another possibility is to use .pipe(), as in

data = (pd.read_csv('ugly_db.csv')
          .pipe(lambda df : df[~(df == '$null$').any(axis=1)]))
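For anyone who wants to try the two built-in variants without a CSV file, here is a minimal sketch. The small DataFrame and its column names are made up for illustration and stand in for ugly_db.csv; only standard pandas calls are used.

```python
import pandas as pd

# Hypothetical stand-in for 'ugly_db.csv': one row contains the '$null$' marker.
raw = pd.DataFrame({'name': ['x', '$null$', 'z'],
                    'value': ['1', '2', '3']})

# A callable passed to .loc is called with the DataFrame being indexed
# and should return a boolean mask.
via_loc = raw.loc[lambda df: ~(df == '$null$').any(axis=1)]

# .pipe passes the DataFrame to the function and returns whatever the
# function returns, so the function has to do the filtering itself.
via_pipe = raw.pipe(lambda df: df[~(df == '$null$').any(axis=1)])

print(via_loc)
print(via_loc.equals(via_pipe))
```

In both cases the DataFrame name is written only once, which is the point of the original question.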

score 1 · Answer 2 · answered Jul 04 '17 at 19:12

df_REPEATED['var'] > 0 is a boolean array. Other than its length, it has no connection to the DataFrame. It could have been the result of another expression, say another_df['another_var'] > some_other_value, as long as the lengths match. So it offers flexibility. If the syntax was like the one you suggested, we couldn't do this. However, there are alternatives to what you are asking. For example,

df_REPEATED.query('var > 0')

query can be very fast if the DataFrame is large, and it is less verbose, but it lacks the advantages of boolean indexing, and you start running into trouble if the expression gets complicated.
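To make the comparison concrete, here is a small sketch showing that query and boolean indexing select the same rows, and that a mask can come from a different object. The data and the names threshold and other are invented for illustration.

```python
import pandas as pd

df_REPEATED = pd.DataFrame({'var': [-1, 0, 2, 5]})

# Boolean indexing and query select the same rows.
by_mask = df_REPEATED[df_REPEATED['var'] > 0]
by_query = df_REPEATED.query('var > 0')
print(by_mask.equals(by_query))

# query can refer to local Python variables with the @ prefix.
threshold = 1  # hypothetical local variable
above = df_REPEATED.query('var > @threshold')
print(above)

# The mask may come from another DataFrame, here one with the same index.
other = pd.DataFrame({'flag': [10, 20, 0, 30]})  # hypothetical second frame
from_other = df_REPEATED[other['flag'] > 5]
print(from_other)
```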

answered Jul 04 '17 at 19:12

ayhan


Ok, I understand the object being returned by df_REPEATED['var'] > 0 and the potential flexibility of boolean indexing, but this isn't needed most of the time. Is `df_REPEATED.query('var > 0')` the best we can do? – Courtney Kristensen Jul 04 '17 at 19:13
As far as I know, yes. There are also lambda expressions, `df_REPEATED[lambda x: x['var'] > 0]` (requires pandas 0.18.1 or later), but I wouldn't say that is better. It becomes useful when you have a long name for the DataFrame and you need to use it several times while indexing. – ayhan Jul 04 '17 at 19:18
Because of the way Python syntax works, `['var']>0` is evaluated -- and in modern Python will fail with a TypeError -- before pandas even sees it. Using a string argument is one way pandas gets around this limitation. – DSM Jul 04 '17 at 19:25
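
The point in the last comment can be checked in plain Python 3, with no pandas involved. A minimal sketch:

```python
# Python evaluates the expression inside the brackets first, so pandas
# would only ever receive its result, never the expression itself.
try:
    result = ['var'] > 0  # compares a list with an int
except TypeError as exc:
    result = None
    print('TypeError:', exc)
```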
