0

Say I have a dataframe in Pandas consisting of several columns. I would like to select the subset of the dataframe where all columns are between two values min and max.

How can I do this using query? Do I need to specify every column name one by one in the expression e.g.:

df.query('(A<{max} & A>{min}) & (B{min} & B{max}) & (C{min} & C{max})'.format(min,max))

Aside from query, what other alternatives do I have? Perhaps working with the indices where the condition is met directly?

Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564
  • Since your question is pretty value, an answer will have to be very vague. Show us what exactly you'd write with `query` specifying every column name one by one (and with example data, too), and someone can hopefully show you how to write it without doing so. – abarnert Aug 22 '14 at 21:40

2 Answers2

4

For your use case I'd do it the non-query way, constructing a boolean frame and then calling the .all method:

>>> minval, maxval = 20, 80
>>> df = pd.DataFrame(np.random.randint(0, 100, (20,6)))
>>> df[((df > minval) & (df < maxval)).all(axis=1)]
     0   1   2   3   4   5
2   74  30  30  76  31  66
10  49  39  71  43  30  50

I don't think the advantage of manually constructing an expanded query would be worth it.

DSM
  • 342,061
  • 65
  • 592
  • 494
3

I'm not sure why you think you need query here. But if you want to do it this way, you can. You're already building up the query string dynamically with that format statement; you just need to make it more dynamic, by joining a comprehension. For example:

qs = ' & '.join('{col}<{max} & {col}>{min}'.format(col, min, max)
                for col in df.columns)
df.query(qs)

I would personally do this the way @DSM suggested instead. Unless you actually need the query string for some reason (maybe to print it out?), building a string to eval it as an expression, instead of building an expression, is usually the wrong answer.

abarnert
  • 354,177
  • 51
  • 601
  • 671