
It's convenient to chain filters on a DataFrame using query:

# quoting from the SO answer above
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(30, 3), columns=['a', 'b', 'c'])
df_filtered = df.query('a > 0').query('0 < b < 2')

What if I need to do the same with a Series?

df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': [1, 2, 3, 4, 5, 6]})
# hypothetical syntax: '?' standing in for the Series values
df.groupby('a').b.sum().query('? > 3').query('? % 3 == 1')

Series.query doesn't exist (for a good reason: most of the query syntax exists to allow access to multiple columns).
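One workaround keeps everything a Series: pandas (0.18.1+) accepts a callable inside [], and each lambda receives the intermediate result, playing the role of the '?' placeholder above. A minimal sketch of my own, not part of the original question:

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': [1, 2, 3, 4, 5, 6]})

# Each callable is passed the intermediate Series and must return
# a boolean mask, so the chain reads like the wished-for query('?').
result = (df.groupby('a').b.sum()
            [lambda s: s > 3]
            [lambda s: s % 3 == 1])
print(result)  # a=1 -> 7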

1 Answer


You can use the to_frame() method:

In [10]: df.groupby('a').b.sum().to_frame('v').query('v > 3').query('v % 3 == 1')
Out[10]:
   v
a
1  7

If you need the result as a Series:

In [12]: df.groupby('a').b.sum().to_frame('v').query('v > 3').query('v % 3 == 1').v
Out[12]:
a
1    7
Name: v, dtype: int64

A follow-up question from the comments: does to_frame() involve copying of the Series?

It involves a call to the DataFrame constructor:

https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L1140:

df = self._constructor_expanddim({name: self})

https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L265:

@property
def _constructor_expanddim(self):
    from pandas.core.frame import DataFrame
    return DataFrame
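
The 600K-row test frame used for the timings below isn't shown in the post; here's a plausible way to build one. The column names 'a' and 'b' match the earlier examples, but the sizes and value ranges are assumptions of mine:

import numpy as np
import pandas as pd

# Assumed benchmark frame: 600K rows, an integer grouping column 'a'
# and a numeric value column 'b' (row count and cardinality are guesses).
np.random.seed(0)
df = pd.DataFrame({'a': np.random.randint(0, 100, size=600000),
                   'b': np.random.randn(600000)})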

Performance impact (testing against a 600K-row DF):

In [66]: %timeit df.groupby('a').b.sum()
10 loops, best of 3: 46.2 ms per loop

In [67]: %timeit df.groupby('a').b.sum().to_frame('v')
10 loops, best of 3: 49.7 ms per loop

In [68]: 49.7 / 46.2
Out[68]: 1.0757575757575757

Performance impact (testing against a 6M-row DF):

In [69]: df = pd.concat([df] * 10, ignore_index=True)

In [70]: df.shape
Out[70]: (6000000, 2)

In [71]: %timeit df.groupby('a').b.sum()
1 loop, best of 3: 474 ms per loop

In [72]: %timeit df.groupby('a').b.sum().to_frame('v')
1 loop, best of 3: 464 ms per loop

Performance impact (testing against a 60M-row DF):

In [73]: df = pd.concat([df] * 10, ignore_index=True)

In [74]: df.shape
Out[74]: (60000000, 2)

In [75]: %timeit df.groupby('a').b.sum()
1 loop, best of 3: 4.28 s per loop

In [76]: %timeit df.groupby('a').b.sum().to_frame('v')
1 loop, best of 3: 4.3 s per loop

In [77]: 4.3 / 4.28
Out[77]: 1.0046728971962615

Conclusion: the overhead of to_frame() seems negligible: roughly 8% on 600K rows and within measurement noise at 6M and 60M rows.

  • I would need to convert it back to a Series right after to use Series-specific operations, but I guess this round-trip conversion is unavoidable. – max Aug 17 '16 at 19:08
  • @max, yeah, that's pretty easy - I've added it to my answer. – MaxU - stand with Ukraine Aug 17 '16 at 19:10
  • Yes, but I was wondering about the performance impact more than about the extra code. Does to_frame() involve copying of the Series? – max Aug 17 '16 at 20:31
  • @max, I've added timings for differently sized DFs - please check. – MaxU - stand with Ukraine Aug 17 '16 at 20:50
  • Thanks. Also, I just realized that it doesn't copy the Series (we can verify it by changing an item in it, as sketched below). But even if it did, it's actually kinda fast... (0.25 sec for 100 million integers, 0.7 sec for 100 million strings). – max Aug 17 '16 at 22:40
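
A sketch of the check described in the last comment (my own illustration, not from the thread): inspect whether to_frame() shares the underlying buffer, then mutate the frame and see whether the original Series changes. Copy behavior has varied across pandas versions, and Copy-on-Write in pandas 2.x+ changes the outcome, so run this rather than assuming it:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], name='v')
f = s.to_frame()

# Do the Series and the frame column view the same memory?
print(np.shares_memory(s.to_numpy(), f['v'].to_numpy()))

# The mutation test from the comment: change the frame, then check
# whether the original Series changed too.
f.loc[0, 'v'] = 99
print(s.iloc[0])  # 99 if the data was shared; 1 if to_frame() copied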