
I have a DataFrame that looks like this:

| Age | Married | OwnsHouse |
|-----|---------|-----------|
| 23  | True    | False     |
| 35  | True    | True      |
| 14  | False   | False     |
| 27  | True    | True      |

I want to find the highest age of anyone who is married and owns a house. The answer here would be 35. My first thought was to do:

df_subset = df[df['Married'] == True and df['OwnsHouse'] == True]
max_age = df_subset.max()

However, the dataset is big (50MB) and I fear this will be computationally expensive, as it goes through the dataset twice.

My second thought was to do:

max_age = 0
for index, row in df.iterrows():
    if row['Married'] and row['OwnsHouse'] and row['Age'] > max_age:
        max_age = row['Age']

Is there a faster way of doing this?

0xPrateek
Anna
  • Your first thought is the way to go. 50MB is, in fact, tiny. The 2nd approach is not recommended. You can do `df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max()`. – Quang Hoang Jun 26 '19 at 23:26
  • Have you checked the performance of both methods? – 0xPrateek Jun 26 '19 at 23:29
  • `df.iterrows` is an antipattern in pandas, it will generally always be worse-performance than any vectorized method, or logical indexing. – smci Jun 26 '19 at 23:51
  • Declaring a 50MB intermediate result `df_subset = df[df['Married'] == True and df['OwnsHouse'] == True]` is unnecessary and wastes both CPU and memory for no reason. You should chain calls, as @QuangHoang shows. – smci Jun 26 '19 at 23:52

1 Answer


Your first approach is solid, but here is a simple option:

df[df['Married'] & df['OwnsHouse']].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

Or, just the age:

df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max()
# 35
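As a concrete check, here is a self-contained sketch that reconstructs the small DataFrame from the question (values taken from the table above) and applies the one-liner:

```python
import pandas as pd

# Reconstruct the sample DataFrame shown in the question
df = pd.DataFrame({
    'Age': [23, 35, 14, 27],
    'Married': [True, True, False, True],
    'OwnsHouse': [False, True, False, True],
})

# Boolean-mask the rows once, then take the max of 'Age' only
max_age = df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max()
print(max_age)  # 35
```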

If you have multiple boolean columns, I'd suggest something a bit more scalable,

df[df[['Married', 'OwnsHouse']].all(axis=1)].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

Where,

df[['Married', 'OwnsHouse']].all(axis=1)

0    False
1     True
2    False
3     True
dtype: bool

Which is the same as,

df['Married'] & df['OwnsHouse']

0    False
1     True
2    False
3     True
dtype: bool

But instead of manually ANDing N boolean masks together, let `all` do it for you.
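For example, with a hypothetical third flag column `HasCar` (not in the original data, added here just for illustration), the same pattern needs no extra `&` terms:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [23, 35, 14, 27],
    'Married': [True, True, False, True],
    'OwnsHouse': [False, True, False, True],
    'HasCar': [True, True, True, False],  # hypothetical extra flag column
})

# One mask covers any number of boolean columns
flags = ['Married', 'OwnsHouse', 'HasCar']
max_age = df.loc[df[flags].all(axis=1), 'Age'].max()
print(max_age)  # 35
```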

`query` is another option:

df.query("Married and OwnsHouse")['Age'].max()
# 35

It doesn't require an intermediate step of computing a mask.
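As a side note, `query` strings can also reference local Python variables with `@`, which helps if the filter is parameterized (a small sketch; the `min_age` threshold is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [23, 35, 14, 27],
    'Married': [True, True, False, True],
    'OwnsHouse': [False, True, False, True],
})

min_age = 18  # hypothetical extra condition, referenced via @ in the query string
result = df.query("Married and OwnsHouse and Age >= @min_age")['Age'].max()
print(result)  # 35
```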


Your method is fast enough, but if you want to micro-optimize, here are some more options that drop down to numpy:

# <= 0.23
df[(df['Married'].values & df['OwnsHouse'].values)].max()
df[df[['Married', 'OwnsHouse']].values.all(axis=1)].max()
# 0.24+
df[(df['Married'].to_numpy() & df['OwnsHouse'].to_numpy())].max()
df[df[['Married', 'OwnsHouse']].to_numpy().all(axis=1)].max()

Age          35.0
Married       1.0
OwnsHouse     1.0
dtype: float64

Although you probably want just the age:

df.loc[(df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()), 'Age'].max()
# 35

If you fancy some more numpy, do this:

df.loc[(
   df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()), 'Age'
].to_numpy().max()
# 35

Or better still, throw away pandas,

df['Age'].to_numpy()[df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()].max()
# 35
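If you want to verify the micro-optimization claim on your own data, a rough `timeit` harness along these lines works (the frame size and True-probabilities here are arbitrary, and both paths are checked for agreement first):

```python
import timeit

import numpy as np
import pandas as pd

# Build a larger random frame so the timing comparison is meaningful
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'Age': rng.integers(0, 100, n),
    'Married': rng.random(n) < 0.5,
    'OwnsHouse': rng.random(n) < 0.5,
})

def pandas_way():
    # Boolean indexing with pandas Series
    return df.loc[df['Married'] & df['OwnsHouse'], 'Age'].max()

def numpy_way():
    # Same computation on the underlying numpy arrays
    mask = df['Married'].to_numpy() & df['OwnsHouse'].to_numpy()
    return df['Age'].to_numpy()[mask].max()

assert pandas_way() == numpy_way()
print('pandas:', timeit.timeit(pandas_way, number=100))
print('numpy: ', timeit.timeit(numpy_way, number=100))
```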
cs95
  • Actually, `len("df['Married']& df['OwnsHouse']")==30` and `len("df[['Married', 'OwnsHouse']].all(axis=1)")==40` :D – Quang Hoang Jun 26 '19 at 23:29
  • @QuangHoang great but what about `df[col1] & df[col2] & ... df[col100]`? :D – cs95 Jun 26 '19 at 23:30
  • @QuangHoang I realise the terminology isn't quite right there, so you've a point. – cs95 Jun 26 '19 at 23:32
  • Is this method faster than the first method he's trying? – 0xPrateek Jun 26 '19 at 23:33
  • @0xPrateek Sure, it is now. – cs95 Jun 26 '19 at 23:34
  • Have you checked it using timeit? Or something else? – 0xPrateek Jun 26 '19 at 23:37
  • @0xPrateek You mean have I timed whether using numpy functions is faster than the equivalent pandas method? Yes. [Here](https://stackoverflow.com/a/54299629/4909087). It'll probably make a lot more difference here though. – cs95 Jun 26 '19 at 23:39
  • @cs95 yes...! Can be understood from that answer..! Thanks for clearing doubts :) – 0xPrateek Jun 26 '19 at 23:43
  • @smci `&` is only being used here in a vectorized context. What specifically are you referring to? Note that with `query` it makes no difference because of how pandas builds the parse tree from the string. – cs95 Jun 26 '19 at 23:58
  • @cs95 nice detailed answer. – Quang Hoang Jun 27 '19 at 00:22
  • @QuangHoang Thanks! I've used a grenade launcher when I probably should've used a fly swatter, but I'm pretty satisfied with the level of detail. – cs95 Jun 27 '19 at 00:24
  • @smci From the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing): "_Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not._" I can't see why you'd call a practice recommended by the documentation to be sloppy. – cs95 Jun 27 '19 at 16:29
  • 1
    @cs95 Ah, my mistake. It's different from both R and C – smci Jun 27 '19 at 18:25