15
   Survived  SibSp  Parch
0         0      1      0
1         1      1      0
2         1      0      0
3         1      1      0
4         0      0      1

Given the above dataframe, is there an elegant way to groupby with a condition? I want to split the data into two groups based on the following conditions:

(df['SibSp'] > 0) | (df['Parch'] > 0) =   New Group -"Has Family"
 (df['SibSp'] == 0) & (df['Parch'] == 0) = New Group - "No Family"

then take the means of both of these groups and end up with an output like this:

               SurvivedMean
 Has Family    Mean
 No Family     Mean

Can it be done using groupby or would I have to append a new column using the above conditional statement?

feetwet
  • 3,248
  • 7
  • 46
  • 84
George Vince
  • 179
  • 1
  • 2
  • 7
  • This looks like a job for boolean indexing. – cs95 Jul 13 '17 at 14:06
  • Is your df coded in binary? If so, you may be able to use the pandas method get_dummies {https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html}. Otherwise, yes, I would recommend/think you should create a new column (you would only need one I think) to perform the groupby on. I can help write some code if I have a better idea of what you're doing! Also, given your desired output, it seems like you will need to pivot the db as well! – Graham Streich Jul 13 '17 at 14:07

3 Answers3

16

An easy way to group that is to use the sum of those two columns. If either of them is positive, the result will be greater than 1. And groupby accepts an arbitrary array as long as the length is the same as the DataFrame's length so you don't need to add a new column.

family = np.where((df['SibSp'] + df['Parch']) >= 1 , 'Has Family', 'No Family')
df.groupby(family)['Survived'].mean()
Out: 
Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64
ayhan
  • 70,170
  • 20
  • 182
  • 203
2

Use only one condition if never values in columns SibSp and Parch are less as 0:

m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)

df = df.groupby(np.where(m1, 'Has Family', 'No Family'))['Survived'].mean()
print (df)
Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64

If is impossible use first use both conditions:

m1 = (df['SibSp'] > 0) | (df['Parch'] > 0)
m2 = (df['SibSp'] == 0) & (df['Parch'] == 0)
a = np.where(m1, 'Has Family', 
    np.where(m2, 'No Family', 'Not'))

df = df.groupby(a)['Survived'].mean()
print (df)
Has Family    0.5
No Family     1.0
Name: Survived, dtype: float64
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

You could define your conditions in a list and use the function group_by_condition below to create a filtered list for each condition. Afterwards you can select the resulting items using pattern matching:

df = [
  {"Survived": 0, "SibSp": 1, "Parch": 0},
  {"Survived": 1, "SibSp": 1, "Parch": 0},
  {"Survived": 1, "SibSp": 0, "Parch": 0}]

conditions = [
  lambda x: (x['SibSp'] > 0) or (x['Parch'] > 0),  # has family
  lambda x: (x['SibSp'] == 0) and (x['Parch'] == 0)  # no family
]

def group_by_condition(l, conditions):
    return [[item for item in l if condition(item)] for condition in conditions]

[has_family, no_family] = group_by_condition(df, conditions)
Zwackelmann
  • 557
  • 1
  • 5
  • 19