
I have been using Pandas for data processing before training a binary classifier. One thing I could not find is a function that tells me, for a given value of a feature, say Age (for example, people who are 60 years old), what fraction of those people are classified as 1 or as 0 in the binary label column. I want this for every distinct value in the Age column.

A simple example to illustrate my idea. I have the following DataFrame:

import pandas as pd

data = pd.DataFrame({'Age': [23, 24, 23, 25, 24, 24, 20], 'label': [0, 1, 1, 0, 1, 1, 0]})

and I want a function that gives me, for each age, the fraction of people labeled 1 (the fraction labeled 0 is then one minus that value). Like so:

   Age   Percentage
0   20     0.0
1   23     0.5
2   24     1.0
3   25     0.0

Is there any function already implementing that? Because I could not find one and I find this a pretty common need for data analysis in binary classification problems.

Thank you!

desertnaut
erni
  • This is a pure pandas question, and has nothing to do with `machine-learning` or `scikit-learn` - kindly do not spam irrelevant tags (removed). – desertnaut Aug 18 '20 at 12:40

1 Answer

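Just do a groupby mean. Because the labels are 0 and 1, the mean of `label` within each age group is the fraction of rows labeled 1: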

>>> data.groupby('Age').mean()
     label
Age       
20     0.0
23     0.5
24     1.0
25     0.0

Reset the index to get it exactly as in your expected output:

>>> data.groupby('Age').mean().reset_index()
   Age  label
0   20    0.0
1   23    0.5
2   24    1.0
3   25    0.0
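
As a minimal sketch building on the above (the variable names `share_of_ones` and `share_by_label` and the renamed `Percentage` column are illustrative choices, not part of the original answer), you can select the `label` column explicitly and rename it to match the requested header. If you also want the fraction labeled 0 alongside the fraction labeled 1, `pd.crosstab` with `normalize='index'` should give one column per label:

```python
import pandas as pd

data = pd.DataFrame({'Age': [23, 24, 23, 25, 24, 24, 20],
                     'label': [0, 1, 1, 0, 1, 1, 0]})

# Fraction of rows labeled 1 for each age, with the column
# named as in the question's expected output.
share_of_ones = (
    data.groupby('Age')['label']
        .mean()
        .rename('Percentage')
        .reset_index()
)

# Fraction of each label per age: one column per label value,
# and each row sums to 1 because of normalize='index'.
share_by_label = pd.crosstab(data['Age'], data['label'], normalize='index')

print(share_of_ones)
print(share_by_label)
```

Selecting `['label']` before calling `mean()` keeps the result limited to that column, which matters if the DataFrame has other numeric or non-numeric columns.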
ignoring_gravity