0

I have to create bins based on age. There are some missing values (nan) that need to be changed as "N/A and assign to a new category as "Not_Availabe". I filled in the missing values and then transformed the strings to float.

Students.loc[:, AGE']=Students.loc[:,AGE'].fillna("N/A")
Students.loc[:,AGE'] = Students.loc[:,'AGE'].str.replace('\%', '', regex=True).astype(float)

When I do this I get an error message as "could not convert string to float: 'N/A'. Then, I tried to use pd.cut and assign bins and labels, but nothing work.

If I just do it an error message is "not supported between instances of 'int' and 'str*

Code:

Students.loc[:,'AGE']=Students.loc[:,'AGE'].fillna("unknown")

Students.loc[:,'AGE'] = Students.loc[:,'AGE'].str.replace('\%', '', regex=True).astype(float)

Students.loc[:,'AGE'] = Students.loc[:,'AGE'].cat.add_categories("Not_Available")

Students.loc[:,'AGE']=pd.cut(x=Students.loc[:,'AGE'],bins=[0,18,30,50,65,75,100],labels=["Unknown,"18 and under", "19-30", "31-50", "51-65", "66-75","75+"])

The output should be similar as:

Not_Availabe: 10 
18 and under: 16 
19-30: 80
31-50: 15
51-65: 5
66-75: 2
75+: 1
cottontail
  • 10,268
  • 18
  • 50
  • 51
Mo_Phy
  • 1

1 Answers1

1

Convert the AGE column to float first, to avoid trying to convert string to float:

df['AGE'] = df['AGE'].str.replace('%', '', regex=True).astype(float)

I would suppose to then replace missing AGE values with -1 instead of 'N/A' to simplify binning.

df['AGE'] = df['AGE'].fillna(-1)

So, strictly speaking, it would not have been necessary here to do the conversion to float first to avoid the error message. Nevertheless this is still the sensible order of steps.

Then, define custom ranges and do the binning:

ranges = [-1, 0, 18, 30, 50, 65, 75, np.inf]
labels = ['Not_Availabe', '18 and under', '19-30', '31-50', '51-65', '66-75', '75+']
df['AGE_labeled'] = pd.cut(df['AGE'], bins=ranges, labels=labels)
# print the result
print(df['AGE_labeled'].value_counts())

Note: I created the following dummy data for testing (hope it fits the question, as no sample data was given)

import numpy as np
import pandas as pd
# dummy data
length = 1000
data = {
    'STUDENT_ID': np.random.randint(1000, 9999, length),
    'AGE': np.random.randint(0, 99, length),
}
df = pd.DataFrame (data)
df['AGE'] = df['AGE'].astype(str)  # convert AGE value to str (as I assume from your question that this is the case)
df.loc[df.sample(frac=0.1).index,'AGE'] = np.nan  # randomly set some values to nan
rosa b.
  • 1,759
  • 2
  • 15
  • 18