How to work on "age bins" in Pandas Dataframe which are saved as string?

Question

I downloaded a LEGO dataset in .csv format from Kaggle. It has an "Ages" column like this:

df['Ages'].unique()
array(['6-12', '12+', '7-12', '10+', '5-12', '8-12', '4-7', '4-99', '4+',
   '9-12', '16+', '14+', '9-14', '7-14', '8-14', '6+', '2-5', '1½-3',
   '1½-5', '9+', '5-8', '10-21', '8+', '6-14', '5+', '10-16', '10-14',
   '11-16', '12-16', '9-16', '7+'], dtype=object)

These categories are the suggested ages for playing with the LEGO sets. I intend to do some statistical analysis with these age bins. For example, I want to compute the mean of the suggested ages. However, each value is a string:

type(lego_dataset.loc[0]['Ages'])
str

I don't know how to work on the data.

I've already checked "How to categorize a range of values in Pandas DataFrame", but imagine there are 100 unique bins. It isn't reasonable to prepare a list of 100 labels, one for each category. There should be a better way.

The number of bins depends on the aim of your analysis. You can make for example four groups: baby, child, teenager, and adult. — Mykola Zotko, Oct 04 '19 at 10:41
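The grouping suggested in this comment could be sketched roughly as follows. This is only an illustration: the cut points in `edges` and the choice to group by the lower age bound are assumptions, not something given in the question, and the '½' in values such as '1½-3' is simply dropped here.

```python
import pandas as pd

# A few of the values from the question's "Ages" column.
ages = pd.Series(['1½-3', '6-12', '12+', '16+', '4-99'], name='Ages')

# Leading integer of each bin (the '½' is ignored for simplicity).
lower = pd.to_numeric(ages.str.extract(r'^(\d+)', expand=False))

# Hypothetical cut points and labels; adjust them to the analysis at hand.
edges = [0, 3, 12, 17, 120]
labels = ['baby', 'child', 'teenager', 'adult']

# right=False makes each interval closed on the left: [0, 3), [3, 12), ...
groups = pd.cut(lower, bins=edges, labels=labels, right=False)
print(groups.tolist())
```

With this approach only the handful of cut points has to be written by hand, regardless of how many distinct bin strings the column contains.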

score 1 · Answer 1 · answered Oct 04 '19 at 09:05

I'm not entirely sure what output you are looking for. See whether the code and output below help.

df['Lage'] = df['Ages'].str.split('[-+]').str[0]
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]

or

df['Lage'] = df['Ages'].str.extract(r'(\d+)', expand=False)  # drops the fraction in rows 17 and 18 (1½ becomes 1)
df['Uage'] = df['Ages'].str.split('[-+]').str[-1]


Output1

    Ages   Lage  Uage
0   6-12   6     12
1   12+    12
2   7-12   7     12
3   10+    10
4   5-12   5     12
5   8-12   8     12
6   4-7    4     7
7   4-99   4     99
8   4+     4
9   9-12   9     12
10  16+    16
11  14+    14
12  9-14   9     14
13  7-14   7     14
14  8-14   8     14
15  6+     6
16  2-5    2     5
17  1½-3   1½    3
18  1½-5   1½    5
19  9+     9
20  5-8    5     8
21  10-21  10    21
22  8+     8
23  6-14   6     14
24  5+     5
25  10-16  10    16
26  10-14  10    14
27  11-16  11    16
28  12-16  12    16
29  9-16   9     16
30  7+     7

Output2

    Ages   Lage  Uage
0   6-12   6     12
1   12+    12
2   7-12   7     12
3   10+    10
4   5-12   5     12
5   8-12   8     12
6   4-7    4     7
7   4-99   4     99
8   4+     4
9   9-12   9     12
10  16+    16
11  14+    14
12  9-14   9     14
13  7-14   7     14
14  8-14   8     14
15  6+     6
16  2-5    2     5
17  1½-3   1     3
18  1½-5   1     5
19  9+     9
20  5-8    5     8
21  10-21  10    21
22  8+     8
23  6-14   6     14
24  5+     5
25  10-16  10    16
26  10-14  10    14
27  11-16  11    16
28  12-16  12    16
29  9-16   9     16
30  7+     7
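Since the question asks for a mean, the extracted bounds still need to become numbers. The following is a minimal sketch of one way to do that, not a definitive solution. It assumes that '½' should count as 0.5, that open-ended bins such as '12+' have no upper bound (NaN), and that a closed bin can be summarized by its midpoint; the `Mid` column is a hypothetical name introduced here.

```python
import pandas as pd

# Sample of the values shown in the question.
df = pd.DataFrame({'Ages': ['6-12', '12+', '1½-3', '4-99', '10-21']})

# Capture the lower bound, an optional '½', and an optional upper bound.
parts = df['Ages'].str.extract(r'^(\d+)(½)?(?:-(\d+))?\+?$')

# Lower bound as a float; add 0.5 where the '½' marker was present.
df['Lage'] = pd.to_numeric(parts[0]) + parts[1].notna() * 0.5

# Upper bound as a float; open-ended bins such as '12+' become NaN.
df['Uage'] = pd.to_numeric(parts[2])

# Assumption: summarize each closed bin by its midpoint.
df['Mid'] = (df['Lage'] + df['Uage']) / 2

mean_lower = df['Lage'].mean()
mean_mid = df['Mid'].mean()  # NaN rows are skipped by default
print(df)
print(mean_lower, mean_mid)
```

Note that a bin like '4-99' probably means "4 and up", so its upper bound of 99 pulls the midpoint mean upward. Depending on the analysis, it may be more sensible to treat it as open-ended or to work with the lower bound only.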
