
Say I have a column in a dataframe which is 'user_age', and I have created 'user_age_bin' by something like:

df['user_age_bin'] = pd.cut(df['user_age'], bins=[10, 15, 20, 25, 30])

Then I build a machine learning model by using the 'user_age_bin' feature.

Next, I get a new record that I need to feed into my model to make a prediction. I don't want to use user_age as it is, because the model uses user_age_bin. So how can I convert a user_age value (say 28) into its user_age_bin? I know I can write a function like this:

def assign_bin(age):
    if age < 10:
        return '<10'
    elif age< 15:
        return '10-15'
    # ... and so on for the remaining bins

and then do:

user_age_bin = assign_bin(28)

But this solution is not elegant at all. I guess there must be a better way, right?

Edit: I changed the code and added explicit bin range. Edit2: Edited wording and hopefully the question is clearer now.

user3768495
  • You can always pass an array as bins: `bins=[0,10,15,20,30]`. – Quang Hoang Feb 27 '20 at 22:17
  • Specify your bin intervals to `pd.cut` – ifly6 Feb 27 '20 at 22:17
  • Thanks @QuangHoang and @ifly6, but I am still not sure. My question is: AFTER I have done the `pd.cut` and I get a new age value, I need to replace that value with its corresponding bin. I can do it with a function like `assign_bin` above, but I think that's a dumb way to do it. I am looking for a smart way to do it. Thanks! – user3768495 Feb 27 '20 at 22:31
  • That’s what `map` is for, or `np.select` – Quang Hoang Feb 27 '20 at 22:32
  • @QuangHoang, those sound like what I am looking for. Could you please give a more explicit answer? I am aware of the `df[col].map(dict)` method but I don't know how to get the `dict` I need when doing the `pd.cut`. Thanks! – user3768495 Feb 27 '20 at 22:37
  • Try a similar approach to https://stackoverflow.com/questions/7934547/python-find-closest-key-in-a-dictionary-from-the-given-input-key – m-dz Feb 28 '20 at 01:25

4 Answers

tl;dr: np.digitize is a good solution.

After reading all the comments and answers here and doing some more Googling, I found a solution that I am pretty satisfied with. Thank you all!

Setup

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)

# sort by age 
print(df.sort_values('user_age'))

Output:

    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5

Assign category:

# a new age value
new_age=30

# right=True makes the intervals right-closed, like pd.cut's default;
# subtracting 1 turns the result into pd.cut's 0-based bin codes
print(np.digitize(new_age, bins=bins, right=True) -1)

Output:

4
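
The same call also accepts a whole array, which makes it easy to check it against `pd.cut` before relying on it for new records. This is only a minimal sketch, assuming the same `bins` as in the setup; the names `train_ages`, `expected`, and `new_ages` are made up for illustration.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]
train_ages = np.arange(5, 90, 5)

# integer bin codes from pd.cut (right-closed intervals by default)
expected = pd.cut(train_ages, bins=bins, labels=False)

# np.digitize with right=True is 1-based for these edges, hence the -1
codes = np.digitize(train_ages, bins=bins, right=True) - 1

# new records can be binned with the same call
new_ages = np.array([28, 30, 31])
new_codes = np.digitize(new_ages, bins=bins, right=True) - 1
print(new_codes)
```

One difference to keep in mind: a value at or below the lowest edge comes back as -1 here, whereas `pd.cut` returns NaN for it.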
user3768495

A somewhat ugly approach that uses a double list comprehension, but it seems to do the job.

Setup:

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [10, 15, 20, 25, 30, np.inf]  # np.Inf was removed in NumPy 2.0
labels = bins[1:]
ages = np.random.randint(10, 35, 10)
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=labels)
print(df)

Out:

   user_age user_age_bin
0        16         20.0
1        29         30.0
2        24         25.0
3        20         20.0
4        17         20.0
5        30         30.0
6        16         20.0
7        28         30.0
8        32          inf
9        20         20.0

Assignment:

# `new_ages` is what you want to assign labels to, used `ages` for simplicity
new_ages = ages
ids = [np.argmax([age <= x for x in labels]) for age in new_ages]
assigned_labels = [labels[i] for i in ids]
print(pd.DataFrame({"new_ages": new_ages, "assigned_labels": assigned_labels, "user_age_bin": df["user_age_bin"]}))

Out:

   new_ages  assigned_labels user_age_bin
0        16             20.0         20.0
1        29             30.0         30.0
2        24             25.0         25.0
3        20             20.0         20.0
4        17             20.0         20.0
5        30             30.0         30.0
6        16             20.0         20.0
7        28             30.0         30.0
8        32              inf          inf
9        20             20.0         20.0
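
If the list comprehensions feel heavy, the same lookup ("first label that is greater than or equal to the age") can probably be expressed with `np.searchsorted`. This is a sketch of an alternative, not part of the approach above; it hard-codes the ten sample ages from the table so that it runs on its own.

```python
import numpy as np

bins = [10, 15, 20, 25, 30, np.inf]
labels = np.array(bins[1:])  # upper edge of each bin, used as its label

new_ages = np.array([16, 29, 24, 20, 17, 30, 16, 28, 32, 20])

# index of the first label that is >= age, i.e. the right-closed bin containing it
ids = np.searchsorted(labels, new_ages, side="left")
assigned_labels = labels[ids]
print(assigned_labels)
```

Note that both this and the list-comprehension version assign ages of 10 or less to the first bin, while `pd.cut` with these edges returns NaN for them.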
m-dz

You can try something like:

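bins = [10, 15, 20, 25, 30]
labels = [f'<{bins[0]}', *(f'{a}-{b}' for a, b in zip(bins[:-1], bins[1:])), f'{bins[-1]}>']
# pd.cut needs exactly one more edge than labels, so pad the edges with -inf and +inf
edges = [float('-inf'), *bins, float('inf')]
pd.cut(df['user_age'], bins=edges, labels=labels)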

Note that f-strings require Python 3.6 or later; on older versions, use str.format instead.
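
To bin a single new value with the same labels, one option is to wrap it in a list and call `pd.cut` again with the same edges. This is a minimal sketch: `age_to_bin` and `edges` are hypothetical names, and it assumes the edge list is padded with infinities so that it has exactly one more entry than `labels`, which `pd.cut` requires.

```python
import pandas as pd

bins = [10, 15, 20, 25, 30]
labels = [f"<{bins[0]}", *(f"{a}-{b}" for a, b in zip(bins[:-1], bins[1:])), f"{bins[-1]}>"]
edges = [float("-inf"), *bins, float("inf")]  # one more edge than labels


def age_to_bin(age):
    """Hypothetical helper: return the label of the bin that contains `age`."""
    return pd.cut([age], bins=edges, labels=labels)[0]


print(age_to_bin(28))
```

Because the edges and labels are stored once and reused, the training column and any later record go through the same binning.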

garciparedes
  • thank you! but this is not what I am asking for. My question was probably not clear. I just edited it a little and hopefully it is clearer now. – user3768495 Feb 27 '20 at 22:52

You can't put strings into a model, so you'll need to create a mapping and keep track of it, or create a separate column to use later.

def apply_age_bin_numeric(value):
    # map an age to an integer bin code
    if value <= 10:
        return 1
    elif 10 < value <= 20:
        return 2
    elif 20 < value <= 30:
        return 3
    # ... and so on for the remaining bins

def apply_age_bin_string(value):
    # map an age to the readable label of the same bin
    if value <= 10:
        return '<=10'
    elif 10 < value <= 20:
        return '11-20'
    elif 20 < value <= 30:
        return '21-30'
    # ... and so on for the remaining bins

df['user_age_bin_numeric']= df['user_age'].apply(apply_age_bin_numeric)
df['user_age_bin_string']= df['user_age'].apply(apply_age_bin_string)  

For the model, you'll keep user_age_bin_numeric and drop user_age_bin_string.

Save a copy of the data with both fields included before it goes into the model. This way you can match the predictions back to the string version of the bin fields if you want to show those instead of the numerical bins.
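
As a possible shortcut for keeping the two versions in sync, the numeric codes can come from `pd.cut(..., labels=False)` and the strings from a small lookup table. This is a sketch under assumptions: the bin edges, the sample ages, and the name `code_to_label` are invented for illustration, and the codes here are 0-based rather than 1-based as in the functions above.

```python
import pandas as pd

bins = [0, 10, 20, 30]               # hypothetical bin edges
labels = ["<=10", "11-20", "21-30"]  # readable version of each bin

df = pd.DataFrame({"user_age": [8, 15, 28]})

# integer codes for the model (0-based)
df["user_age_bin_numeric"] = pd.cut(df["user_age"], bins=bins, labels=False)

# lookup table from code back to the readable label
code_to_label = dict(enumerate(labels))
df["user_age_bin_string"] = df["user_age_bin_numeric"].map(code_to_label)
print(df)
```

Keeping `code_to_label` around means predictions can be matched back to the readable bins later without storing the string column alongside the model input.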

bbennett36
  • Without knowing which model OP wants to build how can you assume he cannot use categorical variables directly? Even if they cannot, simple integer encoding is most probably a rather sub-optimal solution. – m-dz Mar 01 '20 at 16:04
  • Even if he does want to use categorical variables, they still have to be numeric values. You can't put "<10" into a model.... He can change the functions to use some pre-calculated bins instead of hard coding the values in there. He didn't really give too much info about the bins so I wanted to keep it simple. – bbennett36 Mar 02 '20 at 15:57
  • Let's define what "a model" here means, as there is nothing special in e.g. decision trees handling categorical variables directly (see probably all R implementations). If by "a model" you mean model implementation, then Scikit-learn is not the only available library (and decision to not handle cat vars there is at least debatable), as you can easily pass categorical vars directly to e.g. Catboost and it will do all necessary encoding under the hood. So yes, some "models" can easily handle cat vars, or could have if not implemented to not do it. – m-dz Mar 02 '20 at 17:31
  • I'm not saying models can't handle categorical variables. I'm saying you can't put string values in them. Catboost seems to be the only exception. – bbennett36 Mar 03 '20 at 14:29