LabelBinarizer for multiple columns in data frame

Question

I have a CSV file with 25 columns: some are numeric, some are categorical, and some hold names such as actors and directors. I want to fit regression models on this data. To do so, I have to convert the categorical string columns to numeric values using LabelBinarizer from scikit-learn. How can I apply LabelBinarizer to a dataframe that has multiple categorical columns?


Essentially I want to binarize the labels and add them to the dataframe.

In the code below, I have the list of columns I want to binarize, but I can't figure out how to add the new columns back to df.

from sklearn.preprocessing import LabelBinarizer

label_binarizer = LabelBinarizer()
categorylist = ['color', 'language', 'country', 'content_rating']
for col in categorylist:
    tempdf = label_binarizer.fit_transform(df[col])

In the next step, I want to add tempdf to df and drop the original column df[col].

Is `df` in your code a pandas dataframe? Note that the output of `sklearn` methods (like `fit_transform` in your code) is a NumPy array, so `tempdf` in your code is not a pandas DataFrame. First convert it to a dataframe (for instance `newdf = pd.DataFrame(tempdf)`) and then concat it to your `df`. You can also delete columns using `del df['column_name']`. One last comment: check whether you need `LabelBinarizer` or `MultiLabelBinarizer`. – MhFarahani, Nov 07 '16 at 22:13
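A minimal sketch of what this comment describes: wrap each binarized array in a DataFrame, concat it to `df`, and drop the original column. The sample data and the `col_value` naming scheme are assumptions for illustration, not taken from the question's CSV. It also assumes every column has at least two distinct values. One thing to watch: for a column with exactly two classes, `LabelBinarizer` returns a single 0/1 column rather than two.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical sample data standing in for the asker's CSV.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Hindi"],
    "budget": [100, 200, 300],
})
categorylist = ["color", "language"]

for col in categorylist:
    binarizer = LabelBinarizer()
    encoded = binarizer.fit_transform(df[col])  # NumPy array, not a DataFrame
    if encoded.shape[1] == 1:
        # Two-class columns give one 0/1 column that marks classes_[1].
        names = [f"{col}_{binarizer.classes_[1]}"]
    else:
        names = [f"{col}_{value}" for value in binarizer.classes_]
    # Reuse df's index so the concat lines up row by row.
    newdf = pd.DataFrame(encoded, columns=names, index=df.index)
    df = pd.concat([df.drop(columns=col), newdf], axis=1)

print(df.columns.tolist())
```

Using a fresh `LabelBinarizer` per column keeps each fitted encoder separate, which helps if the same encoding has to be applied to new data later.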

score 8 · Answer 1 · edited May 23 '17 at 12:25


You can do this in a one-liner with pd.get_dummies:

tempdf = pd.get_dummies(df, columns=categorylist)
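To make the behaviour of that one-liner concrete, here is a small sketch on made-up data (the column values are assumptions, not the asker's file). The returned frame already has the listed columns replaced by indicator columns, so there is no separate concat or drop step.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's CSV.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Hindi"],
    "budget": [100, 200, 300],
})
categorylist = ["color", "language"]

# Each listed column is replaced by one indicator column per distinct value;
# columns not listed (here "budget") are passed through unchanged.
tempdf = pd.get_dummies(df, columns=categorylist)

print(tempdf.columns.tolist())
```

The indicator columns are named `<column>_<value>`, for example `color_Color`.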

Otherwise, you can use a FeatureUnion with FunctionTransformer, as in the answer to "sklearn pipeline - how to apply different transformations on different columns".
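A rough sketch of the FeatureUnion approach, not the linked answer's exact code. It assumes a recent scikit-learn in which `FunctionTransformer` passes DataFrames through unchanged by default; the sample data, the helper functions `select_numeric` and `select_categorical`, and the choice of `OneHotEncoder` for the encoding step are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Hypothetical sample data standing in for the asker's CSV.
df = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "language": ["English", "French", "Hindi"],
    "budget": [100, 200, 300],
})
categorylist = ["color", "language"]
numeric_cols = ["budget"]

def select_numeric(frame):
    # Hand the numeric columns over as a plain NumPy array.
    return frame[numeric_cols].to_numpy()

def select_categorical(frame):
    return frame[categorylist]

union = FeatureUnion([
    ("numeric", FunctionTransformer(select_numeric)),
    ("categorical", Pipeline([
        ("select", FunctionTransformer(select_categorical)),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

features = union.fit_transform(df)
# The one-hot branch is sparse, so densify for inspection.
dense = features.toarray() if hasattr(features, "toarray") else np.asarray(features)
print(dense.shape)
```

Unlike `pd.get_dummies`, the fitted `union` can be reused on new rows with `union.transform(...)`, which matters when the encoding has to match between training and prediction.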

EDIT: As added by @dukebody in the comments, you can also use the sklearn-pandas package, whose purpose is to apply different transformations to each dataframe column.


answered Nov 07 '16 at 22:12

maxymoo


You can also use the sklearn-pandas package, whose purpose is to apply different transformations to each dataframe column. – dukebody Nov 09 '16 at 09:51
@dukebody this looks very handy ! – maxymoo Nov 09 '16 at 22:15
