
I have a dataset full of categorical values that are not encoded at the moment. For instance, I have a variable called condition which has these values: Very Excellent, Excellent, Very Good

I want to encode these (give them integer values) so that I can use them as categorical dummy variables in a regression. However, I have lots of these in my DataFrame, so I'd like to iterate over the columns and encode every column whose dtype is object. This is my attempt:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
enc=LabelEncoder()

for column in df_06:
    if df_06["column"].values==object:
        df_06["column"]=enc.fit_transform(df_06["column"])

My dataframe is

my df

Error:

<ipython-input-48-ea6aec86108f> in <module>()
      1 for column in df_06:
----> 2     if df_06[column].values==object:
      3         df_06[column]=enc.fit_transform(df_06[column])

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Venkatachalam
J.DF
  • What's the problem? What does it return? – Maxouille Mar 08 '19 at 09:44
  • Are you encoding them as integers or as dummy-variables (aka One-Hot-Encoding)? These are 2 different techniques. The example you've given appears to be ordinal and would be best encoded with your own mapping. eg `{'Very Good': 0, 'Excellent': 1, 'Very Excellent': 2}`. `LabelEncoder` would not guarantee the correct order – Chris Adams Mar 08 '19 at 09:48
  • I'd like to encode them as dummies. What is a quick way of mapping them without having do it manually? – J.DF Mar 08 '19 at 10:02
  • in () 1 for column in df_06: ----> 2 if df_06[column].values==object: 3 df_06[column]=enc.fit_transform(df_06[column]) ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() – J.DF Mar 08 '19 at 10:04
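
For the ordinal case raised in the comments, an explicit mapping can be applied with Series.map. This is only a minimal sketch: the frame below is made up for illustration, and the dictionary is the one suggested in the comment above.

```python
import pandas as pd

# Hypothetical frame using the condition values from the question.
df = pd.DataFrame(
    {"condition": ["Excellent", "Very Good", "Very Excellent", "Excellent"]}
)

# Mapping taken from the comment above: a larger integer means a better condition.
condition_order = {"Very Good": 0, "Excellent": 1, "Very Excellent": 2}

# Series.map looks each value up in the dict; values missing from the dict become NaN.
df["condition_code"] = df["condition"].map(condition_order)
```

Unlike LabelEncoder, which assigns codes in sorted order of the labels, this keeps the order you chose.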

2 Answers


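That for loop has several errors. For example, df_06["column"] looks up a column literally named "column" instead of using the loop variable column. Also, df_06[column].values==object compares every element of the column to object and returns a boolean array, which an if statement cannot evaluate as a single truth value (the error you reported in the comments).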

For your problem, you can use

    # e.g. include=['object'] to pick up string columns only
    for column in df.select_dtypes(include=['whatever_type_you_want']):
        df[column], _ = pd.factorize(df[column])

select_dtypes can also accept exclude as an argument.
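
As a minimal sketch of that loop applied to the question's case, assuming the columns to encode have object dtype: the frame below is a made-up stand-in for df_06 (MSZone is the column named in the comments; LotArea and all values are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for df_06: one string column, one integer column.
# dtype=object is set explicitly so the example does not depend on how the
# installed pandas version infers string columns.
df = pd.DataFrame({
    "MSZone": pd.Series(["RL", "RM", "RL", "FV"], dtype=object),
    "LotArea": [8450, 9600, 11250, 9550],
})

# Keep the unique values per column so the integer codes can be mapped back.
uniques_by_column = {}

# Iterating over a DataFrame yields column names; select_dtypes keeps only
# the object-dtype columns, so LotArea is left untouched.
for column in df.select_dtypes(include=["object"]):
    # factorize returns (codes, uniques); codes follow order of first appearance.
    codes, uniques = pd.factorize(df[column])
    df[column] = codes
    uniques_by_column[column] = uniques
```

Keeping uniques instead of discarding it with _ lets you recover the original label for a code later, for example uniques_by_column["MSZone"][0].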

micric
  • Thanks. What is the appropriate way to call the column? – J.DF Mar 08 '19 at 10:22
  • @J.DF you have to remove the "": df_06[column] – micric Mar 08 '19 at 10:26
  • Thanks. So the new loop would look something like this: for column in df_06: df_06[column].factorize – J.DF Mar 08 '19 at 10:34
  • factorize actually returns two values (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html), so you have to do something like: for column in df: df[column], _ = pd.factorize(df[column]) This will encode every column separately tho, not sure, if that's ok for you. – micric Mar 08 '19 at 10:51
  • The problem with that is that it encodes integer values too. that's why I was using if df_06[column].values==object. Also, what does ,_ do? – J.DF Mar 08 '19 at 10:55
  • Without an example of what your df looks like it's pretty hard to give a complete answer. If you post an example it'd be easier. Regarding the latest question: ,_ is needed because factorize returns two values. That way we are basically saying to assign only the first value to df[column] – micric Mar 08 '19 at 11:02
  • I've added a caption of my df to my original post. MSZone is one of the variables I want to encode, the others are integers which I do not want to encode – J.DF Mar 08 '19 at 11:30
  • You could use df.select_dtypes (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html) – micric Mar 08 '19 at 12:18
  • @J.DF so the loop should be for column in df.select_dtypes(include=['whatever_type_you_want']): etc etc – micric Mar 08 '19 at 12:30

Before encoding, make sure your columns are represented as category:

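    # apply returns a new frame, so assign it back to keep the category dtype
    df_06[list_of_columns_to_encode] = df_06[list_of_columns_to_encode].apply(lambda col: col.astype('category'))
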
  1. Now if you want to one-hot encode, why not use pd.get_dummies directly? It returns a new DataFrame, so assign the result:

    df_06 = pd.get_dummies(df_06, columns=list_of_columns_to_encode)

  2. If you want to use LabelEncoder then try something like this:

    le = LabelEncoder()
    df_06[list_of_columns_to_encode] = df_06[list_of_columns_to_encode].apply(le.fit_transform)

    Note that le is refit on each column, so afterwards it only holds the classes of the last column. Refer to this answer if you want to know more about how to transform future data using the same LabelEncoder fitted dictionary.
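
Both options can be sketched as follows. This is only an illustration under assumptions: the frame stands in for df_06, and columns_to_encode, dummies, labelled and encoders are hypothetical names. Keeping one fitted LabelEncoder per column in a dict is one possible way to reuse the same mapping on future data, not necessarily the approach of the linked answer.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for df_06.
df = pd.DataFrame({
    "MSZone": pd.Series(["RL", "RM", "RL", "FV"], dtype=object),
    "LotArea": [8450, 9600, 11250, 9550],
})
columns_to_encode = ["MSZone"]

# Option 1: one-hot encoding. get_dummies returns a new frame in which each
# listed column is replaced by one indicator column per category.
dummies = pd.get_dummies(df, columns=columns_to_encode)

# Option 2: integer labels, with one fitted encoder kept per column so the
# same mapping can be applied to future data.
encoders = {}
labelled = df.copy()
for column in columns_to_encode:
    encoders[column] = LabelEncoder()
    labelled[column] = encoders[column].fit_transform(df[column])

# Reuse the fitted encoder on new values (they must have been seen during fit).
new_codes = encoders["MSZone"].transform(["FV", "RL"])
```

For a regression with an intercept, pd.get_dummies also accepts drop_first=True, which drops one level per column to avoid perfectly collinear dummies.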

panktijk