
I am trying to use the label encoder in order to convert categorical data into numeric values.

I need a LabelEncoder that keeps my missing values as NaN so that I can use an Imputer afterwards. So I would like to use a mask to put the missing values from the original data frame back after labelling, like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})


    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN


dfTmp = df
mask = dfTmp.isnull()

       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

So I get a DataFrame with True/False values.

Then I create the encoder:

from sklearn.preprocessing import LabelEncoder

df = df.astype(str).apply(LabelEncoder().fit_transform)

How can I proceed then, in order to encode these values?
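What I have in mind for the last step is roughly the following (I am not sure it is the right way), using the mask to put the NaN values back after encoding:

# put NaN back wherever the original data frame had a missing value
df = df.where(~mask)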

Thanks.

Sreekiran A R
Nasri

1 Answer


The first question is: do you wish to encode each column separately or encode them all with one encoding?

The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.

In that case, you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

The explanation of how it works is below. But, for starters, I'll mention a couple of drawbacks of this solution.

Drawbacks
First, the columns have mixed types: if a column contains a NaN value, then the column has type float, because NaNs are floats in Python.

df.dtypes
A    float64
B      int64
C    float64
dtype: object

It seems meaningless for labels. Still, you can later ignore all the NaNs and convert the rest to integers.
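For instance (just a sketch, assuming a pandas version where casting a float column with NaNs to the nullable 'Int64' dtype is supported, as recent releases do), you could keep the NaNs and still get integer labels back:

# keep the NaNs (shown as <NA>) while storing the labels as nullable integers
df = df.astype('Int64')
print(df.dtypes)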

The second point is: you probably need to keep the LabelEncoder around, because it is often needed later, for instance for an inverse transform. But this solution doesn't keep the encoders; you have no such variable.

A simple, explicit solution is:

encoders = dict()

for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder

print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

- more code, but the result is the same

print(encoders)
Out
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}

- also, the encoders are available. So is the inverse transform (drop the NaNs first!):

encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
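For a column that does contain NaNs, a small sketch of the same idea: drop them and cast the labels back to int before the inverse transform:

valid_labels = df['A'].dropna().astype(int)  # column 'A' has a NaN
encoders['A'].inverse_transform(valid_labels)
Out:
array(['x', 'z'], dtype=object)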

Other options are possible too, such as a registry class that stores the encoders; they are compatible with the first solution but make it easier to iterate through the columns.
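For illustration only (the names below are made up, not from the original code), such a registry can be as simple as a defaultdict that hands out a fresh encoder per column while keeping the .apply() style of the first solution:

from collections import defaultdict

# df_raw stands for the original, not-yet-encoded DataFrame
encoders = defaultdict(LabelEncoder)
df_encoded = df_raw.apply(lambda series: pd.Series(
    encoders[series.name].fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
# encoders now maps each column name to its fitted LabelEncoder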

How it works

The df.apply(lambda series: ...) applies a function that returns a pd.Series to each column; so it returns a DataFrame with the new values.

Expression step by step:

pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)

- series[series.notnull()] drops the NaN values, then the rest is fed to fit_transform.

- as the label encoder returns a numpy array and throws away the index, index=series[series.notnull()].index restores it so the results align correctly. If you don't restore the index:

print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN

- the values shift away from their correct positions, and an IndexError may even occur.

Single encoder for all columns

In that case, stack the DataFrame, fit the encoder, then unstack it:

series_stack = df.stack().astype(str)
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN

- because unstacking reintroduces the NaNs, all values in the DataFrame are floats, so you may want to convert them.
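Since the original goal was to run an imputer on the encoded data, a minimal sketch of that next step (assuming a scikit-learn version that provides sklearn.impute.SimpleImputer, i.e. 0.20+) could be:

from sklearn.impute import SimpleImputer

# fill the remaining NaNs with the most frequent label of each column
imputer = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)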

Hope it helps.

Mikhail Stepanov
  • Thank you for all these explanations!! I appreciate it – Nasri Jan 31 '19 at 10:44
  • I'm glad to help :) – Mikhail Stepanov Jan 31 '19 at 10:47
  • @MikhailStepanov this is great. I have a question. Assuming Column B in the original df was numeric and didn't need encoding does this pose a problem that its values are changed from `1,6,9` to `0.0, 3.0, 4.0` respectively after the last solution? Or does it not matter because the relationships between the values of column B stay the same but are just encoded with label encoding? Thanks – Justin Benfit May 11 '22 at 23:08
  • Hi @Justin Benfit, if I got you right, it doesn't matter. It's just bijective mapping from one set of values to another (specifically, from 0 to n-1, where n is the number of classes). To be honest, reading it now, I find the latter example I wrote, and you referred to a bit misleading. Because in this example, the label encoder is trained on all values from the data frame simultaneously. Hence, float values for all the columns and range of the columns from 1.0 to 6.0. The more appropriate example is the former - a separate encoder for each column. Here we have mapping {1, 6, 9} -> {0, 1, 2}. – Mikhail Stepanov May 13 '22 at 13:46
  • But basically, label encoding is not for encoding numbers but more for encoding strings or something like this. From its name, the label encoder is for encoding labels (Y's), so the latter example is irrelevant. It's unlikely that all the labels are located in different columns (and if one uses it to encode predictors, it's better to one-hot or target-encode them as it produces a suitable representation for most models). – Mikhail Stepanov May 13 '22 at 13:47
  • Moreover, one can fit classifier models directly on a string/categorical data, so a label encoder is not a necessary thing to use. According to the documentation, it "is sometimes useful for writing efficient Cython routines." But the model can treat string labels for you; just impute NaNs, and that's it. – Mikhail Stepanov May 13 '22 at 13:47