
Here is a small data frame containing a very small slice of the data that I need to encode (image: DataFrame to Encode).

My current attempt at this uses scikit-learn's LabelEncoder():

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["local", "animals", "local", "diet", "food", "health", "local", "police brutality", "police", "kids", "dogs"])
list(le.classes_)

(output) 
['animals',
 'diet',
 'dogs',
 'food',
 'health',
 'kids',
 'local',
 'police',
 'police brutality']

I have now added all my desired targets to the encoder, so I need to start encoding. The problem is that LabelEncoder's transform takes its input like this:

le.transform(["local"])  # for the first row in the data frame
(output) array([6])

Now that's the correct encoding for the first row, but how would I do this for every other row? Writing it out by hand isn't doable, as my actual data set has about 6,000 samples.

I'm also not sure whether the targets should be comma-separated or not (I can always change that), but my end goal is to get a new data frame with encoded labels instead of the categorical labels.

Also, since the encoder returns a single array, if I did the same thing for every row, each with a different number of labels (e.g. (dogs, animals) instead of (local)), I would need to stack the arrays into a matrix, and I have no idea how to do that either. Thanks so much for the help!

3 Answers


I think you'd want MultiLabelBinarizer. For sklearn multilabel models, at least, the expected target format is the boolean indicator array rather than an array of lists of integers.
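A minimal sketch of how that could look, assuming a frame with a single 'tags' column holding a list of labels per row (the column name and sample rows here are placeholders, not the actual data):

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame: each cell in 'tags' is a list of labels
df = pd.DataFrame({"tags": [["local"], ["dogs", "animals"], ["police", "police brutality"]]})

mlb = MultiLabelBinarizer()
# fit_transform takes an iterable of label collections and returns
# one 0/1 indicator row per sample
indicator = mlb.fit_transform(df["tags"])

# Wrap it back into a data frame, one column per known label
encoded_df = pd.DataFrame(indicator, columns=mlb.classes_, index=df.index)

Each row of encoded_df then marks which tags apply to that sample, which is the target format sklearn's multilabel estimators expect.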

– Ben Reiniger

I've had this issue recently too; here's a class that does the job:

from sklearn.preprocessing import LabelEncoder


class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode

    def fit(self, X, y=None):
        return self  # nothing to fit here

    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns are specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():  # .iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

And here's how you'd use it on your example:

dataset = MultiColumnLabelEncoder(columns=["local", "animals", "local", "diet", "food", "health", "local", "police brutality", "police", "kids", "dogs"]).fit_transform(dataset)
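Note that the columns argument is a list of column names in the data frame, not the label values themselves. Assuming the frame's single column is called 'tags' (as mentioned in the comments below), the call with the class above might look more like this:

import pandas as pd

# Hypothetical frame: one string-valued 'tags' column
dataset = pd.DataFrame({"tags": ["local", "dogs", "police", "local"]})

# Each distinct value in the 'tags' column is mapped to an integer
dataset = MultiColumnLabelEncoder(columns=["tags"]).fit_transform(dataset)

This assigns one integer per row, so it only fits single-label columns; for rows carrying multiple tags, the MultiLabelBinarizer approach in the other answers is a better fit.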
– desertnaut
  • Is this significantly different from using `OrdinalEncoder`? – Ben Reiniger Aug 05 '20 at 21:31
  • Thanks for the help! Unfortunately, I tried it and it gave me a KeyError: 'local'. My data frame has only one column called 'tags', with local being one of those tags. Could that be the problem? Even if I define the columns = ['tags'], it returns a new data frame with one number per row, not what I want. – Alborz Gharabaghi Aug 06 '20 at 00:05

Try this:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# fit takes an iterable of label collections; wrapping all the tags in a
# single inner list just registers every tag as a known class
mlb.fit([["local", "animals", "local", "diet", "food", "health", "local", "police brutality", "police", "kids", "dogs"]])
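Continuing from that fit, mlb.classes_ now holds the unique tags, and transform turns each row's list of tags into a 0/1 row of the matrix. A rough sketch (the per-row lists below are placeholders, not the actual data):

# transform expects one collection of tags per row
rows = [["local"], ["dogs", "animals"], ["police", "police brutality"]]
matrix = mlb.transform(rows)
# matrix has shape (len(rows), len(mlb.classes_))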
– elyte5star