
I'm starting a project on a dataset that contains over 5k unique values for a category.

My question is, after using label encoder, to "enumerate" the categories, does it make sense to use Standard Scaler to make the data a little more "manageable" for my Machine Learning model?

Keep in mind I have over 500k entries in total and 5k unique categories for this particular column.

This is more about the intuition behind it rather than how to code it, but I figured this should be the place to ask.

keikai
  • https://datascience.stackexchange.com/ would probably be a better place for your question – F.S. Mar 19 '20 at 03:55

3 Answers


If you use LabelEncoder for a category, you need to make sure that its values are comparable. For example, the categories ['high', 'med', 'low'] are comparable (ordinal), so it makes sense to label-encode them and then standard-scale the result.

However, when the categories cannot be compared to each other, label encoding does not make sense. For example, you cannot say that 'Monday' is greater or less than 'Tuesday'.

TL;DR
If your category is comparable (ordinal), it makes sense. If not, try to reduce the number of categories instead; there are many methods to do so.
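To illustrate the distinction, here is a small sketch (with made-up toy columns) contrasting an ordinal column, where an explicit category order makes integer codes meaningful, with a nominal column, where one-hot encoding avoids implying a false order:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Hypothetical toy data: one ordinal column, one nominal column.
df = pd.DataFrame({
    "priority": ["low", "high", "med", "low"],  # comparable -> ordinal codes
    "weekday": ["Mon", "Tue", "Mon", "Wed"],    # not comparable -> one-hot
})

# Ordinal: pass the category order explicitly so low < med < high.
ord_enc = OrdinalEncoder(categories=[["low", "med", "high"]])
priority_codes = ord_enc.fit_transform(df[["priority"]])
print(priority_codes.ravel())  # [0. 2. 1. 0.]

# Nominal: one dummy column per category, no implied ordering.
oh_enc = OneHotEncoder()
weekday_dummies = oh_enc.fit_transform(df[["weekday"]]).toarray()
print(weekday_dummies.shape)  # (4, 3)
```

With 5k unique values, one-hot encoding produces 5k columns, which is one reason reducing cardinality first is worth considering.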


1) LabelEncoder is needed because your machine learning model can't handle strings. It gives you sequential numeric codes (0, 1, 2, ..., n-1). But it is intended for the label (target) column; for features you may use one-hot encoding or the numeric codes directly, depending on your model's requirements.
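A minimal sketch of that mapping: LabelEncoder sorts the distinct values and assigns each one an integer from 0 to n-1 (the city names here are just placeholder data):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["paris", "tokyo", "paris", "amsterdam"])

# Classes are sorted alphabetically, then numbered 0..n-1:
print(list(le.classes_))  # ['amsterdam', 'paris', 'tokyo']
print(list(codes))        # [1, 2, 1, 0]
```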

2) StandardScaler makes your data zero-mean and unit-variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from the other features correctly as expected. (scikit-learn documentation)

So, usually, it helps to scale your data well; this can be useful for faster convergence. But, again, it depends on the ML model you are using.
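A quick sketch (with arbitrary sample values) confirming that StandardScaler computes exactly z = (x - u) / s from the quoted formula, using the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
Z = StandardScaler().fit_transform(X)

# Matches z = (x - u) / s computed by hand (population std, ddof=0):
u, s = X.mean(), X.std()
assert np.allclose(Z, (X - u) / s)

# The result is zero-mean and unit-variance:
print(round(Z.mean(), 6), round(Z.std(), 6))  # 0.0 1.0
```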

Zabir Al Nazi

LabelEncoder should be used for the labels, so that the labels for n categories are replaced with integers from 0 to n-1. You should do this if it is not already done.

StandardScaler is meant to be used on the training and test features, but not on the labels. It outputs positive or negative floats.

You should certainly not apply it to the label column, as the label column must contain non-negative integer class codes.
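The division of labor described above can be sketched as follows (the feature values and class names are made up for illustration): scale the numeric features, encode the label separately, and leave the encoded label unscaled.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical data: two numeric features and a string label column.
X = np.array([[10.0, 200.0],
              [20.0, 100.0],
              [30.0, 300.0]])
y_raw = ["cat", "dog", "cat"]

# Scale features only; encode the label separately.
X_scaled = StandardScaler().fit_transform(X)
y = LabelEncoder().fit_transform(y_raw)

print(X_scaled.dtype)  # float64 -- floats are fine for features
print(list(y))         # [0, 1, 0] -- integer class codes, left unscaled
```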

Catalina Chircu