
I have the following dataset, represented as a numpy array:

direccion_viento_pos

    Out[32]:

    array([['S'],
           ['S'],
           ['S'],
           ...,
           ['SO'],
           ['NO'],
           ['SO']], dtype=object)

The shape of this array is:

direccion_viento_pos.shape
(17249, 1)

I am using Python and scikit-learn to encode these categorical variables as follows:

from __future__ import unicode_literals
import pandas as pd
import numpy as np
# from sklearn import preprocessing
# from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Then I create a label encoder object:

labelencoder_direccion_viento_pos = LabelEncoder() 

I take column position 0 (the only column) of direccion_viento_pos and apply the fit_transform() method to all of its rows:

 direccion_viento_pos[:, 0] = labelencoder_direccion_viento_pos.fit_transform(direccion_viento_pos[:, 0]) 

My direccion_viento_pos now looks like this:

direccion_viento_pos[:, 0]
array([5, 5, 5, ..., 7, 3, 7], dtype=object)
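
As a sanity check (my own minimal sketch, not part of the original pipeline), the mapping that LabelEncoder learned can be inspected through its `classes_` attribute: each category is encoded as its index in this alphabetically sorted array.

    # classes_ holds the category labels in sorted order;
    # each label is encoded as its index in this array.
    print(labelencoder_direccion_viento_pos.classes_)
    # ['E' 'N' 'NE' 'NO' 'O' 'S' 'SE' 'SO']
    # consistent with the output above: S -> 5, NO -> 3, SO -> 7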

Up to this point, each row/observation of direccion_viento_pos has a numeric value, but I want to solve the problem of weight: some rows now carry a higher value than others, even though the categories have no inherent order.

Because of this, I create dummy variables, which according to this reference are:

A Dummy variable or Indicator Variable is an artificial variable created to represent an attribute with two or more distinct categories/levels

Then, in my direccion_viento_pos context, I have 8 values:

  • SO - Suroeste (southwest)
  • SE - Sureste (southeast)
  • S - Sur (south)
  • N - Norte (north)
  • NO - Noroeste (northwest)
  • NE - Noreste (northeast)
  • O - Oeste (west)
  • E - Este (east)

That means 8 categories. Next, I create a OneHotEncoder object with the categorical_features parameter, which specifies which features will be treated as categorical variables.

onehotencoder = OneHotEncoder(categorical_features = [0])

Then I apply this onehotencoder to the direccion_viento_pos matrix:

direccion_viento_pos = onehotencoder.fit_transform(direccion_viento_pos).toarray()

My direccion_viento_pos with its categorized variables now looks like this:

direccion_viento_pos

array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

So, up to here, I've created a dummy variable for each category.

(Figure: categorized wind direction)
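
(An editorial aside: the `categorical_features` parameter was deprecated and later removed from scikit-learn, and recent versions of `OneHotEncoder` accept string columns directly, so the `LabelEncoder` step can be skipped entirely. A minimal sketch, assuming `direccion_viento_pos` still holds the original string column:)

    from sklearn.preprocessing import OneHotEncoder

    # Recent scikit-learn: OneHotEncoder handles string categories directly,
    # so no intermediate LabelEncoder step is needed.
    onehotencoder = OneHotEncoder(sparse_output=False)  # use sparse=False on older versions
    dummies = onehotencoder.fit_transform(direccion_viento_pos)
    # one 0/1 column per category, ordered as in onehotencoder.categories_[0]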

I wanted to narrate this process in order to arrive at my question.

If these dummy-encoded variables are already in the 0-1 range, is it necessary to apply MinMaxScaler feature scaling?

Some say that it is not necessary to scale these dummy variables. Others say it is necessary, because we want accurate predictions.

I ask this because, when I apply MinMaxScaler with feature_range=(0, 1), my values change in some positions, despite still keeping to this scale.

What is the best option for me to choose with respect to my dataset direccion_viento_pos?

bgarcial
  • In [this post](https://discuss.analyticsvidhya.com/t/dummy-variables-is-necessary-to-standardize-them/66867/3) I received some guidance about this. There is a difference between when to use `LabelEncoder` and when to use `OneHotEncoder`; in my question above I use them together and get the expected result, which is the **codification** (with `LabelEncoder`) and categorization (with `OneHotEncoder`) that treat these values as categorical, avoiding the weight problem among them. – bgarcial May 31 '18 at 15:47
  • 1
  • But there is also the [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function, which "convert[s] categorical variable into dummy/indicator variables", doing this without applying `LabelEncoder` and `OneHotEncoder`. It is more efficient (a minimal sketch follows below). – bgarcial May 31 '18 at 15:49
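
To illustrate that comment, here is a minimal `pd.get_dummies` sketch (my own illustration, assuming the original string column; the `viento` column prefix is just a hypothetical name):

    import pandas as pd

    # Build the 0/1 indicator columns in one step,
    # with no LabelEncoder/OneHotEncoder needed.
    serie = pd.Series(direccion_viento_pos[:, 0], name='direccion_viento')
    dummies = pd.get_dummies(serie, prefix='viento', dtype=float)
    # columns: viento_E, viento_N, viento_NE, ... each holding 0.0/1.0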

1 Answer


I don't think scaling them will change the answer at all; they're all on the same scale already: min 0, max 1, range 1. If some continuous variables were present, you'd want to normalize only those continuous variables, leaving the dummy variables alone. You could use the min-max scaler to give those continuous variables the same minimum of 0, maximum of 1, and range of 1; then your regression slopes would be very easy to interpret. Your dummy variables are already normalized.
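
A quick way to see this (my own minimal check, not from the original answer): fit `MinMaxScaler` on a small 0/1 dummy matrix and confirm the output is identical, since every column already has min 0 and max 1.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Each column spans exactly [0, 1], so the min-max transform
    # (x - min) / (max - min) maps every value to itself.
    dummies = np.array([[1., 0., 0.],
                        [0., 1., 0.],
                        [0., 0., 1.],
                        [1., 0., 0.]])
    scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(dummies)
    print(np.array_equal(dummies, scaled))  # True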

Here's a related question asking if one should ever standardize binary variables.

Sean McCarthy