I am feature scaling my data before logistic regression.

Everything works perfectly until I attempt to divide the columns by the max_min vector. The division seems to have worked for every column except the 'Age' column, and I can't work out why.

I have previously split the data for testing and training and below I am attempting to scale the X_train data.

# Working out the min value for each column and subtracting this from each row in the data
X_train_min = np.array(X_train0.min())
X_train0.sub(X_train_min.squeeze(), axis=1)

From the code above I obtain a table where each value has had the minimum value of its column subtracted, which is correct.

# Working out the max value for each column and the difference between the max and min values
X_train_max = np.array(X_train0.max())
max_min = np.array(X_train0.max()) - np.array(X_train0.min())
print(max_min)

Output:

[   56     1     3     2     4     3 18174    56     7]

Here is where I face a problem:

# Dividing each row in the data by the difference between the max and min values of its column
X_train0.div(max_min, axis=1)

I have obtained a table where each value has been divided by the vector, apart from the first column 'Age', where the numbers do not correspond to the division.

1 Answer

You are dividing by max - min even though min has already been subtracted. Once the data has been shifted, you only need to divide by the new max:

max_min = np.array(X_train0.max())
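
A minimal sketch of this approach, assuming the shifted table is stored in a new variable (`DataFrame.sub` returns a new DataFrame rather than modifying `X_train0`). The small frame, its column names, and the names `X_shifted`, `new_max`, and `X_scaled` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train0; the real data is not shown in the question.
X_train0 = pd.DataFrame({"Age": [18, 46, 74], "Sex": [0, 1, 1]})

# Keep the result of the subtraction: sub() returns a new DataFrame.
X_shifted = X_train0.sub(X_train0.min(), axis=1)

# After the shift every column starts at 0, so its max equals the original max - min.
new_max = np.array(X_shifted.max())
X_scaled = X_shifted.div(new_max, axis=1)

print(X_scaled)
```

With this toy data every column of `X_scaled` runs from 0 to 1.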
Marat
  • The formula I found for feature scaling is (x-min)/(max-min) – Rebecca Stephens Nov 20 '20 at 17:53
  • @RebeccaStephens That is true only if you take both max and min before applying the formula. From your code, it looks like you calculate max after subtracting min – Marat Nov 20 '20 at 18:49
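
To illustrate the point made in these comments, here is a sketch that takes both the min and the max from the original data and then applies (x - min)/(max - min) in one expression. It also shows what dividing the unshifted table gives when a column's minimum is not zero, which may be what happened to 'Age' in the question if the result of `sub` was not assigned back. The data and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical data: Age has a non-zero minimum, Sex already starts at 0.
X_train0 = pd.DataFrame({"Age": [18, 46, 74], "Sex": [0, 1, 1]})

# Take min and max from the ORIGINAL data, before any shifting.
col_min = X_train0.min()
col_max = X_train0.max()
col_range = col_max - col_min

# (x - min) / (max - min), aligned column by column.
X_scaled = (X_train0 - col_min) / col_range

# Dividing the original, unshifted table by the range instead:
# columns whose min is 0 look correct, but Age does not land in [0, 1].
X_unshifted = X_train0 / col_range

print(X_scaled)
print(X_unshifted)
```

Here `X_scaled["Age"]` spans 0 to 1, while `X_unshifted["Age"]` exceeds 1 because the minimum of 18 was never removed.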