
I'm trying to implement a paper that uses the PIMA Indians Diabetes dataset. This is the dataset after imputing missing values:

Preg    Glucose     BP     SkinThickness    Insulin     BMI    Pedigree Age Outcome
0   1   148.0   72.000000   35.00000    155.548223  33.600000   0.627   50  1
1   1   85.0    66.000000   29.00000    155.548223  26.600000   0.351   31  0
2   1   183.0   64.000000   29.15342    155.548223  23.300000   0.672   32  1
3   1   89.0    66.000000   23.00000    94.000000   28.100000   0.167   21  0
4   0   137.0   40.000000   35.00000    168.000000  43.100000   2.288   33  1
5   1   116.0   74.000000   29.15342    155.548223  25.600000   0.201   30  0

The description:

df.describe()

       Preg        Glucose     BP          SkinThickness  Insulin     BMI         Pedigree    Age
count  768.000000  768.000000  768.000000  768.000000     768.000000  768.000000  768.000000  768.000000
mean   0.855469    121.686763  72.405184   29.153420      155.548223  32.457464   0.471876    33.240885
std    0.351857    30.435949   12.096346   8.790942       85.021108   6.875151    0.331329    11.760232
min    0.000000    44.000000   24.000000   7.000000       14.000000   18.200000   0.078000    21.000000
25%    1.000000    99.750000   64.000000   25.000000      121.500000  27.500000   0.243750    24.000000
50%    1.000000    117.000000  72.202592   29.153420      155.548223  32.400000   0.372500    29.000000
75%    1.000000    140.250000  80.000000   32.000000      155.548223  36.600000   0.626250    41.000000
max    1.000000    199.000000  122.000000  99.000000      846.000000  67.100000   2.420000    81.000000

The description of normalization from the paper is as follows:

As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0, 1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V' with the equation below: V' = V - Y/Z, where V' = new normalized value, V = previous value, Y = mean and Z = standard deviation.

import scipy.stats

z = scipy.stats.zscore(df)

But when I run the code above, I get negative values and values greater than one, i.e. values not in the range [0, 1].

  • Your formula _standardises_ the values, which is not the same as forcing them to the range [0, 1]. Do you have to do it 'manually', for any reason? If not, have a look at `sklearn`'s `MinMaxScaler`, the documentation for which is here: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler.inverse_transform – Chris Apr 07 '20 at 17:15
  • Welcome to stack overflow! Please show a [mcve] including sample data and your code so that we can better understand how to help – G. Anderson Apr 07 '20 at 17:16
  • @Chris I am trying to implement a paper using PIMA Indians Diabetic dataset. In the paper, they are using z-score normalization. It is mentioned as _As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our..._ –  Apr 07 '20 at 17:27
  • Please provide a [mcve]. – AMC Apr 07 '20 at 19:35
  • @AMC Have edited the question. –  Apr 07 '20 at 20:19
  • @G.Anderson Thank you. Have edited the question with the data. –  Apr 07 '20 at 20:20

3 Answers


There are several points to note here.

Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.

Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.

Set up data

import string

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')

# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]

$ print(df.head())

   a    b   c   d    e     f      g   h  i
0  1   85  66  29    0  26.6  0.351  31  0
1  8  183  64   0    0  23.3  0.672  32  1
2  1   89  66  23   94  28.1  0.167  21  0
3  0  137  40  35  168  43.1  2.288  33  1
4  5  116  74   0    0  25.6  0.201  30  0

Standardisation

# the paper's formula applied exactly as typed, V - Y/Z: because of operator
# precedence this subtracts the ratio mean/std from every value, not (V - Y) / Z
standardised = df - df.mean() / df.std()

# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")

Min: -4.055 Max: 845.307

As you can see, the values are far from being in [0, 1]. The range of data produced by z-score normalisation is not fixed; it depends entirely on the distribution of the input data.
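For comparison, a correctly parenthesised, column-wise standardisation looks like the sketch below; it centres each column on its mean and divides by its standard deviation, and the result still extends several standard deviations outside [0, 1]:

# column-wise z-score standardisation: (V - mean) / std for every column
standardised = (df - df.mean()) / df.std()

print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")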

Min-max scaling

# scale the whole frame so that its overall minimum maps to 0 and its overall maximum to 1
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())

# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")

Min: 0.000 Max: 1.000

Here we do indeed get values in [0, 1].
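Since the scaling above uses the frame's overall minimum and maximum, only the single smallest value maps to 0 and the single largest to 1; individual columns will not each span the full range. If each feature should individually cover [0, 1] (which is what sklearn's MinMaxScaler does by default), a per-column sketch would be:

# per-column min-max scaling: every column then runs from exactly 0 to 1
min_max_per_column = (df - df.min()) / (df.max() - df.min())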

Discussion

Scalers implementing both of these transformations (StandardScaler and MinMaxScaler respectively), along with a number of others, exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:

  1. There are fewer chances of making a mistake as you have to do less typing.
  2. sklearn will be at least as computationally efficient and often more so.
  3. You should apply the same scaling parameters learned from the training data to the test data, to avoid leaking test-set information. (In most real-world uses this is unlikely to be significant, but it is good practice.) With sklearn you don't need to store the min/max/mean/SD etc. from scaling the training data in order to reuse them on the test data later; you can simply call scaler.fit_transform(X_train) followed by scaler.transform(X_test), as in the sketch after this list.
  4. If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).

I'm sure there are other reasons, but these are the main ones that come to mind.
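To make point 3 concrete, here is a minimal sketch using MinMaxScaler with a train/test split (the header=None argument, the test_size, and treating column 'i' as the outcome are assumptions for the example, not something taken from the question):

import string

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# read the Pima data with no header row and name the columns a..i, as above
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv', header=None)
df.columns = list(string.ascii_lowercase)[:len(df.columns)]

X, y = df.drop(columns='i'), df['i']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = MinMaxScaler()                         # scales each feature to [0, 1] by default
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those parameters on the test data

# the transformation can be undone later if required
X_train_restored = scaler.inverse_transform(X_train_scaled)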

Chris

Your standardization formula is not meant to put values in the range [0, 1]; it standardizes them.

If you want to normalize your data so that it falls in that range, you can use the following formula:

z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)

And you're not obliged to do it manually: just use the sklearn library; you'll find various standardization and normalization methods in its preprocessing module.

Lahcen YAMOUN

Assuming your original dataframe is df and it contains no invalid float values, this should work:

df2 = (df - df.values.min()) / (df.values.max()-df.values.min())
Ben