First, I apologize for the vague question; let me explain. I have a pandas DataFrame with two columns, square feet and number of bedrooms. I am trying to predict the price using linear regression and want to run gradient descent on the feature matrix. Because the square-feet values are roughly 1000 times larger than the bedroom counts, gradient descent does not converge well, so I am trying to handle this difference in scale by normalizing the features.

The normalization I am using subtracts each column's mean from every cell in that column (bedrooms and sqrfeet) and divides the result by the column's standard deviation. This is the code I have written:

  # Per-column mean and standard deviation (single brackets return scalars)
  meanb = X['bedrooms'].mean()
  meanFeet = X['sqrfeet'].mean()
  stdb = X['bedrooms'].std()
  stdFeet = X['sqrfeet'].std()

  norb = lambda x: (x - meanb) / stdb
  nors = lambda x: (x - meanFeet) / stdFeet

  X['bedrooms'] = X['bedrooms'].apply(norb)
  X['sqrfeet'] = X['sqrfeet'].apply(nors)

My question: is there an easier way of doing this? My approach won't scale if I have thousands of columns. I am wondering whether there is something like a dataframe.applymap() method that would compute the mean and std of each column and apply the normalization to the cells of that column. Note that the columns can have different ranges of values, but they are all numeric.

sunny

2 Answers

Suppose

1. the price is in the first column, and

2. you want to standardize every column except the price column

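from sklearn import preprocessing

# Price is in the first column (index 0); the features are everything after it
X, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
scaler = preprocessing.StandardScaler().fit(X)  # learns per-column mean and std
X_scaled = scaler.transform(X)  # each column now has zero mean and unit variance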

OR

STD = lambda x: (x-x.mean())/x.std()
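
To run that lambda over every column at once, one option is `DataFrame.apply`, which calls the function once per column. The following is a minimal sketch, assuming a hypothetical all-numeric frame `df`; the sample values are made up, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "sqrfeet": [2104.0, 1600.0, 2400.0, 1416.0],
    "bedrooms": [3.0, 3.0, 3.0, 2.0],
})

# DataFrame.apply passes each column to the function as a Series,
# so the mean and std are computed per column.
STD = lambda x: (x - x.mean()) / x.std()
df_norm = df.apply(STD)

# Equivalent vectorized form: pandas aligns the per-column
# mean and std with the frame by column label.
df_norm2 = (df - df.mean()) / df.std()
```

Note that pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so the two approaches give slightly different numbers unless `ddof=0` is passed to `std()`.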
Takatjuta

Thanks for your help. I am learning that there are many ways of doing this. This is how I actually solved it: you can selectively apply a lambda function to individual labeled columns. For example, to normalize using the mean and max, I used the following sample code (please note that I am not sharing my full code here):

  sqrftMax=data['sqrfeet'].max()
  sqrftMean=data['sqrfeet'].mean()

  #normalized list of sqrfootage.
  nSqrft= data['sqrfeet'].apply(lambda x: (x-sqrftMean)/sqrftMax)
  data['sqrfeet'] =nSqrft
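
The same mean/max normalization can probably be extended to every numeric column without writing one lambda per column. This is a rough sketch, assuming a hypothetical all-numeric frame `data` with made-up values; it is not the full code mentioned above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "sqrfeet": [2104.0, 1600.0, 2400.0, 1416.0],
    "bedrooms": [3.0, 3.0, 3.0, 2.0],
})

# data.mean() and data.max() are Series indexed by column name, so the
# arithmetic below is applied column by column via label alignment.
data_norm = (data - data.mean()) / data.max()
```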
sunny