I have normalized my data and applied regression analysis to predict yield (y), but the predicted output is also normalized (between 0 and 1). I want the prediction in the original units of my data, not in the 0 to 1 range.

Data:

Total_yield(y)    Rain(x)  
      64799.30   720.1  
      77232.40   382.9  
      88487.70  1198.2  
      77338.20   341.4  
      145602.05   406.4 
      67680.50   325.8 
      84536.20   791.8 
      99854.00   748.6 
      65939.90  1552.6 
      61622.80  1357.7
      66439.60   344.3 

Next, I normalized the data using this code:

from sklearn.preprocessing import Normalizer
import pandas
import numpy
dataframe = pandas.read_csv('/home/desktop/yield.csv')
array = dataframe.values
X = array[:,0:2]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
print(normalizedX)

     Total_yield      Rain
0       0.999904  0.013858
1       0.999782  0.020872
2       0.999960  0.008924
3       0.999967  0.008092
4       0.999966  0.008199
5       0.999972  0.007481
6       0.999915  0.013026
7       0.999942  0.010758
8       0.999946  0.010414
9       0.999984  0.005627
10      0.999967  0.008167

Next, I used these normalized values to calculate R-squared with the following code:

array=normalizedX
data = pandas.DataFrame(array,columns=['Total_yield','Rain'])
import statsmodels.formula.api as smf
lm = smf.ols(formula='Total_yield ~ Rain', data=data).fit()
lm.summary()

Output :

<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            Total_yield   R-squared:                       0.752
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     1066.
Date:                Thu, 09 Feb 2017   Prob (F-statistic):          2.16e-108
Time:                        14:21:21   Log-Likelihood:                 941.53
No. Observations:                 353   AIC:                            -1879.
Df Residuals:                     351   BIC:                            -1871.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      1.0116      0.001    948.719      0.000         1.009     1.014
Rain          -0.3013      0.009    -32.647      0.000        -0.319    -0.283
==============================================================================
Omnibus:                      408.798   Durbin-Watson:                   1.741
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            40636.533
Skew:                          -4.955   Prob(JB):                         0.00
Kurtosis:                      54.620   Cond. No.                         10.3
==============================================================================

Now, R-squared = 0.75.

Regression model: y = b0 + b1 * x

Yield  =  b0 + b1 * Rain

Yield  =  intercept + coefficient for Rain * Rain

Now, when I plug in my raw Rain value, I get this answer:
Yield  =  1.0116    + ( -0.3013 * 720.1(mm)) = -215.95

A yield of -215.95 is wrong.

And when I use the normalized value for Rain, the predicted yield comes out normalized, between 0 and 1.

I want to predict: if rainfall is 720.1 mm, what will the yield be?

Can anyone help me get the predicted yield? I want to compare predicted yield vs. given yield.
Kiran Prajapati

1 Answer

First, you should not use Normalizer in this case. It does not scale each feature (column); it rescales each sample (row) to unit norm, which is probably not what you want.

Use MinMaxScaler or RobustScaler to scale each feature. See the preprocessing docs for more details.
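
To make the row-versus-column difference concrete, here is a minimal sketch using the first three rows of the question's data. The variable names are only for illustration. It shows that Normalizer forces every row to unit length, which ties the two columns together, while MinMaxScaler rescales each column on its own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# First three rows of the question's data: columns are [Total_yield, Rain].
data = np.array([
    [64799.30, 720.1],
    [77232.40, 382.9],
    [88487.70, 1198.2],
])

# Normalizer rescales each ROW to unit L2 norm, so in every row
# yield**2 + rain**2 == 1 and the two columns are no longer independent.
row_normalized = Normalizer().fit_transform(data)
row_norms = np.linalg.norm(row_normalized, axis=1)

# MinMaxScaler rescales each COLUMN independently to the range [0, 1].
col_scaled = MinMaxScaler().fit_transform(data)

print(row_normalized)
print(row_norms)
print(col_scaled)
```

Because each normalized row lies on the unit circle, a relationship between the two normalized columns is built in by the transformation itself, so an R-squared computed on them is not a statement about rain and yield.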

Second, these classes have an inverse_transform() method, which converts the predicted y value back to the original units.

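import numpy as np
from sklearn.preprocessing import RobustScaler

# Rain (x) and Total_yield (y) as column vectors, as scikit-learn expects
x = np.asarray([720.1,382.9,1198.2,341.4,406.4,325.8,
                791.8,748.6,1552.6,1357.7,344.3]).reshape(-1,1)
y = np.asarray([64799.30,77232.40,88487.70,77338.20,145602.05,67680.50,
              84536.20,99854.00,65939.90,61622.80,66439.60]).reshape(-1,1)

# Use a separate scaler for each variable so y can be inverted later
scalerx = RobustScaler()
x_scaled = scalerx.fit_transform(x)

scalery = RobustScaler()
y_scaled = scalery.fit_transform(y)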

Fit your statsmodels OLS model on this scaled data. When predicting, first transform your test data (note the 2D input, one row and one column):

x_scaled_test = scalerx.transform([[720.1]])

Apply your regression model to this value to get the result. The resulting y will be in scaled units.

Yield_scaled  =  b0 + b1 * x_scaled_test

So inverse transform it to get data in original units.

Yield_original = scalery.inverse_transform(Yield_scaled)
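
Putting the steps above together, a minimal end-to-end sketch might look like the following. One assumption to flag: `np.polyfit` with degree 1 stands in here for the statsmodels OLS fit, only to keep the example short; both compute the ordinary least-squares line. The names `yield_scaled`, `yield_original`, and `yield_direct` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.asarray([720.1, 382.9, 1198.2, 341.4, 406.4, 325.8,
                791.8, 748.6, 1552.6, 1357.7, 344.3]).reshape(-1, 1)
y = np.asarray([64799.30, 77232.40, 88487.70, 77338.20, 145602.05, 67680.50,
                84536.20, 99854.00, 65939.90, 61622.80, 66439.60]).reshape(-1, 1)

scalerx = RobustScaler()
scalery = RobustScaler()
x_scaled = scalerx.fit_transform(x)
y_scaled = scalery.fit_transform(y)

# Assumption: np.polyfit (degree 1) replaces the statsmodels OLS call.
# It returns the slope first, then the intercept.
b1, b0 = np.polyfit(x_scaled.ravel(), y_scaled.ravel(), 1)

# Scale the new rainfall value with the SAME scaler used for training.
x_scaled_test = scalerx.transform([[720.1]])

# Predict in scaled units, then convert back to original yield units.
yield_scaled = b0 + b1 * x_scaled_test
yield_original = scalery.inverse_transform(yield_scaled)

# Cross-check: a line fitted directly on the unscaled data gives the
# same prediction, because per-feature scaling is only a change of units.
c1, c0 = np.polyfit(x.ravel(), y.ravel(), 1)
yield_direct = c0 + c1 * 720.1

print(yield_original[0, 0], yield_direct)
```

The slope and intercept obtained on scaled data apply only to scaled inputs. They should not be plugged into raw rainfall values, which is what produced the negative yield in the question.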

But in my opinion, this linear model will not be very accurate, because when I plotted your data, this was the result:

[Figure: Rain (x) vs. Yield (y) plot]

A linear model will not fit this data well. Use other techniques, or get more data.

Vivek Kumar
  • When I use MinMaxScaler or standardization I get R-square: 0.07, and when I use Normalizer I get R-square: 0.75; that is why I chose Normalizer. Okay, I will use RobustScaler. – Kiran Prajapati Feb 09 '17 at 10:28
  • I do not understand what this means: "inverse_transform() function which can convert the predicted y value back to original units". Can you explain it, please? It would help me. – Kiran Prajapati Feb 09 '17 at 10:30
  • It means that it will invert the scaling and recover the original value from the scaled value. – Vivek Kumar Feb 09 '17 at 11:52
  • I am using RobustScaler, but I am confused about what to do after I get the slope and intercept. Can I use the same values with the original data? I mean, if I get slope = 0.58 and intercept = 0.89 using RobustScaler, do I use those same values to predict yield? – Kiran Prajapati Feb 09 '17 at 12:08
  • You `transform` your test data with the same scaler you used during training, then predict, and then `inverse_transform` the result to get the answer in original units. I will add code to show this. – Vivek Kumar Feb 09 '17 at 12:11
  • It would be very helpful if you gave code for that. Thanks. – Kiran Prajapati Feb 09 '17 at 12:14
  • If you are satisfied, close the question and/or accept the answer. If you want more help, either edit the question or post a new question. – Vivek Kumar Feb 09 '17 at 12:42
  • Thanks for your help. – Kiran Prajapati Feb 09 '17 at 12:46
  • When I use x_scaled_test = scalerx.transform([720.1]), it gives the same number as x_scaled for 720.1. And for b0 and b1, do I use the same method as in my code? – Kiran Prajapati Feb 09 '17 at 12:59
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/135278/discussion-between-vivek-kumar-and-kiran-prajapati). – Vivek Kumar Feb 09 '17 at 13:28