Linear and non-linear regression concerns

Question

I'm trying to fit a polynomial regression to the data in the scatter plot, and I have two concerns:

The red line, which is the polynomial regression, looks wrong to me when compared with the plotted data values.
How can I calculate the R-squared for each regression?

Part of the x and y data used (taken from the Excel file) is shown below. There is one y series per column, and each column holds the total values for a specific region.

x=[1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980...]

y=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.164, 0.16499999999999998, 0.16999999999999998, 0.175, 0.17200000000000001, 0.185, 0.189, 0.195, 0.201...]

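import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

#read the data: skip the two rows above the header, keep the first 55 columns, drop empty rows and the last 10 rows
Renew = pd.read_excel('bp-stats-review-2019-all-data.xlsx', sheet_name = 'Renewables - TWh', skiprows=2, usecols = range(55)).dropna(axis=0,how='all').iloc[:-10]
Renew.fillna('0',inplace=True)

#Taking only the Totals (rows whose label starts with 'Total'), indexed by region name
Countries_Renew_Total = Renew[Renew['Terawatt-hours'].str.startswith('Total')].sort_values(['Terawatt-hours']).set_index('Terawatt-hours')

#build the Linear plot regression by region: after the transpose, rows are years and columns are regions
df=Countries_Renew_Total.drop(['Total World']).transpose()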

for n, j in enumerate(df.columns):
    print('The region is: '+j)
    print(n)
    for i in range(1,3):
        #import the dataset: years as x, the region's totals as y
        x=df.index.values.reshape(-1,1)
        y=df.iloc[:,n].values.reshape(-1,1)

        #Fit the linear regression
        lin=LinearRegression()
        lin.fit(x,y)

        #Fit the Poly regression: expand x into polynomial terms, then fit a linear model on them
        poly = PolynomialFeatures(degree = i)
        x_poly = poly.fit_transform(x)
        lin2=LinearRegression()
        lin2.fit(x_poly,y)

        #Plot Poly regression
        plt.scatter(x,y,color='blue')
        plt.plot(x,lin2.predict(x_poly),color='red')
        plt.title('Polynomial Regression degree '+str(i))
        plt.xlabel('Year')
        plt.ylabel('Renewable Generation (TWh)')
        plt.show()

        #Predict the next two years with the already fitted transformer
        print(lin2.predict(poly.transform([[2019]])))
        print(lin2.predict(poly.transform([[2020]])))


You said you have two concerns but listed one. Can you also post images of the plots you are concerned about? — Sean Payne, May 09 '20 at 12:12
Thanks for this flag Sean! I added the second concern and JohanC edited the post to show the pics. Thanks a lot JohanC — Tayzer Damasceno, May 09 '20 at 13:17

score 0 · Answer 1 · answered May 09 '20 at 13:33

The first graph you posted actually looks about how I would expect. Most of the points lie along a nearly horizontal line, with a few of the rightmost points extending upwards. The line of best fit is nearly flat because it minimizes the total error (the sum of squared vertical distances between your predictions and the actual values), and the many flat points outweigh the few high ones. Does this make sense?
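To make the "minimize the error" idea concrete, and to touch on the R-squared concern from the question, here is a minimal sketch in plain Python. The data and the helper names `fit_line` and `r_squared` are made up for illustration; they are not from the original spreadsheet or from any library.

```python
# Illustrative data (made up for this sketch): mostly flat, rising at the end.
years = list(range(2000, 2010))
values = [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 1.0, 2.5, 5.0]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(ys, predictions):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(years, values)
predictions = [slope * x + intercept for x in years]
r2 = r_squared(values, predictions)
print(f"slope={slope:.4f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

On data shaped like this, the straight line has a modest slope and an R-squared well below 1, which is the numeric counterpart of a line that looks too flat for the last few points.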

Note that to run a linear regression on exponential data, you first take the log of the y values, which turns them into a roughly linear data set. Does that make sense?
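A rough sketch of the log approach, assuming NumPy is available. The data here is synthetic (an exact exponential), not the renewables data. Since a logarithm is only defined for positive values, zero entries like the early years in the question would have to be dropped or handled separately first; the mask below is one way to do that.

```python
import numpy as np

# Illustrative data (made up): y grows exponentially, y = 2 * exp(0.3 * t).
t = np.arange(10, dtype=float)
y = 2.0 * np.exp(0.3 * t)

# A logarithm is only defined for positive values, so keep only y > 0.
mask = y > 0
log_y = np.log(y[mask])

# Fit a straight line to (t, log y): log y = slope * t + intercept.
slope, intercept = np.polyfit(t[mask], log_y, deg=1)

# Transform back to the original scale: y = exp(intercept) * exp(slope * t).
y_fit = np.exp(intercept) * np.exp(slope * t)
print(slope, np.exp(intercept))
```

The fit happens on the log scale, so it minimizes errors in log(y) rather than in y itself; that is worth keeping in mind when comparing it with a polynomial fit.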

Your second example is a little more confusing, as I'm not familiar with the PolynomialFeatures function, but I agree the curve does not look very accurate.
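For reference, one possible way to combine PolynomialFeatures with LinearRegression and report an R-squared per degree is sketched below, using scikit-learn's `make_pipeline` and `r2_score`. The data is synthetic, and shifting the years so they start at zero is an assumption of this sketch (it keeps the polynomial terms small); the original code does not do this.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (made up): a quadratic trend over the years 1965-2018.
years = np.arange(1965, 2019)
x = (years - years.min()).reshape(-1, 1).astype(float)  # years since 1965
y = 0.01 * x.ravel() ** 2 + 0.5

r2_by_degree = {}
for degree in (1, 2):
    # The pipeline expands x into polynomial terms, then fits a linear model.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x, y)
    r2_by_degree[degree] = r2_score(y, model.predict(x))
    print(f"degree {degree}: R^2 = {r2_by_degree[degree]:.4f}")

# Predict 2019 with the last fitted (degree 2) model, applying the same shift.
prediction_2019 = model.predict([[2019 - years.min()]])
print(prediction_2019)
```

The same `r2_score(y, model.predict(x))` call could be placed inside the loop from the question to get one R-squared per region and degree.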

Totally agree with you, Sean! I tried increasing the degree, which worked for some of the data, but I'll try the log approach to see whether the result is better. Thank you so much. — Tayzer Damasceno, May 09 '20 at 13:58
