
I have points with x and y coordinates that I want to fit a straight line to with linear regression, but I get a jagged-looking line.

I am attempting to use LinearRegression from sklearn.

To create the points, I run a for loop that randomly creates one hundred points in an array of shape 100 x 2. I slice the left column for the xs and the right column for the ys.

I expect to get a straight line when I plot the output of m.predict.

import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression

X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)

plt.scatter(X[:,0], X[:,1], s=10)
plt.show()

[Figure: scatter plot of the 100 generated points]

m = LinearRegression()
m.fit(X[:,0].reshape(1, -1), X[:,1].reshape(1, -1))

plt.plot(m.predict(X[:,0].reshape(1, -1))[0])

[Figure: the jagged line produced by the plot call above]

Ant
  • Because you didn't do a regression. You asked that a curve be passed through each point. – duffymo May 11 '22 at 17:23
  • You are plotting predicted points, not the regression line. – shaik moeed May 11 '22 at 17:26
  • Thank you, but I don't know what to do differently. I followed the docs and I copied [this answer](https://stackoverflow.com/questions/40941542/using-scikit-learn-linearregression-to-plot-a-linear-fit). – Ant May 11 '22 at 17:38
  • You're reshaping wrong: `(1, -1)` says one _row_ (datapoint) and however many columns (features) are needed. You want `(-1, 1)` i.e. one _column_ and however many _rows_ are needed. – Ben Reiniger May 11 '22 at 22:40
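
The reshape fix from the last comment can be sketched as follows. This is a minimal example under a few assumptions: it uses NumPy's `default_rng` with an arbitrary seed instead of the `random` loop from the question, and the names `x_col`, `y_hat`, `order`, `line_x` and `line_y` are made up for illustration. With `reshape(1, -1)` the model sees a single sample with 100 features, which is probably why the plotted "prediction" looks like the raw ys in index order (that is my reading, not something stated in the thread). With `reshape(-1, 1)` it sees 100 samples with one feature each.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate data similar to the question's (the seed is an arbitrary choice).
rng = np.random.default_rng(0)
offsets = 0.6 * np.arange(100)
x = rng.random(100) * 20 + offsets - 0.4
y = rng.random(100) * 15 + offsets

# reshape(1, -1): one row (sample) holding all 100 values as features.
wrong_shape = x.reshape(1, -1).shape

# reshape(-1, 1): 100 rows (samples), one column (feature) each.
x_col = x.reshape(-1, 1)

m = LinearRegression()
m.fit(x_col, y)            # y can stay 1D
y_hat = m.predict(x_col)   # one prediction per point, all on the fitted line

# Sorting by x is optional for a straight line, but it keeps the drawn
# path running left to right when passed to plt.plot(line_x, line_y).
order = np.argsort(x)
line_x, line_y = x[order], y_hat[order]
```

Plotting `line_x` against `line_y` (rather than the predictions alone against their index) puts the x coordinates on the horizontal axis, which is what shows the fit as a straight line.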

1 Answer


I am not good with numpy, but I think the problem is the use of reshape() to convert X[:,0] and X[:,1] from 1D to 2D: with (1, -1), each resulting 2D array has a single row holding all 100 values, instead of 100 rows with one value each (one row per point). This results in an undesired regressor.
I was able to recreate the model using pandas and plot the desired result. The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
import pandas as pd
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)

y_train = pd.DataFrame(X[:,1],columns=['y'])
X_train = pd.DataFrame(X[:,0],columns=['X'])

# plt.scatter(X_train, y_train, s=10)
# plt.show()

m = LinearRegression()
m.fit(X_train, y_train)
plt.scatter(X_train,y_train)
plt.plot(X_train,m.predict(X_train),color='red')

[Figure: scatter plot of the points with the fitted regression line in red]

micro5
  • I was able to get it working by putting each x value in its own individual array, like so: `the_x = np.array([[z] for z in X[:,0]])`. The `X[:,1]`, or `y`, values remained the same. I appreciate your answer, but I'm not a fan of bringing pandas in. I'll give you the check if you use `the_x = np.array([[z] for z in X[:,0]])` instead. What a time-consuming wrinkle sklearn built into this thing. – Ant May 11 '22 at 18:18
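
For comparison, the list comprehension in that comment builds the same column-shaped array that NumPy can produce directly. A small sketch, using a 5 x 2 stand-in array rather than the 100 x 2 one from the question (the names `by_loop`, `by_reshape`, `by_index` and `by_newaxis` are illustrative only):

```python
import numpy as np

# Small stand-in for the question's 100 x 2 array of points.
X = np.arange(10, dtype=float).reshape(5, 2)

# The workaround from the comment: wrap each x value in its own list.
by_loop = np.array([[z] for z in X[:, 0]])

# Equivalent ways to get one column with as many rows as there are points.
by_reshape = X[:, 0].reshape(-1, 1)
by_index = X[:, [0]]                 # indexing with a list keeps the 2D shape
by_newaxis = X[:, 0][:, np.newaxis]
```

All four arrays have shape `(5, 1)` here, so any of them should be accepted as the feature matrix without pulling in pandas.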