37

I'm new to Python and trying to perform linear regression using sklearn on a pandas dataframe. This is what I did:

data = pd.read_csv('xxxx.csv')

After that I got a DataFrame of two columns, let's call them 'c1', 'c2'. Now I want to do linear regression on the set of (c1,c2) so I entered

X=data['c1'].values
Y=data['c2'].values
linear_model.LinearRegression().fit(X,Y)

which resulted in the following error

IndexError: tuple index out of range

What's wrong here? Also, I'd like to know how to:

  1. visualize the result
  2. make predictions based on the result

I've searched and browsed a large number of sites but none of them seemed to instruct beginners on the proper syntax. Perhaps what's obvious to experts is not so obvious to a novice like myself.

Can you please help? Thank you very much for your time.

PS: I have noticed that a large number of beginner questions were down-voted in stackoverflow. Kindly take into account the fact that things that seem obvious to an expert user may take a beginner days to figure out. Please use discretion when pressing the down arrow lest you'd harm the vibrancy of this discussion community.

seralouk
Dinosaur
  • It could be a data problem. It may be helpful to provide a representative sample of your csv. Separately, looking at http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html, at the bottom they create their regression object (regr = linear_model.LinearRegression()), then call regr.fit(X, Y). – Scott Apr 29 '15 at 06:23
  • Regarding your PS: I notice that many beginners questions get down voted due to not formatting their questions according to SO practices: http://stackoverflow.com/help/how-to-ask – Scott Apr 29 '15 at 06:28

5 Answers

59

Let's assume your csv looks something like:

c1,c2
0.000000,0.968012
1.000000,2.712641
2.000000,11.958873
3.000000,10.889784
...

I generated the data as such:

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt

length = 10
x = np.arange(length, dtype=float).reshape((length, 1))
y = x + (np.random.rand(length)*10).reshape((length, 1))

This data is saved to test.csv (only so you know where it came from; you will of course use your own file).
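How the file was written is not shown above. A minimal sketch, assuming pandas is used for the export and that the columns are named c1 and c2 as in the sample; the fixed seed and the 1-D arrays are choices made here for reproducibility, not part of the original answer:

```python
import numpy as np
import pandas as pd

length = 10
rng = np.random.default_rng(0)       # fixed seed (assumption) so the file is reproducible
x = np.arange(length, dtype=float)   # 0.0, 1.0, ..., 9.0
y = x + rng.random(length) * 10      # x plus uniform noise in [0, 10)

# Write the two columns without the index, then read them back to check the layout.
pd.DataFrame({"c1": x, "c2": y}).to_csv("test.csv", index=False)
round_trip = pd.read_csv("test.csv")
print(round_trip.head())
```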

data = pd.read_csv('test.csv', index_col=False, header=0)
x = data.c1.values
y = data.c2.values
print(x)  # prints: [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

You need to take a look at the shape of the data you are feeding into .fit().

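Here x.shape is (10,), but scikit-learn needs x to be 2-D with shape (n_samples, n_features), which is (10, 1) in this case. y may stay 1-D, although reshaping it to (10, 1) also works. So we reshape: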

x = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
y = y.reshape(-1, 1)
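To see what the reshape does, the shapes can be inspected directly. A small sketch with made-up values (not the CSV data); passing -1 asks NumPy to infer the number of rows:

```python
import numpy as np

x = np.arange(5, dtype=float)
print(x.shape, x.ndim)        # (5,) 1

x_2d = x.reshape(-1, 1)       # one column, as many rows as needed
print(x_2d.shape, x_2d.ndim)  # (5, 1) 2

# Indexing with np.newaxis gives the same 2-D column.
x_alt = x[:, np.newaxis]
```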

Now we create the regression object and then call fit():

regr = linear_model.LinearRegression()
regr.fit(x, y)

# plot it as in the example at http://scikit-learn.org/
plt.scatter(x, y,  color='black')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

See the sklearn linear regression example.

Scott
15

Dataset

[image: the dataset, with columns mark1 and mark2]

Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

Importing the dataset

dataset = pd.read_csv('1.csv')
X = dataset[["mark1"]]
y = dataset[["mark2"]]

Fitting Simple Linear Regression to the set

regressor = LinearRegression()
regressor.fit(X, y)

Predicting the set results

y_pred = regressor.predict(X)

Visualising the set results

plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('mark1 vs mark2')
plt.xlabel('mark1')
plt.ylabel('mark2')
plt.show()


WoJ
Samrat Kishore
  • IMHO, `X = dataset[["mark1"]]` is clearer than `reshape`! – marcio Jun 08 '21 at 02:09
  • Just to mention that using `plt.plot(X, regressor.predict(X), color = 'blue')` didn't work for me, I had to use `dataset["mark1"]` instead of `dataset[["mark1"]]` for the `x` otherwise I got a `TypeError: '(slice(None, None, None), None)' is an invalid key` error. – ekke Apr 16 '23 at 19:08
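
The TypeError mentioned in the last comment may depend on the installed pandas and matplotlib versions. A hedged workaround sketch: keep the 2-D DataFrame for scikit-learn, but hand plain 1-D NumPy arrays to matplotlib. The small DataFrame below is a made-up stand-in for 1.csv:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the contents of 1.csv.
dataset = pd.DataFrame({"mark1": [10.0, 20.0, 30.0, 40.0],
                        "mark2": [15.0, 24.0, 36.0, 45.0]})

X = dataset[["mark1"]]   # 2-D, shape (n_samples, 1), for scikit-learn
y = dataset["mark2"]     # 1-D target

regressor = LinearRegression().fit(X, y)

# Convert to 1-D NumPy arrays before plotting.
x_plot = dataset["mark1"].to_numpy()
y_line = regressor.predict(X)

fig, ax = plt.subplots()
ax.scatter(x_plot, y.to_numpy(), color="red")
ax.plot(x_plot, y_line, color="blue")
ax.set_xlabel("mark1")
ax.set_ylabel("mark2")
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.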
8

This answer addresses exactly the error that you got:

IndexError: tuple index out of range

Scikit-learn expects X to be 2D. Reshape X; Y can be reshaped the same way, although a 1D Y is also accepted.

Replace:

X=data['c1'].values # this  has shape (XXX, ) - It's 1D
Y=data['c2'].values # this  has shape (XXX, ) - It's 1D
linear_model.LinearRegression().fit(X,Y)

with

X=data['c1'].values.reshape(-1,1) # this  has shape (XXX, 1) - it's 2D
Y=data['c2'].values.reshape(-1,1) # this  has shape (XXX, 1) - it's 2D
linear_model.LinearRegression().fit(X,Y)
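
A minimal sketch of the failure and the fix on made-up data. In recent scikit-learn versions a 1D X is rejected with a ValueError rather than the IndexError shown in the question, so the sketch catches both; the variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_1d = np.array([0.0, 1.0, 2.0, 3.0])   # shape (4,), as returned by .values
Y = np.array([1.0, 3.0, 5.0, 7.0])      # exactly Y = 2 * X + 1

try:
    LinearRegression().fit(X_1d, Y)
    failed = False
except (ValueError, IndexError) as exc:  # exception type depends on the version
    failed = True
    print(type(exc).__name__)

X_2d = X_1d.reshape(-1, 1)               # shape (4, 1)
model = LinearRegression().fit(X_2d, Y)  # a 1D Y is accepted as-is
print(model.coef_, model.intercept_)
```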
seralouk
6

make predictions based on the result?

To predict,

lr = linear_model.LinearRegression().fit(X,Y)
lr.predict(X)

Is there any way I can view details of the regression?

A fitted LinearRegression has coef_ and intercept_ attributes.

lr.coef_
lr.intercept_

show the slope and intercept.
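
As a sketch of how these attributes relate to predict, the prediction for an unseen value can be reproduced by hand as slope * x + intercept. The data below is made up; np.ravel is used because coef_ and intercept_ are arrays whose shape depends on whether Y was 1-D or 2-D:

```python
import numpy as np
from sklearn import linear_model

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (4, 1)
Y = 2.0 * X + 1.0                           # shape (4, 1), an exact line

lr = linear_model.LinearRegression().fit(X, Y)

slope = np.ravel(lr.coef_)[0]
intercept = np.ravel(lr.intercept_)[0]

X_new = np.array([[4.0], [5.0]])            # unseen inputs, still 2-D
predicted = lr.predict(X_new)
manual = slope * X_new + intercept          # same result computed by hand
print(slope, intercept)
print(predicted)
```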

serv-inc
0

You really should have a look at the scikit-learn docs for the fit method.

For how to visualize a linear regression, play with the scikit-learn linear regression example. I'm guessing you haven't used IPython (now called Jupyter) much either, so you should definitely invest some time in learning it. It's a great tool for exploring data and machine learning. You can literally copy and paste the example into a notebook and run it.

For your specific problem with the fit method, by referring to the docs, you can see that the format of the data you are passing in for your X values is wrong.

Per the docs, "X : numpy array or sparse matrix of shape [n_samples,n_features]"

You can fix your code with this:

X = [[x] for x in data['c1'].values]
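
For comparison, a small sketch showing that the list comprehension yields the same (n_samples, 1) layout as reshape(-1, 1) or double-bracket column selection. The DataFrame is a made-up stand-in for the question's CSV:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data.
data = pd.DataFrame({"c1": [0.0, 1.0, 2.0], "c2": [1.0, 3.0, 5.0]})

X_list = [[x] for x in data["c1"].values]      # list of one-element rows
X_reshaped = data["c1"].values.reshape(-1, 1)  # NumPy array, shape (3, 1)
X_frame = data[["c1"]]                         # DataFrame, shape (3, 1)

print(np.array(X_list).shape, X_reshaped.shape, X_frame.shape)
```

All three can be passed as X to fit; the reshape and double-bracket forms avoid a Python-level loop.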
Tommy