11

I am trying to build a linear regression model that predicts a son's height from his father's height.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.linear_model import LinearRegression


Headings_cols = ['Father', 'Son']
df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt', 
                 delim_whitespace=True, names=Headings_cols)



X = df['Father']  
y = df['Son']  

model2 = LinearRegression()
model2.fit(y, X)

plt.scatter(X, y,color='g')
plt.plot(X, model.predict(X),color='g')

plt.scatter(y, X, color='r')
plt.plot(y, X, color='r')

I get this error:

ValueError: could not convert string to float: 'Father'

My second question is how to calculate the average height of the sons and the standard error of the mean.

ImportanceOfBeingErnest

3 Answers

32

There are two main issues here:

  1. Getting the data out of the source
  2. Getting the data into the shape that sklearn.linear_model.LinearRegression.fit understands

1. Getting the data out
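The source file contains a header line with the column names. We do not want the column names in our data. Passing names=Headings_cols, as in the question, makes pandas treat that header line as an ordinary data row, which is where the string 'Father' in the error message comes from. Leaving out names lets pd.read_csv use the first line as the header, which is its default behaviour; df.head() then only displays the first few rows as a check. This allows us to later query the dataframe by the column names as usual, i.e. df['Father'].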
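
To see the effect of the names argument in isolation, here is a minimal sketch that reads a tiny in-memory stand-in for the file. The three data rows are made up for illustration and are not taken from Pearson.txt, and sep=r"\s+" is used here as the whitespace separator.

```python
import io
import pandas as pd

# Made-up stand-in for the real file: a header line plus three data rows.
raw = "Father Son\n65.0 60.0\n63.0 63.5\n70.0 69.0\n"

# As in the question: names= without header=0 keeps the header line as data.
bad = pd.read_csv(io.StringIO(raw), sep=r"\s+", names=["Father", "Son"])

# Without names=, the first line becomes the header (the default).
good = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Alternative: keep names= and declare line 0 a header to be replaced.
renamed = pd.read_csv(io.StringIO(raw), sep=r"\s+",
                      names=["Father", "Son"], header=0)

print(bad.iloc[0].tolist())  # the first "data" row holds the header text
print(good.dtypes)           # both columns are parsed as numbers
```

If the custom column names are needed, the header=0 variant is probably the smallest change to the question's code.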

2. Getting the data into shape
LinearRegression.fit (from sklearn.linear_model) takes two arguments: first the "training data", which should be a 2D array, and second the "target values". In the case considered here we simply want to make a fit, so we do not care about these notions too much, but we need to bring the first input to that function into the desired shape. This can easily be done by adding a new axis to that array, i.e. df['Father'].values[:,np.newaxis]
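
A minimal sketch of that reshaping step, using made-up values that lie exactly on y = 2*x + 1 rather than the Pearson data, so the fitted slope and intercept are easy to check by eye:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values lying exactly on y = 2*x + 1, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# shape (4,) -> shape (4, 1): one column, one row per sample
X = x[:, np.newaxis]
print(x.shape, X.shape)

model = LinearRegression()
model.fit(X, y)  # recent scikit-learn versions reject the 1D x here
print(model.coef_, model.intercept_)
```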

The complete working script:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression

df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt',
                 delim_whitespace=True)
df.head() # show the first rows; read_csv already took the header from the first line


# LinearRegression will expect an array of shape (n, 1) 
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,) 
y = df['Son'].values


model2 = LinearRegression()
model2.fit(X, y)

plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')

plt.show()
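
The question also asks for the sons' average height and the standard error of the mean. A minimal sketch, assuming the usual definition of the standard error as the sample standard deviation (ddof=1) divided by the square root of the sample size. The heights below are made up for illustration; with the real data the array would be df['Son'].values, and df['Son'].mean() and df['Son'].sem() should give the same numbers.

```python
import numpy as np

# Made-up heights in inches, for illustration only.
sons = np.array([60.0, 63.5, 69.0, 67.5, 70.0])

mean = sons.mean()
# Sample standard deviation (ddof=1) divided by sqrt(n).
sem = sons.std(ddof=1) / np.sqrt(sons.size)

print(mean, sem)
```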


ImportanceOfBeingErnest
  • Thank you so much for this detailed answer, it helped me with the same issue. I just didn't fully understand the shaping of the data, but I've made the respective edits and it works now! Thanks! – LeleMarieC Nov 25 '17 at 19:39
2

I was looking for the answer to the same question, but the initial dataset URL is no longer valid. The "Father/Son" Pearson height dataset csv can be retrieved from the following URL and then just needs a couple of minor tweaks to work as advertised (note the renaming of the .csv file):

http://www.randomservices.org/random/data/Pearson.html

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import csv

from sklearn.linear_model import LinearRegression

# data retrieved from http://www.randomservices.org/random/data/Pearson.html#

df = pd.read_csv('./pearsons_height_data.csv',
                 quotechar='"',
                 quoting=csv.QUOTE_ALL)

df.head() # show the first rows; read_csv already took the header from the first line

# LinearRegression will expect an array of shape (n, 1)
# for the "Training data"
X = df['Father'].values[:,np.newaxis]
# target data is array of shape (n,)
y = df['Son'].values

model2 = LinearRegression()
model2.fit(X, y)

plt.scatter(X, y,color='g')
plt.plot(X, model2.predict(X),color='k')

plt.show()
D-S
-1

When loading the data, do this instead:

df = pd.read_csv('http://www.math.uah.edu/stat/data/Pearson.txt', 
                 delim_whitespace=True)
df.columns = Headings_cols

You should also make sure X is shaped correctly:

X = df['Father'].values.reshape(-1, 1)
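
As a quick check, reshape(-1, 1) should produce the same (n, 1) array as indexing with np.newaxis. The values below are made up for illustration.

```python
import numpy as np

# Made-up values, for illustration only.
father = np.array([65.0, 63.0, 70.0])

a = father.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
b = father[:, np.newaxis]  # adds a trailing axis of length 1

print(a.shape, b.shape, np.array_equal(a, b))
```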
Alex
  • It gives this error ValueError: Found arrays with inconsistent numbers of samples: [ 1 1078] –  Dec 03 '16 at 03:45
  • Looks like you might be feeding the data in backwards. Try `model2.fit (X, y)` – Alex Dec 03 '16 at 04:42
  • @AlexG The problem is related to the way `LinearRegression.fit` expects its data input. So reshaping or reversing the order of elements does not help. One needs to add a new dimension to the first input array as shown in my solution. – ImportanceOfBeingErnest Dec 03 '16 at 10:00
  • @ImportanceOfBeingErnest that is why I included this line in my solution (a couple days ago): `X = df['Father'].values.reshape(-1, 1)` – Alex Dec 04 '16 at 17:46