1

I'm new in python and my problem is that I have a given set of data:

import numpy as np

x=np.arange(1,5)
y=np.arange(5,9)

My problem is to find a number n (not necessarily an integer) that will give me the highest value of R^2 value when I plot y^n vs x. I'm thinking of generating n for example as:

n=np.linspace(1,9,100)

I don't know how to execute my idea. My other means is to resort to brute force of generating n and raising y for each value of n. After getting that value (let's say y1), I will plot y1 vs x (which means I have to generate 100 plots. But I have no clue on how to get the R^2 value ( for a linear fit) for a given plot.

What I want to do is to have a list (or array) of R^2 values:

R2= np.array() #a set containing the R^2 values calculated from the plots

and find the max value on that array and from there, find the plot that gave that R^2 value thus I will find a particular n. I don't know how to do this.

Damian Yerrick
  • 4,602
  • 2
  • 26
  • 64
Justin
  • 87
  • 1
  • 10
  • Just to clarify, What does R^2 represents? – FortMauris Oct 31 '14 at 01:27
  • @FortMauris R^2 gives the correlation between two quantities. If I plot y vs x and got an R^2 value of 1 (max value), it means that they're correlated with each other. The closer R^2 to a value of 1, the better – Justin Oct 31 '14 at 01:37
  • R² is also called [the coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) – Mike T Oct 31 '14 at 02:35

1 Answers1

0

If you are able to use the pandas library, this problem is very easy to express:

import pandas
import numpy as np

x = pandas.Series(np.arange(1,5))
y = pandas.Series(np.arange(5,9))
exponents = exponents = np.linspace(1, 9, 100)

r2s = {n:pandas.ols(x=x, y=y**n).r2 for n in exponents}
max(r2s.iteritems(), key=lambda x: x[1])
#>>> (1.0, 1.0)

Breaking this down:

  1. the pandas.Series object is an indexed column of data. It's like a numpy array, but with extra features. In this case, we only care about it because it is something we can pass to pandas.ols.
  2. pandas.ols is a basic implementation of least-squares regression. You can do this directly in numpy with numpy.linalg.lstsq, but it won't directly report the R-squared values for you. To do it with pure numpy, you'll need to get the sum of squared residuals from numpy's lstsq and then perform the formulaic calculations for R-squared manually. You could write this as a function for yourself (probably a good exercise).
  3. The stuff inside the {..} is a dict comprehension. It will iterate over the desired exponents, perform the ols function for each, and report the .r2 attribute (where the R-squared statistic is stored) indexed by whatever exponent number was used to get it.
  4. The final step is to call max on a sequence of the key-value pairs in r2s, and key tells max that it is the second element (the R-squared) by which elements are compared.

An example function to do it with just np.linalg.lstsq is here (good explanation for calculating R2 in numpy):

def r2(x, y):
    x_with_intercept = np.vstack([x, np.ones(len(x))]).T
    coeffs, resid = np.linalg.lstsq(x_with_intercept, y)[:2]
    return 1 - resid / (y.size * y.var())[0]

Then in pure numpy the above approach:

import numpy as np

x = np.arange(1,5)
y = np.arange(5,9)
exponents = np.linspace(1, 9, 100)

r2s = {n:r2(x=x, y=y**n) for n in exponents}
max(r2s.iteritems(), key=lambda x: x[1])
#>>> (1.0, 1.0)

As a final note, there is a fancier way to specify getting the 1-positional item from something. You use the built-in library operator and the callable itemgetter:

max(..., key=operator.itemgetter(1))

The expression itemgetter(1) results in an object that is callable -- when it is called on an argument r it invokes the __getitem__ protocol to result in r[1].

Community
  • 1
  • 1
ely
  • 74,674
  • 34
  • 147
  • 228
  • Hi, sorry I think I made a mistake with my question. What I want to do is to guess n so that I will get an R^2 value of 1 (which is the highest value). I tried making sample data in excel and got the following result for a linear trend: x= (1,2,3,4) y= (1,2,3,4) R^2= 1 and x= (1,2,3,4) y= (1,4,9,16) R^2= 0.969 ... which is what your code is giving. It gives the R^2 value for n=1. For instance for the 2nd set of data, it will have a R^2= 1 value if n=2. Sorry for the wrong idea for my question. – Justin Nov 02 '14 at 11:45