I want to calculate multiple linear regression with numpy. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    # Parentheses let the call span two lines without a continuation character.
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(output for above:)

y      x1   x2   x3     x4 x5   x6 x7
20.64, 0.0, 296, 54.7,  0, 519, 2, 24.0
25.12, 0.0, 387, 54.7,  1, 678, 2, 24.0
19.22, 0.0, 535, 54.7,  0, 296, 2, 24.0
18.99, 0.0, 519, 18.97, 0, 296, 2, 54.9
18.89, 0.0, 296, 18.97, 0, 535, 2, 54.9
25.51, 0.0, 678, 18.97, 1, 387, 2, 54.9
20.19, 0.0, 296, 25.51, 0, 519, 2, 54.9
20.75, 0.0, 535, 25.51, 0, 296, 2, 54.9
24.13, 0.0, 387, 25.51, 1, 678, 2, 54.9
19.24, 0.0, 519, 0,     0, 296, 2, 55.0
20.90, 0.0, 296, 0,     0, 535, 2, 55.0
25.30, 0.0, 678, 0,     1, 387, 2, 55.0
20.78, 0.0, 296, 0,     0, 519, 2, 55.2
23.01, 0.0, 535, 0,     0, 296, 2, 55.2
25.20, 0.0, 387, 0,     1, 678, 2, 55.2
19.12, 0.0, 519, 0,     0, 296, 2, 55.3
20.03, 0.0, 296, 0,     0, 535, 2, 55.3
25.22, 0.0, 678, 0,     1, 387, 2, 55.3

I have written this function, which I think gives the coefficients a1 ... a7 and c from Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c.

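import numpy as np

def calculate_linear_regression_numpy(xx, yy):
    """Calculate a multiple linear regression by least squares."""
    # Append a column of ones to the x rows for the constant term.
    A = np.column_stack((xx, np.ones(len(xx))))
    # lstsq returns a tuple; element 0 holds the fitted parameters.
    # rcond=None selects NumPy's current default cutoff and avoids a FutureWarning on some versions.
    coeffs = np.linalg.lstsq(A, yy, rcond=None)[0]

    return coeffs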

xx is a list containing each row of x values, and yy is a list of all the y values.

A looks like this:

00 = {ndarray} [   0.   296.   519.    2.    0.   24.    54.7    1. ]
01 = {ndarray} [   0.   296.   535.    2.    0.   24.    54.7    1. ]
02 = {ndarray} [   0.   387.   678.    2.    1.   24.    54.7    1. ]
03 = {ndarray} [   0.   296.   519.    2.    0.   54.9   18.97957206    1. ]
04 = {ndarray} [   0.   296.   535.    2.    0.   54.9   18.97957206    1. ]
05 = {ndarray} [   0.   387.   678.    2.    1.   54.9   18.97957206    1. ]
06 = {ndarray} [   0.   296.   519.    2.    0.   54.9   25.518085    1.   ]
07 = {ndarray} [   0.   296.   535.    2.    0.   54.9   25.518085    1.   ]
08 = {ndarray} [   0.   387.   678.    2.    1.   54.9   25.518085    1.   ]
09 = {ndarray} [   0.   296.   519.    2.    0.   55.    0.    1.]
10 = {ndarray} [   0.   296.   535.    2.    0.   55.    0.    1.]
11 = {ndarray} [   0.   387.   678.    2.    1.   55.    0.    1.]
12 = {ndarray} [   0.   296.   519.    2.    0.   55.2   0.    1. ]
13 = {ndarray} [   0.   296.   535.    2.    0.   55.2   0.    1. ]
14 = {ndarray} [   0.   387.   678.    2.    1.   55.2   0.    1. ]
15 = {ndarray} [   0.   296.   519.    2.    0.   55.3   0.    1. ]
16 = {ndarray} [   0.   296.   535.    2.    0.   55.3   0.    1. ]
17 = {ndarray} [   0.   387.   678.    2.    1.   55.3   0.    1. ]

And the np.dot(A,coeffs) is this:

[ 19.69873196  20.33871176  24.95249051  19.59198545
20.23196525  24.845744    19.41602911  20.05600891  24.66978766
20.09928377  20.73926357  25.35304232  20.09237109  20.73235089
25.34612964  20.08891474  20.72889454  25.34267329]

When the function returns, coeffs contains these 8 values:

[0.0, -0.0010535377771944548, 0.039998737474281849, 0.62111016637058492, -1.0101687709958682, -0.034563440146209781, -0.026910757873959575, 0.31055508318529385]

I don't know whether coeffs[0] or coeffs[7] is the c from the equation for Y defined above.

I take these coeffs and calculate the new Ŷ by multiplying them with the new ẍ's, like this:

Ŷ = a1ẍ1 + a2ẍ2 + a3ẍ3 + a4ẍ4 + a5ẍ5 + a6ẍ6 + a7ẍ7 + c

Am I calculating Ŷ correctly? And what should I do when I get a negative Ŷ? Which term is the c (a[0] or a[7])?

xeon123
  • The `c` term would be `a[7]`, since you are putting the ones column at the end, but your coefficients don't make sense. You can check by running `print(np.dot(A, coeffs))`; it should give you yy, or something very similar. When I tried, I got the coefficients `[ -0.49104607 0.83271938 0.0860167 0.1326091 6.85681762 22.98163883 -41.08437805 -19.08085066]` – Noel Segura Meraz Jan 14 '16 at 10:23
  • And what should I do when I get a negative Ŷ? What does it mean? – xeon123 Jan 14 '16 at 12:03
  • Sorry, why doesn't your A match your x values? – Noel Segura Meraz Jan 14 '16 at 12:14
  • I have updated the A. – xeon123 Jan 14 '16 at 12:19
  • I still don't see where the first row of A comes from; those values seem way off. – Noel Segura Meraz Jan 14 '16 at 12:23
  • I get the `A` from `A = np.column_stack((xx, np.ones(len(xx))))`. I don't understand. The A comes from the xx and yy values that I have presented here. – xeon123 Jan 14 '16 at 13:12
  • Look at the x2 and x3 values of row 00: they are 1.10224946e+09 and 4.40557880e+07, which don't appear anywhere in the first group of data you presented. Also, there are 18 rows in the first data set and 19 in `A`. – Noel Segura Meraz Jan 14 '16 at 13:17
  • I see. I have a way-off value. This is a mistake, and I am going to delete it. But my question is, what should I do when I get a negative `Ŷ`? What does it mean? – xeon123 Jan 14 '16 at 13:24
  • If you input the x values in the right order, then it just means that this is the value your regression is calculating. What a negative Y means depends on what you are calculating. But at the equation level, a negative answer is completely valid. – Noel Segura Meraz Jan 14 '16 at 13:31
  • My Ŷ is measured in seconds, so if I get a negative value, it doesn't make much sense, right? – xeon123 Jan 14 '16 at 13:34
  • Do you mind if we continue this conversation in the chat? I guess your problem has more to do with the handling of your data than with your code. – Noel Segura Meraz Jan 14 '16 at 13:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/100669/discussion-between-noel-segura-and-xeon123). – Noel Segura Meraz Jan 14 '16 at 13:37

1 Answer

The columns keep the order you specify them in; otherwise you would be unable to use the coefficients!

Remember, from the matrix form of the least squares problem, your estimate of Y is given by A dot C where C is your coefficient vector/matrix.

So, print out A, and it should be in the form X1 ... X7 [column of ones].

Whichever column number contains your ones is the index of the offset coefficient in the coefficient vector.

Judging by the size of the parameters alone, coeffs[7] looks to be the offset, as it is orders of magnitude larger, which doesn't look logical for a multiplicative coefficient given the X and Y values you supplied.
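
To make the point about the column of ones concrete, here is a minimal sketch using small made-up data (not the data from the question) generated from known coefficients. The names `X`, `A_last`, `A_first` and `x_new` are only illustrative. It fits the same data twice, once with the ones column last and once with it first, and the offset shows up at the matching index each time.

```python
import numpy as np

# Made-up, noise-free data for illustration only: y = 2*x1 - 3*x2 + 5
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 8.0]])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

# Ones column last (as in the question): the offset is the last coefficient.
A_last = np.column_stack((X, np.ones(len(X))))
coeffs_last = np.linalg.lstsq(A_last, y, rcond=None)[0]

# Ones column first: the offset moves to index 0.
A_first = np.column_stack((np.ones(len(X)), X))
coeffs_first = np.linalg.lstsq(A_first, y, rcond=None)[0]

# The estimate of Y is A dot C, whichever layout is used.
y_hat = A_last.dot(coeffs_last)

# Predicting for a new row: append the 1 in the same position as in A.
x_new = np.array([6.0, 2.0])
y_new = np.append(x_new, 1.0).dot(coeffs_last)

print(coeffs_last)   # offset is at index -1
print(coeffs_first)  # offset is at index 0
print(y_new)
```

The same rule applies to new inputs: the x values must be in the same column order that was used to build A, otherwise the coefficients get paired with the wrong variables.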

Chris
  • Does it make sense to add the difference between the previously predicted value `Ŷ` and the real value `Y` to the new `Ŷ`, in order to reduce the error in the new prediction? – xeon123 Jan 14 '16 at 10:36
  • Can you add what your A matrix looks like? Also, adding the residual does not really make sense. By definition, the model already fits the data with the least overall error in the first step. What you should do is plot your residuals. If they look random, you will not do better. If they seem to have some structure, you need to look at a different model form (e.g. nonlinear regression). – Chris Jan 14 '16 at 10:45
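
The residual check suggested in the last comment could be sketched as follows. The data here is made up and deliberately curved, so that a straight-line fit leaves visible structure; the variable names are illustrative and nothing below comes from the question's data set.

```python
import numpy as np

# Made-up data with a curved relationship, for illustration only.
x = np.linspace(0.0, 10.0, 21)
y = 1.0 + 0.5 * x**2

# Fit a straight line y = a*x + c (ones column last, as in the question).
A = np.column_stack((x, np.ones(len(x))))
coeffs = np.linalg.lstsq(A, y, rcond=None)[0]

# Residuals: observed values minus fitted values.
residuals = y - A.dot(coeffs)

# With a ones column in A, least-squares residuals sum to (numerically) zero,
# so there is no overall offset left to add back to later predictions.
print(residuals.sum())

# Structure check: here the residuals are positive at both ends and negative
# in the middle, a systematic pattern that hints at a missing nonlinear term.
print(np.round(residuals, 2))
```

In practice one would plot `residuals` against `x` or against the fitted values (for example with matplotlib) rather than read the printed numbers. A random-looking scatter suggests the linear form is adequate, while a pattern like the one above suggests a different model form.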