12

Problem

Am looking to apply the y = mx + b equation (where m is SLOPE, b is INTERCEPT) to a data set, which is retrieved as shown in the SQL code. The values from the (MySQL) query are:

SLOPE = 0.0276653965651912
INTERCEPT = -57.2338357550468

SQL Code

SELECT
  ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

  ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
  (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,
FROM
(SELECT
  D.AMOUNT,
  Y.YEAR
FROM
  CITY C, STATION S, YEAR_REF Y, MONTH_REF M, DAILY D
WHERE
  -- For a specific city ...
  --
  C.ID = 8590 AND
  -- Find all the stations within a 15 unit radius ...
  --
  SQRT( POW( C.LATITUDE - S.LATITUDE, 2 ) + POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND
  -- Gather all known years for that station ...
  --
  S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND
  -- The data before 1900 is shaky; insufficient after 2009.
  --
  Y.YEAR BETWEEN 1900 AND 2009 AND
  -- Filtered by all known months ...
  --
  M.YEAR_REF_ID = Y.ID AND
  -- Whittled down by category ...
  --
  M.CATEGORY_ID = '001' AND
  -- Into the valid daily climate data.
  --
  M.ID = D.MONTH_REF_ID AND
  D.DAILY_FLAG_ID <> 'M'
  GROUP BY Y.YEAR
  ORDER BY Y.YEAR
) t

Question

The following results (to calculate the start and end points of the line) appear incorrect. Why are the results off by ~10 degrees (e.g., outliers skewing the data)?

(1900 * 0.0276653965651912) + (-57.2338357550468) = -4.66958228

(2009 * 0.0276653965651912) + (-57.2338357550468) = -1.65405406

(Note that the data no longer match the image; the code.)

I would have expected the 1900 result to be around 10 (not -4.67) and the 2009 result to be around 11.50 (not -1.65).

Related Sites

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Dave Jarvis
  • 30,436
  • 41
  • 178
  • 315

2 Answers2

1

This has now been verified as correct:

SELECT
  ((sum(t.YEAR) * sum(t.AMOUNT)) - (count(1) * sum(t.YEAR * t.AMOUNT))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as SLOPE,

  ((sum( t.YEAR ) * sum( t.YEAR * t.AMOUNT )) -
  (sum( t.AMOUNT ) * sum(power(t.YEAR, 2)))) /
  (power(sum(t.YEAR), 2) - count(1) * sum(power(t.YEAR, 2))) as INTERCEPT,

  ((avg(t.AMOUNT * t.YEAR)) - avg(t.AMOUNT) * avg(t.YEAR)) /
  (stddev( t.AMOUNT ) * stddev( t.YEAR )) as CORRELATION
FROM (
  SELECT
    AVG(D.AMOUNT) as AMOUNT,
    Y.YEAR as YEAR
  FROM
    CITY C,
    STATION S,
    YEAR_REF Y,
    MONTH_REF M,
    DAILY D
  WHERE
    C.ID = 8590 AND

    SQRT(
      POW( C.LATITUDE - S.LATITUDE, 2 ) +
      POW( C.LONGITUDE - S.LONGITUDE, 2 ) ) < 15 AND

    S.STATION_DISTRICT_ID = Y.STATION_DISTRICT_ID AND

    Y.YEAR BETWEEN 1900 AND 2009 AND

    M.YEAR_REF_ID = Y.ID AND

    M.CATEGORY_ID = '001' AND

    M.ID = D.MONTH_REF_ID AND
    D.DAILY_FLAG_ID <> 'M'
  GROUP BY
    Y.YEAR
) t

See the image for details on slope, intercept, and (Pearson's) correlation.

Dave Jarvis
  • 30,436
  • 41
  • 178
  • 315
  • 4
    Congratulations. Maybe someone should talk to you about R http://www.r-project.org/ You could do it with a one-liner – gd047 May 10 '10 at 09:52
  • Keep in mind that response variables in regression are assumed to be independent of each other. This assumption is not appropriate sometimes, if the data are collected over time (which is true in your case) and the observations that are taken close together are related. Lack of independence causes no bias in least squares estimates of the coefficients, but standard errors are seriously affected. – gd047 May 11 '10 at 11:11
  • google for "linear regression assumptions" independence "serial correlation" i.e. http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html#Lack%20of%20independence – gd047 May 11 '10 at 16:29
  • If you do things in R it will be easy for you to try more complicated models. Autoregressive models would probably be useful here. – Jonathan Chang May 11 '10 at 20:25
1

Try to split up the function, you have miscalculated the parameters. Have a look here for reference.

I would do something like the following (please excuse the fact that I don't remember much about SQL syntax and temporary variables, so the code might actually be wrong):

SELECT

sum(t.YEAR) / count(1) AS avgX,

sum(t.AMOUNT) / count(1) AS avgY,

sum(t.AMOUNT*t.YEAR) / count(1) AS avgXY,

sum(power(t.YEAR, 2)) / count(1) AS avgXsq,

( avgXY - avgX * avgY ) / ( avgXsq - power(avgX, 2) )  as SLOPE,

avgY - SLOPE * avgX as INTERCEPT,
NeXuS
  • 1,797
  • 1
  • 13
  • 13