15

I have some data in a sql database and I'd like to calculate the slope. The data has this layout:

Date        |  Keyword  |  Score    
2012-01-10  |  ipad     |  0.12    
2012-01-11  |  ipad     |  0.17    
2012-01-12  |  ipad     |  0.24    
2012-01-10  |  taco     |  0.19    
2012-01-11  |  taco     |  0.34    
2012-01-12  |  taco     |  0.45    

I'd like the final output to look like this by creating a new table using SQL:

Date        |  Keyword  |  Score |  Slope    
2012-01-10  |  ipad     |  0.12  |  0.06    
2012-01-11  |  ipad     |  0.17  |  0.06    
2012-01-12  |  ipad     |  0.24  |  0.06    
2012-01-10  |  taco     |  0.19  |  0.13    
2012-01-11  |  taco     |  0.34  |  0.13    
2012-01-12  |  taco     |  0.45  |  0.13

To complicate things, not all Keywords have 3 dates worth of data, some have only 2 for instance.

The simpler the SQL the better since my database is proprietary and I'm not quite sure what formulas are available, although I know it can do OVER(PARTITION BY) if that helps. Thank you!

UPDATE: I define the slope as best fit y=mx+p aka in excel it would be =slope()

Here is another actual example that I usually manipulate in excel:

date        keyword         score       slope   
1/22/2012   water bottle    0.010885442 0.000334784  
1/23/2012   water bottle    0.011203949 0.000334784  
1/24/2012   water bottle    0.008460835 0.000334784  
1/25/2012   water bottle    0.010363991 0.000334784  
1/26/2012   water bottle    0.011800716 0.000334784  
1/27/2012   water bottle    0.012948411 0.000334784  
1/28/2012   water bottle    0.012732459 0.000334784  
1/29/2012   water bottle    0.011682568 0.000334784  
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
datayoda
  • 1,067
  • 2
  • 12
  • 24
  • 5
    Take a look at this question on Linear Regression in SQL. http://stackoverflow.com/questions/2536895/are-there-any-linear-regression-function-in-sql-server – N West Jan 31 '12 at 18:19
  • 2
    @datayoda - How do you define the slope? If you have more than two data points, are you just taking the slope from the earliest point to the latest point? Or are you trying to draw a best fit line between all the points? If so, using linear least-squares or some other algorithm? Do you want to force the line to hit one of the points (knowing that will decrease the overall goodness of fit)? – Justin Cave Jan 31 '12 at 18:21
  • 2
    There are some linear regression functions in Oracle. http://docs.oracle.com/cd/B10501_01/server.920/a96540/functions101a.htm#85925 – N West Jan 31 '12 at 18:24
  • I don't have regr_slope available :( – datayoda Jan 31 '12 at 19:17
  • @NWest is correct, you need to use linear-regression, and his first link seems like quite a good one. A simple `y = mx + c` will not work as your data-points do not, and are highly unlikely to make a perfect "curve". Also bear in mind that with only 2-3 data-points your curve is going to be fairly inaccurate anyway. – Ben Jan 31 '12 at 20:08

4 Answers4

16

The cleanest one I could make:

SELECT
    Scores.Date, Scores.Keyword, Scores.Score,
    (N * Sum_XY - Sum_X * Sum_Y)/(N * Sum_X2 - Sum_X * Sum_X) AS Slope
FROM Scores
INNER JOIN (
    SELECT
        Keyword,
        COUNT(*) AS N,
        SUM(CAST(Date as float)) AS Sum_X,
        SUM(CAST(Date as float) * CAST(Date as float)) AS Sum_X2,
        SUM(Score) AS Sum_Y,
        SUM(CAST(Date as float) * Score) AS Sum_XY
    FROM Scores
    GROUP BY Keyword
) G ON G.Keyword = Scores.Keyword;

It uses Simple Linear Regression to calculate the slope.

Result:

Date         Keyword        Score         Slope
2012-01-22   water bottle   0,010885442   0,000334784345222076
2012-01-23   water bottle   0,011203949   0,000334784345222076
2012-01-24   water bottle   0,008460835   0,000334784345222076
2012-01-25   water bottle   0,010363991   0,000334784345222076
2012-01-26   water bottle   0,011800716   0,000334784345222076
2012-01-27   water bottle   0,012948411   0,000334784345222076
2012-01-28   water bottle   0,012732459   0,000334784345222076
2012-01-29   water bottle   0,011682568   0,000334784345222076

Every database system seems to have a different approach to converting dates to numbers:

  • MySQL: TO_SECONDS(date) or TO_DAYS(date)
  • Oracle: TO_NUMBER(TO_CHAR(date, 'J')) or date - TO_DATE('1','yyyy')
  • MS SQL Server: CAST(date AS float) (or equivalent CONVERT)
Markus Jarderot
  • 86,735
  • 21
  • 136
  • 138
  • @arielhasidim Yes. I removed it. It is only used for calculating confidence intervals and the correlation coefficient. This is beyond the scope of the question. – Markus Jarderot Feb 26 '21 at 17:45
1

If you're defining slope as just the slope from the earliest point to the latest point, and if score only increases with date, then you can get the output above with this:

SELECT *
  FROM scores
  JOIN
    (SELECT foo.keyword,
            (MAX(score)-MIN(score)) / DATEDIFF(MAX(date),MIN(date)) AS score
     FROM scores
     GROUP BY keyword) a
  USING(keyword);

However if you want linear regression, or if scores can decrease as well as increase with time, you'll need something more complex.

gcbenison
  • 11,723
  • 4
  • 44
  • 82
0

Cast to decimal does not give correct results for me, it is not linear to the dates. Use TO_DAYS(date_field) instead, this becomes correct.

albciff
  • 18,112
  • 4
  • 64
  • 89
0

Use this

SUM(CONVERT(float, datediff(dd, '1/1/1900', date_field)))

instead of

SUM(CAST(date_field AS float))
robsiemb
  • 6,157
  • 7
  • 32
  • 46
Sujan
  • 1