2

Can someone please help me understand the circumstances in which regr_slope returns NULL where there are rows in the data set? For example:

log=> select * from sb1 order by id, ts;
  id  | elapsed |       ts       
------+---------+----------------
 317e |      86 | 1552861322.627
 317e |      58 | 1552861324.747
 317e |      52 | 1552861325.722
 317e |      58 | 1552861326.647
 317e |      82 | 1552861327.609
 317e |     118 | 1552861328.514
 317e |      58 | 1552861329.336
 317e |      58 | 1552861330.317
 317e |      54 | 1552861330.935
 3441 |      68 | 1552861324.765
 3441 |      84 | 1552861326.665
 3441 |      56 | 1552861327.627
 3441 |      50 | 1552861330.952
 5fe6 |      42 | 1552993248.398
 5fe6 |      44 | 1552993255.883
 5fe6 |      44 | 1553166049.261
 c742 |      62 | 1552861322.149
 c742 |      68 | 1552861322.455
(18 rows)

log=> select id, regr_slope(elapsed, ts) as trend from sb1 group by id;
  id  |        trend         
------+----------------------
 c742 |                     
 317e |                     
 5fe6 | 5.78750952760444e-06
 3441 |                     
(4 rows)

Interestingly, the same dataset and function returns different results in Oracle 11.2 :

SQL> select * from sb1 order by id, ts;

ID            ELAPSED               TS
---------- ---------- ----------------
317e               86   1552861322.627
317e               58   1552861324.747
317e               52   1552861325.722
317e               58   1552861326.647
317e               82   1552861327.609
317e              118   1552861328.514
317e               58   1552861329.336
317e               58   1552861330.317
317e               54   1552861330.935
3441               68   1552861324.765
3441               84   1552861326.665
3441               56   1552861327.627
3441               50   1552861330.952
5fe6               42   1552993248.398
5fe6               44   1552993255.883
5fe6               44   1553166049.261
c742               62   1552861322.149
c742               68   1552861322.455

18 rows selected.

SQL> select id, regr_slope(elapsed, ts) from sb1 group by id;

ID         REGR_SLOPE(ELAPSED,TS)
---------- ----------------------
c742                   19.6078431
5fe6                   5.7875E-06
317e                   -1.0838511
3441                   -3.8283951

I don't know if this means there is a problem with Postgres, Oracle, neither or both, although the results for 5fe6 are the same.

Timur Shtatland
  • 12,024
  • 2
  • 30
  • 47
Horrendo
  • 393
  • 3
  • 14
  • I didn't check, but it be that the regression line is vertical? The PostgreSQL source has: `/* per spec, return NULL for a vertical line */` – Laurenz Albe Mar 22 '19 at 10:05
  • Thanks @LaurenzAlbe, based on the values returned by Oracle, they're not even close to vertical. Do you think this is a bug I should report? – Horrendo Mar 22 '19 at 22:02

2 Answers2

5

After digging into the code, I have the answer:

The problem is that the naïve approach of PostgreSQL up to v12 leads to unnecessarily big rounding errors in this case.

Let's consider id = 'c742':

The formula for regr_slope is like this:

regr_slope := (N ⋅ Σ(XiYi) - ΣXi ⋅ ΣYi) / (N ⋅ Σ(Xi2) - ΣXi ⋅ ΣXi)

The problem is in the divisor:

SELECT 2::float8 * (1552861322.149::float8 * 1552861322.149::float8 +
                    1552861322.455::float8 * 1552861322.455::float8) -
       (1552861322.149::float8 + 1552861322.455::float8) *
       (1552861322.149::float8 + 1552861322.455::float8);

 ?column? 
----------
    -2048 
(1 row)

Since the result is negative, PostgreSQL returns a NULL result.

This would not have happened using exact computation (using numeric):

SELECT 2 * (1552861322.149 * 1552861322.149 +
            1552861322.455 * 1552861322.455) -
       (1552861322.149 + 1552861322.455) *
       (1552861322.149 + 1552861322.455);

 ?column? 
----------
 0.093636
(1 row)

Since PostgreSQL commit e954a727f0c8872bf5203186ad0f5312f6183746, things are improved, and in PostgreSQL v12, PostgreSQL also returns the correct result:

select id, regr_slope(elapsed, ts) from sb1 group by id;

  id  |      regr_slope       
------+-----------------------
 c742 |    19.607858781290517
 317e |   -1.0838511987808963
 5fe6 | 5.787509483586743e-06
 3441 |    -3.828395463097356
(4 rows)
Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
2

It turns out this is due to a bug in the calculation of the Sxx value caused by the large values I have for my x axis. This is described here if you're interested.

Because my x values are derived from extract(epoch from tstz_col), it is simple for me to subtract a 'base' value so the Sxx calculation doesn't overflow. For example:

log=> select id, regr_slope(elapsed, ts - extract(epoch from now())) as trend from sb1 group by id;
  id  |        trend         
------+----------------------
 c742 |     19.5839996337891
 317e |    -1.08384865545836
 5fe6 | 5.78750948360273e-06
 3441 |    -3.82839498627572
(4 rows)

The results aren't exactly the same as returned by Oracle but I'm only looking for a 'trend' so they're fine in my case.

H.

Horrendo
  • 393
  • 3
  • 14