How to perform linear regression in BigQuery?

Question

BigQuery has some statistical aggregation functions such as STDDEV(X) and CORR(X, Y), but it doesn't offer functions to directly perform linear regression.

How can one compute a linear regression using the functions that do exist?

Hi sprocket! Could you change the accepted answer to the now native support? — Felipe Hoffa, Mar 31 '19 at 17:16

score 19 · Accepted Answer · edited Mar 31 '19 at 17:16

Editor's edit: Please see next answer, linear regression is now natively supported in BigQuery. --Fh

The following query performs a linear regression using calculations that are numerically stable and easily modified to work over any input table. It produces the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT and the Pearson correlation coefficient using the builtin function CORR.

As an example, we use the public natality dataset to compute birth weight as a linear function of the duration of pregnancy, broken down by state. You could write this more compactly, but we use several layers of subqueries to highlight how the pieces go together. To apply this to another dataset, you just need to replace the innermost query.

SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
    SELECT Bucket,
           N,
           SUM_OF_X,
           SUM_OF_Y,
           CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
           CORRELATION
    FROM (
        SELECT Bucket,
               COUNT(*) AS N,
               SUM(X) AS SUM_OF_X,
               SUM(Y) AS SUM_OF_Y,
               STDDEV_POP(X) AS STDDEV_OF_X,
               STDDEV_POP(Y) AS STDDEV_OF_Y,
               CORR(X,Y) AS CORRELATION
        FROM (SELECT state AS Bucket,
                     gestation_weeks AS X,
                     weight_pounds AS Y
              FROM [publicdata.samples.natality])
        WHERE Bucket IS NOT NULL AND
              X IS NOT NULL AND
              Y IS NOT NULL
        GROUP BY Bucket));

Using the STDDEV_POP and CORR functions improves the numerical stability of this query compared to summing up products of X and Y and then taking differences and dividing, but if you use both approaches on a well-behaved dataset, you can verify that they produce the same results to high accuracy.

Absolutely incredible. Without using BQ ML, is there a similar query that does logarithmic regression? — Justin, Jun 08 '23 at 23:09

score 3 · Answer 2 · answered Jan 10 '19 at 22:29

3

Good news! BigQuery now has native support for ML.

To produce a linear regression CREATE MODEL, and then predict with SELECT FROM ML.PREDICT.

Docs:

https://cloud.google.com/bigquery/docs/bigqueryml

Fun example: When will Stack Overflow reply

answered Jan 10 '19 at 22:29

Felipe Hoffa

54,922
16
151
325

This is not helpful. I only need two functions, regr_slope and regr_intercept. BigQuery ML can do the predict but it can't tell me that two numbers. The first answer solved my problem perfectly. – crvv Jul 22 '22 at 09:19

score 0 · Answer 3 · answered May 01 '19 at 15:06

Here the code to create a linear regression model using the public dataset on natality (live births) and to generate this into a dataset named demo_ml_bq. This must be created before running the below statement.

%%bq query
CREATE or REPLACE MODEL demo_bq_ml.babyweight_model_asis
OPTIONS
  (model_type='linear_reg', labels=['weight_pounds']) AS

WITH natality_data AS (
  SELECT
     weight_pounds, -- this is the label; because it is continuous, we need to use regression
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CAST(plurality AS STRING) AS plurality,
    gestation_weeks,
    CAST(alcohol_use AS STRING) AS alcohol_use,
    CAST(year AS STRING) AS year,
    ABS(FARM_FINGERPRINT(CONCAT(CAST(YEAR AS STRING), CAST(month AS STRING)))) AS hashmonth
  FROM
    publicdata.samples.natality
  WHERE
    year > 2000
    AND gestation_weeks > 0
    AND mother_age > 0
    AND plurality > 0
    AND weight_pounds > 0
)

SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    alcohol_use,
    year
FROM
    natality_data
WHERE
  MOD(hashmonth, 4) < 3  -- select 75% of the data as training

How to perform linear regression in BigQuery?

3 Answers3

Linked