0

I have some data in BigQuery in a table which is partitioned based on a DATE column.

Please consider the following simplified example for demonstration:

| product_id  | performance_date |     computed_at     | clicks |
| ----------- | ---------------- | ------------------- | ------ |
|      1      |    2023-06-01    | 2023-06-01T08:00:00 |   5    |
|      1      |    2023-06-01    | 2023-06-01T09:00:00 |   10   |
|      2      |    2023-06-01    | 2023-06-01T08:00:00 |   50   |
|      2      |    2023-06-01    | 2023-06-01T09:00:00 |   100  |
|      1      |    2023-06-02    | 2023-06-01T08:00:00 |   2    |
|      1      |    2023-06-02    | 2023-06-01T09:00:00 |   4    |
CREATE TABLE
  my_dataset.product_performance (
    product_id INT64, 
    performance_date DATE, 
    computed_at DATETIME, 
    clicks INT64
  )
PARTITION BY
  performance_date
  OPTIONS (require_partition_filter = TRUE);

------------------------------------------------------------

INSERT my_dataset.product_performance 
  (product_id, performance_date, computed_at, clicks)
VALUES
  (1, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 8, 0, 0), 5),
  (1, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 9, 0, 0), 10),
  (2, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 8, 0, 0), 50),
  (2, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 9, 0, 0), 100),
  (1, DATE(2023, 6, 2), DATETIME(2023, 6, 1, 8, 0, 0), 2),
  (1, DATE(2023, 6, 2), DATETIME(2023, 6, 1, 9, 0, 0), 4);

------------------------------------------------------------

I need to get the aggregated performance per product on each day based only on the last computed performance records on that day.

The following query will give what I am looking for:

SELECT
  performance.product_id AS p_id,
  performance.performance_date AS p_date,
  SUM(performance.clicks) AS clicks
FROM
  `my_project.my_dataset.product_performance` AS performance
WHERE
  performance.performance_date = DATE(2023, 6, 1)
  AND performance.computed_at = (
    SELECT
      MAX(performance_inner.computed_at)
    FROM
      `my_project.my_dataset.product_performance` AS performance_inner
    WHERE
      performance_inner.performance_date = performance.performance_date
      AND performance_inner.product_id = performance.product_id
  )
GROUP BY
  p_id, p_date

As my query is more complex calculating a lot of performance metrics and I need to run this query based on multiple time ranges at the same time, I extracted the date filter predicate so that I can reuse a base query multiple times. I also need to play with date ranges, so I need to use some UDFs. The following simplified query is what I actually need:

CREATE TEMP FUNCTION refine_date(
  date_to_be_refined DATE
) RETURNS DATE AS (
    DATE_ADD(date_to_be_refined, INTERVAL 1 DAY)
);

WITH performance_base AS (
  SELECT
    performance.product_id AS p_id,
    performance.performance_date AS p_date,
    SUM(performance.clicks) AS clicks
  FROM
    `my_project.my_dataset.product_performance` AS performance
  WHERE
    performance.computed_at = (
      SELECT
        MAX(performance_inner.computed_at)
      FROM
        `my_project.my_dataset.product_performance` AS performance_inner
      WHERE
        performance_inner.performance_date = performance.performance_date
        AND performance_inner.product_id = performance.product_id
    )
  GROUP BY
    p_id, p_date
)

SELECT
  *
FROM
  performance_base
WHERE
  p_date = refine_date(DATE(2023, 6, 1))

But unfortunately, it fails with the message:

Cannot query over table 'my_project.my_dataset.product_performance' without a filter over column(s) 'performance_date' that can be used for partition elimination

What I could not understand is that if I replace the filter in my query to avoid using the UDF, the query executes fine:

WITH performance_base AS (
  SELECT
    performance.product_id AS p_id,
    performance.performance_date AS p_date,
    SUM(performance.clicks) AS clicks
  FROM
    `my_project.my_dataset.product_performance` AS performance
  WHERE
    performance.computed_at = (
      SELECT
        MAX(performance_inner.computed_at)
      FROM
        `my_project.my_dataset.product_performance` AS performance_inner
      WHERE
        performance_inner.performance_date = performance.performance_date
        -- performance_inner.performance_date = refine_date(DATE(2023, 6, 1))
        AND performance_inner.product_id = performance.product_id
    )
  GROUP BY
    p_id, p_date
)

SELECT
  *
FROM
  performance_base
WHERE
  p_date = DATE(2023, 6, 1)
  -- p_date = refine_date(DATE(2023, 6, 1))

It will also run if I use the UDF for both the outer and the inner query (replacing the commented out lines with the lines directly preceding them).

Could you please explain to me what the problem in my desired query exactly is? And how to get it working?

Please notice that the base query should not be aware of the date range used with the predicate filter and I still need to use a UDF as a lot of date-based calculations are required to get the desired date range for the query.

UPDATE:

Using the BigQuery Procedural Language (BQPL), I managed to get the results I am looking for keeping all the date-based calculations in UDFs.

DECLARE refined_date DATE;

CREATE TEMP FUNCTION refine_date(
  date_to_be_refined DATE
) RETURNS DATE AS (
    DATE_ADD(date_to_be_refined, INTERVAL 1 DAY)
);

SET refined_date = refine_date(DATE(2023, 6, 1));

WITH performance_base AS (
  SELECT
    performance.product_id AS p_id,
    performance.performance_date AS p_date,
    SUM(performance.clicks) AS clicks
  FROM
    `my_project.my_dataset.product_performance` AS performance
  WHERE
    performance.computed_at = (
      SELECT
        MAX(performance_inner.computed_at)
      FROM
        `my_project.my_dataset.product_performance` AS performance_inner
      WHERE
        performance_inner.performance_date = performance.performance_date
        AND performance_inner.product_id = performance.product_id
    )
  GROUP BY
    p_id, p_date
)

SELECT
  *
FROM
  performance_base
WHERE
  p_date = refined_date


I am running this query from Looker to get a Temporary Derived Table for a dashboard. Unfortunately, BQPL is still not supported in Looker. Any recommendations or hints to look for?

Mousa
  • 2,926
  • 1
  • 27
  • 35

0 Answers0