I have some data in BigQuery in a table which is partitioned based on a DATE
column.
Please consider the following simplified example for demonstration:
| product_id | performance_date | computed_at | clicks |
| ----------- | ---------------- | ------------------- | ------ |
| 1 | 2023-06-01 | 2023-06-01T08:00:00 | 5 |
| 1 | 2023-06-01 | 2023-06-01T09:00:00 | 10 |
| 2 | 2023-06-01 | 2023-06-01T08:00:00 | 50 |
| 2 | 2023-06-01 | 2023-06-01T09:00:00 | 100 |
| 1 | 2023-06-02 | 2023-06-01T08:00:00 | 2 |
| 1 | 2023-06-02 | 2023-06-01T09:00:00 | 4 |
CREATE TABLE
my_dataset.product_performance (
product_id INT64,
performance_date DATE,
computed_at DATETIME,
clicks INT64
)
PARTITION BY
performance_date
OPTIONS (require_partition_filter = TRUE);
------------------------------------------------------------
INSERT my_dataset.product_performance
(product_id, performance_date, computed_at, clicks)
VALUES
(1, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 8, 0, 0), 5),
(1, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 9, 0, 0), 10),
(2, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 8, 0, 0), 50),
(2, DATE(2023, 6, 1), DATETIME(2023, 6, 1, 9, 0, 0), 100),
(1, DATE(2023, 6, 2), DATETIME(2023, 6, 1, 8, 0, 0), 2),
(1, DATE(2023, 6, 2), DATETIME(2023, 6, 1, 9, 0, 0), 4);
------------------------------------------------------------
I need to get the aggregated performance per product on each day based only on the last computed performance records on that day.
The following query will give what I am looking for:
SELECT
performance.product_id AS p_id,
performance.performance_date AS p_date,
SUM(performance.clicks) AS clicks
FROM
`my_project.my_dataset.product_performance` AS performance
WHERE
performance.performance_date = DATE(2023, 6, 1)
AND performance.computed_at = (
SELECT
MAX(performance_inner.computed_at)
FROM
`my_project.my_dataset.product_performance` AS performance_inner
WHERE
performance_inner.performance_date = performance.performance_date
AND performance_inner.product_id = performance.product_id
)
GROUP BY
p_id, p_date
As my query is more complex calculating a lot of performance metrics and I need to run this query based on multiple time ranges at the same time, I extracted the date filter predicate so that I can reuse a base
query multiple times.
I also need to play with date ranges, so I need to use some UDFs. The following simplified query is what I actually need:
CREATE TEMP FUNCTION refine_date(
date_to_be_refined DATE
) RETURNS DATE AS (
DATE_ADD(date_to_be_refined, INTERVAL 1 DAY)
);
WITH performance_base AS (
SELECT
performance.product_id AS p_id,
performance.performance_date AS p_date,
SUM(performance.clicks) AS clicks
FROM
`my_project.my_dataset.product_performance` AS performance
WHERE
performance.computed_at = (
SELECT
MAX(performance_inner.computed_at)
FROM
`my_project.my_dataset.product_performance` AS performance_inner
WHERE
performance_inner.performance_date = performance.performance_date
AND performance_inner.product_id = performance.product_id
)
GROUP BY
p_id, p_date
)
SELECT
*
FROM
performance_base
WHERE
p_date = refine_date(DATE(2023, 6, 1))
But unfortunately, it fails with the message:
Cannot query over table 'my_project.my_dataset.product_performance' without a filter over column(s) 'performance_date' that can be used for partition elimination
What I could not understand is that if I replace the filter in my query to avoid using the UDF, the query executes fine:
WITH performance_base AS (
SELECT
performance.product_id AS p_id,
performance.performance_date AS p_date,
SUM(performance.clicks) AS clicks
FROM
`my_project.my_dataset.product_performance` AS performance
WHERE
performance.computed_at = (
SELECT
MAX(performance_inner.computed_at)
FROM
`my_project.my_dataset.product_performance` AS performance_inner
WHERE
performance_inner.performance_date = performance.performance_date
-- performance_inner.performance_date = refine_date(DATE(2023, 6, 1))
AND performance_inner.product_id = performance.product_id
)
GROUP BY
p_id, p_date
)
SELECT
*
FROM
performance_base
WHERE
p_date = DATE(2023, 6, 1)
-- p_date = refine_date(DATE(2023, 6, 1))
It will also run if I use the UDF for both the outer and the inner query (replacing the commented out lines with the lines directly preceding them).
Could you please explain to me what the problem in my desired query exactly is? And how to get it working?
Please notice that the base query should not be aware of the date range used with the predicate filter and I still need to use a UDF as a lot of date-based calculations are required to get the desired date range for the query.
UPDATE:
Using the BigQuery Procedural Language (BQPL), I managed to get the results I am looking for keeping all the date-based calculations in UDFs.
DECLARE refined_date DATE;
CREATE TEMP FUNCTION refine_date(
date_to_be_refined DATE
) RETURNS DATE AS (
DATE_ADD(date_to_be_refined, INTERVAL 1 DAY)
);
SET refined_date = refine_date(DATE(2023, 6, 1));
WITH performance_base AS (
SELECT
performance.product_id AS p_id,
performance.performance_date AS p_date,
SUM(performance.clicks) AS clicks
FROM
`my_project.my_dataset.product_performance` AS performance
WHERE
performance.computed_at = (
SELECT
MAX(performance_inner.computed_at)
FROM
`my_project.my_dataset.product_performance` AS performance_inner
WHERE
performance_inner.performance_date = performance.performance_date
AND performance_inner.product_id = performance.product_id
)
GROUP BY
p_id, p_date
)
SELECT
*
FROM
performance_base
WHERE
p_date = refined_date
I am running this query from Looker to get a Temporary Derived Table for a dashboard. Unfortunately, BQPL is still not supported in Looker. Any recommendations or hints to look for?