First question StackOverflow and just getting started with SQL and BigQuery.
On BigQuery I have a project with one table: table
.
This table contains:
-some data x
,
-some timestamps for each entry time
,
-some group identifiers group
that partition the entries,
-and an INTEGER
column called rank
with all NULL
values.
I want to use UPDATE
to change these NULL
values:
rank
should represent the rank of each row within the respective group
based on the time
, in ascending order (rank=1
for the earliest x
recorded in the group, rank=2
for the the following one etc.)
Here is how i set the query:
UPDATE project.table
SET table.rank = (
WITH temp AS (select *, GENERATE_UUID() AS id FROM project.table),
rank1 AS (
SELECT * , RANK() OVER(PARTITION BY group ORDER BY time ASC)
AS rankvalue FROM temp
)
SELECT rank1.rankvalue
FROM temp INNER JOIN rank1 ON temp.id=rank1.id
)
WHERE table.rank IS NULL
The query is executed without errors and BigQuery tells me that all rows have been modified, as expected. But when I check, the values of rank
are all still NULL
.
Since I have checked that rankvalue
is generated as expected, my guess is that there is some problem in the JOIN
statement.
Though, I have checked that rank1
does indeed inherit from temp
all the id
I have created. So I don't understand why the JOIN
fails.
ALTERNATIVE:
I tried an alternative way to do this: I tried first assigning a unique row identifier to table
, called id
, and then use the following query:
UPDATE project.table
SET table.rank = (
WITH rank1 AS (
SELECT * , RANK() OVER(PARTITION BY group ORDER BY time ASC)
AS rankvalue FROM project.table
)
SELECT rank1.rankvalue
FROM project.table INNER JOIN rank1 ON table.id=rank1.id
)
WHERE table.rank IS NULL
But this query returns an error: Scalar subquery produced more than one element
. I can't understand why, since I verified that the WITH/SELECT/FROM subquery returns exactly the same number of rows as there are in table
.
Any heads up is greatly appreciated.
UPDATE:
I tried the following and it worked:
UPDATE project.table
SET table.rank = (
WITH rank1 AS (
SELECT * , RANK()
OVER(
PARTITION BY group
ORDER BY time AS
) AS rankvalue
FROM project.table)
SELECT rank1.rankvalue
FROM rank1 WHERE
table.id=rank1.id
)
WHERE TRUE
Which is a modification of the second alternative I tried before.
Question: Is it now working because WHERE table.id=rank1.id
picks each individual table.id
corresponding to each table.rank
that is undergoing the update and matches it to the corresponding rank.id
(that is, it picks one row in the RHS for each row in the LHS), while project.table INNER JOIN rank1 ON table.id=rank1.id
would return an entire joined table for each value of table.rank
that is undergoing update (i.e. multiple lines on the RHS for each row in the LHS)?