0

First question StackOverflow and just getting started with SQL and BigQuery.

On BigQuery I have a project with one table: table.
This table contains:
-some data x,
-some timestamps for each entry time,
-some group identifiers group that partition the entries,
-and an INTEGER column called rank with all NULL values.

I want to use UPDATE to change these NULL values:
rank should represent the rank of each row within the respective group based on the time, in ascending order (rank=1 for the earliest x recorded in the group, rank=2 for the the following one etc.)

Here is how i set the query:

UPDATE project.table

SET table.rank = (
WITH temp AS (select *, GENERATE_UUID() AS id FROM project.table),
rank1 AS (
  SELECT * , RANK() OVER(PARTITION BY group ORDER BY time ASC)
  AS rankvalue FROM temp 
  ) 

SELECT rank1.rankvalue
FROM  temp INNER JOIN rank1 ON temp.id=rank1.id 
)
WHERE table.rank IS NULL 

The query is executed without errors and BigQuery tells me that all rows have been modified, as expected. But when I check, the values of rank are all still NULL.
Since I have checked that rankvalue is generated as expected, my guess is that there is some problem in the JOIN statement.
Though, I have checked that rank1 does indeed inherit from temp all the id I have created. So I don't understand why the JOIN fails.

ALTERNATIVE:
I tried an alternative way to do this: I tried first assigning a unique row identifier to table, called id, and then use the following query:

UPDATE project.table

SET table.rank = (
WITH rank1 AS (
  SELECT * , RANK() OVER(PARTITION BY group ORDER BY time ASC)
  AS rankvalue FROM project.table
  ) 

SELECT rank1.rankvalue
FROM  project.table INNER JOIN rank1 ON table.id=rank1.id 
)
WHERE table.rank IS NULL

But this query returns an error: Scalar subquery produced more than one element. I can't understand why, since I verified that the WITH/SELECT/FROM subquery returns exactly the same number of rows as there are in table.

Any heads up is greatly appreciated.

UPDATE:
I tried the following and it worked:

UPDATE project.table
SET table.rank = (
      WITH rank1 AS (
        SELECT * , RANK() 
            OVER(
        PARTITION BY group
        ORDER BY time AS
                         ) AS rankvalue
        FROM project.table) 

      SELECT rank1.rankvalue
      FROM rank1 WHERE
            table.id=rank1.id
                )
WHERE TRUE

Which is a modification of the second alternative I tried before. Question: Is it now working because WHERE table.id=rank1.id picks each individual table.id corresponding to each table.rank that is undergoing the update and matches it to the corresponding rank.id (that is, it picks one row in the RHS for each row in the LHS), while project.table INNER JOIN rank1 ON table.id=rank1.id would return an entire joined table for each value of table.rank that is undergoing update (i.e. multiple lines on the RHS for each row in the LHS)?

mcavati
  • 1
  • 1
  • Do you understand that a scalar subquery must have one column & one row? PS [mcve] please. – philipxy May 25 '19 at 22:21
  • I thought a scalar subquery in this case should be one column and one row *per row in* `table`, since the `UPDATE project.table SET table.rank=` statement is supposed to edit multiple rows, not only one. What am I missing here? – mcavati May 25 '19 at 22:47
  • 1
    See if this helps: https://stackoverflow.com/a/53656698/6253347 – Elliott Brossard May 25 '19 at 23:27
  • 1
    Thanks a ton Elliott for the link. I did have some duplicates I was not aware of and I eliminated them using the answer in your link. Though, I still get the exact same errors in both the alternative queries in my question. – mcavati May 26 '19 at 05:00
  • It's true that UPDATE updates a bunch of rows, but each updated row has each of its columns updated to the one value expressed by the SET using its old column values & the values of the columns of the one row it got joined to per the FROM. A subquery where a value is expected must be a scalar subquery holding the one value. Read an introduction/manual re the parts of the language you use before you use them & before you ask a question re how you used them. [BigQuery Documentation Data Manipulation Language Syntax](https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax) – philipxy May 26 '19 at 07:36
  • Possible duplicate of [Does BigQuery support UPDATE, DELETE, and INSERT (SQL DML) statements?](https://stackoverflow.com/q/38922795/3404097) – philipxy May 26 '19 at 07:40
  • Thank you @philipxy for your answer. I read more documentation and I think I found the right way to do it. See my question update. Would you mind answering the question at the bottom just to make sure that I understood the solution correctly? Thank you so much – mcavati May 26 '19 at 19:18
  • This would be more clearly expressed & possibly more simply calculated if you used an UPDATE FROM, so that the command calculates the join of the base with a table & keeps those satisfying a WHERE then for each kept base row SETs rank per the various columns it was joined to. Right now a naive (& possibly actual) evaluation of your queries has to do a select--the subselect--for every UPDATE WHERE row--which includes a join & partitioning & ranking--whereas what I described does that once for the whole table. – philipxy May 26 '19 at 19:54
  • Maybe the process you are trying to describe is correct but your description is not clear. Does "each table.rank that is undergoing the update" mean rank of each base row satisfying `true`?--OK. What are LHS & RHS & what are they sides of? You talk about deep subexpressions doing things but it's not clear what you mean because all a subexpression *actually* does is participate with its sibling subexpressions in determining the value of a parent subexpression. PS The RHS of = must specify one value. If via a subquery, it must be scalar. Now it is--OK. Before it wasn't--invalid & nonsensical. – philipxy May 26 '19 at 20:21
  • I suggest you describe what you think happens by following the description in some SQL/bigquery documention & ask about where you are stuck or dubious. Be sure to mention which *rows* get updated then what similar thing is done to each *row*. Describe a general step/part then its reflection in the example. Describe the not working case separately similarly. PS A "basically" or "essentially" or "in other words" that doesn't introduce or summarize a clear, precise & full description that you also give just means "unclearly". PS Your title seems to mention an earlier irrelevant error message? – philipxy May 26 '19 at 21:10

0 Answers0