How do you remove duplicate rows of primary key and also group by in SQL Bigquery?

Question

SQL BigQuery - I have duplicate rows on the primary key that I need to remove (I don't want to permanently delete from the table). I have to GROUP BY several other fields to aggregate results, and am having issues since I have to include the primary key but cannot group by it and remove duplicates at the same time.

I saw examples on stack but am having issues applying to this case. I have tried the below, and that is not resulting in any groupings because I have to group by the ID to leave it in the select statement.

SELECT Distinct (ID),Event_ID,Event_Name,Sum(Quantity), Sum(Price*Quantity)
FROM table
GROUP BY ID, Event_ID, Event_Name

I tried removing ID from the group by clause and adding "Min (Distinct (ID))," as the first select line, but that was not removing the duplicates.

I also tried

SELECT Distinct (ID),Event_ID, Event_Name, Sum(Quantity) ,Sum(Price*Quantity), 
FROM table

INNER JOIN 
  (SELECT ID as DeduplicatedID
  FROM table 
  GROUP BY ID) deduplicatedtable ON table.ID = deduplicatedtable.ID

GROUP BY ID, Event_ID, Event_Name

and received this error

Thanks in advance!

Here is a screenshot

of the group at the ID level, however some of these IDs have duplicates, which provide incorrect TotalQuantity and TotalRevenue numbers when grouped by the Event

Here is a picture

of what I would want the grouped data to look like, but this is incorrect and includes the duplicate IDs when the quantity and rev fields are summed.

Primary keys -- by definition -- do not have duplicates. Please explain what column or columns should not be duplicated. Sample data and desired results would be helpful. — Gordon Linoff, Feb 27 '20 at 21:03
correct - I guess it is supposed to be the primary key but for some reason there are duplicates of certain rows and I can't parse out a rhyme or reason why it is only some rows and not others...I am adding screenshots of data now to Original post — Jacques, Feb 27 '20 at 21:19
What exactly is duplicated? The the entire row or only the id value? I mean, for the duplicated ID value, the rest of the row is similar or different? — rmesteves, Feb 28 '20 at 08:55
There are several answers to do this in this [post](https://stackoverflow.com/questions/12239169/how-to-select-records-without-duplicate-on-just-one-field-in-sql), including Gordon's. Try [this one](https://stackoverflow.com/a/29912858/11928117) — Ksign, Feb 28 '20 at 10:32
If the entire row is duplicated then both of the below answers should work for you. — Ksign, Feb 28 '20 at 15:57

score 0 · Answer 1 · answered Feb 27 '20 at 21:05

0

Does this do what you want?

SELECT MIN(ID) as id, Event_ID,Event_Name,Sum(Quantity), Sum(Price*Quantity)
FROM table
GROUP BY Event_ID, Event_Name

answered Feb 27 '20 at 21:05

Gordon Linoff

1,242,037
58
646
786

unfortunately that does not get rid of the duplicates for me; unsure why exactly – Jacques Feb 27 '20 at 21:30
to add, when I run this, it is getting rid of all of the other IDs that are not duplicates and that are not the min ID – Jacques Mar 03 '20 at 20:59

score 0 · Answer 2 · answered Feb 28 '20 at 14:51

0

Try doing

WITH t as (
SELECT Distinct *
FROM table
)
SELECT ID, Event_ID, Event_Name, Sum(Quantity), Sum(Price*Quantity)
FROM t
GROUP BY ID, Event_ID, Event_Name

answered Feb 28 '20 at 14:51

rmesteves

3,870
7
23

score 0 · Answer 3 · answered Mar 02 '20 at 15:32

If, as I suspect, there is one field that is not duplicated, find it and use ANY_VALUE(not_duplicated_field).
The ANY_VALUE function returns any value from the input or NULL if there are zero input rows. The value returned is non-deterministic, which means you might receive a different result each time you use this function.

For example:

SELECT DISTINCT ID, Event_ID, ANY_VALUE(Event_Name), sum(Quantity), sum(Price*Quantity)
FROM blabla.jack
GROUP BY id, Event_ID

How do you remove duplicate rows of primary key and also group by in SQL Bigquery?

3 Answers3