1

SQL BigQuery - I have duplicate rows on the primary key that I need to remove (I don't want to permanently delete from the table). I have to GROUP BY several other fields to aggregate results, and am having issues since I have to include the primary key but cannot group by it and remove duplicates at the same time.

I saw examples on stack but am having issues applying to this case. I have tried the below, and that is not resulting in any groupings because I have to group by the ID to leave it in the select statement.

SELECT Distinct (ID),Event_ID,Event_Name,Sum(Quantity), Sum(Price*Quantity)
FROM table
GROUP BY ID, Event_ID, Event_Name

I tried removing ID from the group by clause and adding "Min (Distinct (ID))," as the first select line, but that was not removing the duplicates.

I also tried

SELECT Distinct (ID),Event_ID, Event_Name, Sum(Quantity) ,Sum(Price*Quantity), 
FROM table

INNER JOIN 
  (SELECT ID as DeduplicatedID
  FROM table 
  GROUP BY ID) deduplicatedtable ON table.ID = deduplicatedtable.ID

GROUP BY ID, Event_ID, Event_Name

and received this error

Thanks in advance!

Here is a screenshot

of the group at the ID level, however some of these IDs have duplicates, which provide incorrect TotalQuantity and TotalRevenue numbers when grouped by the Event

Here is a picture

of what I would want the grouped data to look like, but this is incorrect and includes the duplicate IDs when the quantity and rev fields are summed.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
Jacques
  • 57
  • 2
  • 7
  • 2
    Primary keys -- by definition -- do not have duplicates. Please explain what column or columns should not be duplicated. Sample data and desired results would be helpful. – Gordon Linoff Feb 27 '20 at 21:03
  • correct - I guess it is supposed to be the primary key but for some reason there are duplicates of certain rows and I can't parse out a rhyme or reason why it is only some rows and not others...I am adding screenshots of data now to Original post – Jacques Feb 27 '20 at 21:19
  • added the pictures above – Jacques Feb 27 '20 at 21:24
  • What exactly is duplicated? The the entire row or only the id value? I mean, for the duplicated ID value, the rest of the row is similar or different? – rmesteves Feb 28 '20 at 08:55
  • There are several answers to do this in this [post](https://stackoverflow.com/questions/12239169/how-to-select-records-without-duplicate-on-just-one-field-in-sql), including Gordon's. Try [this one](https://stackoverflow.com/a/29912858/11928117) – Ksign Feb 28 '20 at 10:32
  • @rmesteves the entire row is duplicated – Jacques Feb 28 '20 at 14:24
  • If the entire row is duplicated then both of the below answers should work for you. – Ksign Feb 28 '20 at 15:57

3 Answers3

0

Does this do what you want?

SELECT MIN(ID) as id, Event_ID,Event_Name,Sum(Quantity), Sum(Price*Quantity)
FROM table
GROUP BY Event_ID, Event_Name
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • unfortunately that does not get rid of the duplicates for me; unsure why exactly – Jacques Feb 27 '20 at 21:30
  • to add, when I run this, it is getting rid of all of the other IDs that are not duplicates and that are not the min ID – Jacques Mar 03 '20 at 20:59
0

Try doing

WITH t as (
SELECT Distinct *
FROM table
)
SELECT ID, Event_ID, Event_Name, Sum(Quantity), Sum(Price*Quantity)
FROM t
GROUP BY ID, Event_ID, Event_Name
rmesteves
  • 3,870
  • 7
  • 23
0

If, as I suspect, there is one field that is not duplicated, find it and use ANY_VALUE(not_duplicated_field).
The ANY_VALUE function returns any value from the input or NULL if there are zero input rows. The value returned is non-deterministic, which means you might receive a different result each time you use this function.

For example:

SELECT DISTINCT ID, Event_ID, ANY_VALUE(Event_Name), sum(Quantity), sum(Price*Quantity)
FROM blabla.jack
GROUP BY id, Event_ID
Ksign
  • 779
  • 5
  • 11