How to deduplicate in Presto

Question

I have a Presto table assume it has [id, name, update_time] columns and data

(1, Amy, 2018-08-01),
(1, Amy, 2018-08-02),
(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

Now, I want to execute a sql and the result will be

(1, Amyyyyyyy, 2018-08-03),
(2, Bob, 2018-08-01)

Currently, my best way to deduplicate in Presto is below.

select 
    t1.id, 
    t1.name,
    t1.update_time 
from table_name t1
join (select id, max(update_time) as update_time from table_name group by id) t2
    on t1.id = t2.id and t1.update_time = t2.update_time

More information, clike deduplication in sql

Is there a better way to deduplicate in Presto?

score 11 · Accepted Answer · answered Aug 01 '18 at 10:55

11

In PrestoDB, I would be inclined to use row_number():

select id, name, date
from (select t.*,
             row_number() over (partition by name order by date desc) as seqnum
      from table_name t
     ) t
where seqnum = 1;

answered Aug 01 '18 at 10:55

Gordon Linoff

1,242,037
58
646
786

That is what I wanted! Your sql did good performance: less input data but more cpu and memory. I noted that in [note](https://note.youdao.com/share/?id=fc13cbac5a4df6df599fd14cbf590fa5&type=note#/) – Archon Aug 02 '18 at 05:17

score 1 · Answer 2 · edited Aug 01 '18 at 10:11

1

You seems want subquery :

select t.*
from table t
where update_time = (select MAX(t1.update_time) from table t1 where t1.id = t.id);

edited Aug 01 '18 at 10:11

Tonner Mààn

133
1
11

answered Aug 01 '18 at 09:48

Yogesh Sharma

49,870
5
26
52

It did better than my sql(way 1) and I add to my note, thanks – Archon Aug 01 '18 at 10:01

score 1 · Answer 3 · answered Apr 01 '21 at 15:29

1

Here is another way

WITH latestDate AS (SELECT id,max(date) as latestDate FROM table_name GROUP BY id)
    SELECT id,name,date FROM table_name t INNER JOIN latestDate l ON t.id = l.id AND t.date = l.latestDate

answered Apr 01 '21 at 15:29

Dwamian Mcleish

11
2

Tonner Mààn · Answer 4 · 2018-08-01T10:24:32.987

0

It's easy:

Select id, name, MAX(update_time) as [Last Update] from table_name Group by id

Hope it helps

edited Aug 01 '18 at 10:24

answered Aug 01 '18 at 09:57

Tonner Mààn

133
1
11

1

Thanks for your answer. It's my fault that not describe my question clearly. And I already edited my question. Please check. Thank for your help. – Archon Aug 01 '18 at 10:05
@AlexGum in my code i only need to remove the column `Name` from the `Group by` clause (i did that edit already), the other code that @Yogesh Sharma suggested should work too – Tonner Mààn Aug 01 '18 at 10:30
1

Thanks for your answer. I appreciate it. but in Presto sql, column in `SELECT` clause must in `GROUP BY` clause, not like MySql. The answer from Gordon Linoff is really "Presto" style. Thanks a lot anyway. – Archon Aug 02 '18 at 05:27

score 0 · Answer 5 · answered Aug 01 '18 at 10:16

0

just use in operator

 select t.*
    from tableA t
    where update_time in (select MAX(tableA.update_time) from tableA goup by id)

answered Aug 01 '18 at 10:16

Zaynul Abadin Tuhin

31,407
5
33
63

I don't think this will work, it will also show the row `(1, Amy, 2018-08-01)` because the date `2018-08-01` is the max for the id = 2, you can try it and see, my english is not good enough to explain it well! – Tonner Mààn Aug 01 '18 at 10:35
@TonnerMààn no you are wrong because i consider only id not name – Zaynul Abadin Tuhin Aug 01 '18 at 10:42
it's the column update_time where there is the issue, i ran a test just to be sure and the resaults are as i expected, i'm sorry my english is not very good that's why i didn't know how to explain my idea – Tonner Mààn Aug 01 '18 at 10:46
1

I think @TonnerMààn is right, I had try it. The answer GordonLinoff and YogeshSharma given has the best performance for now. I noted them in [click](https://note.youdao.com/share/?id=fc13cbac5a4df6df599fd14cbf590fa5&type=note#/) – Archon Aug 02 '18 at 05:14

How to deduplicate in Presto

5 Answers5

Linked