Apache Flink: How to remove duplicates within select query?

Question

How to remove duplicates within a SELECT query using Apache fFlink?

My table is:

and I want to remove duplicates in ID with respect to keeping maximum in range

Are you running the query on static data set or on a data stream? — Fabian Hueske, Mar 01 '18 at 12:01

score 1 · Accepted Answer · answered Mar 01 '18 at 12:33

Assuming that the query is run on a static data set, it can be solved with regular SQL. Since Flink implements standard SQL, this query is not Flink-specific but would run on any relational database system.

SELECT DISTINCT t.id, t.name, t.range 
  FROM t, (SELECT id, MAX(range) AS maxRange FROM t GROUP BY id) s
  WHERE t.id = s.id AND t.range = s.maxRange

Note that you will lose duplicates if there is an id for which there are more than one row with the maximum range.

Apache Flink: How to remove duplicates within select query?

1 Answers1