I have below dataset in a Hive table name - PUBLISH
N.B. There can be duplicate records in PUBLISH
DATE |HOUR|SOURCE|COL_TIMESTAMP |ID
20200101|14 |A |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14 |A |2020-01-01 14:18:53.012 GMT|ID_222
20200101|14 |A |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14 |A |2020-01-01 14:18:53.019 GMT|ID_333
20200101|15 |C |2020-01-01 15:18:53.016 GMT|ID_444
20200102|00 |A |2020-01-01 15:18:53.016 GMT|ID_444
I want to generate below output based on a particular date, hour and source
E.g. For (DATE=20200101
& HOUR=14
& SOURCE=A
), the output should be:
DATE |HOUR|SOURCE|MIN_TIMESTAMP |START_ID|MAX_TIMESTAMP |END_ID|RECORD_CNT
20200101|14 |A |2020-01-01 14:18:53.012 GMT|ID_222 |2020-01-01 14:18:53.019 GMT|ID_333|3
N.B. The timestamp has 'GMT' at the end. Also I am trying to run the query using spark java code. Please suggest a hive query which will show good performance, when the data size is huge.