0

I'd like to get the first N records from each partition in a hive partitioned table without having to deserialize every record in the table. I've seen a number of solutions for other databases, but not one that works in Hive without running over every single record.

A minimal example:

create table demo ( val int ) 
    partitioned by (partid int);
insert into demo partition (partid=1) 
    select stack(5, 1, 2, 3, 4, 5);
insert into demo partition (partid=2) 
    select stack(5, 100, 200, 300, 400, 500)
insert into demo partition (partid=3) 
    select stack(5, -1, -2, -3, -4, -5)
insert into demo ...
    ...

I'd like to get the result of

select * from partition_demo where partid = 1 limit 1
union all
select * from partition_demo where partid = 2 limit 1
union all
select * from partition_demo where partid = 3 limit 1
union all
...

without writing every single clause, and without deserializing all of the data in each partition (which seems to happen using RANK OVER).

Michael K
  • 2,196
  • 6
  • 34
  • 52

0 Answers0