I'd like to get the first N records from each partition in a hive partitioned table without having to deserialize every record in the table. I've seen a number of solutions for other databases, but not one that works in Hive without running over every single record.
A minimal example:
create table demo ( val int )
partitioned by (partid int);
insert into demo partition (partid=1)
select stack(5, 1, 2, 3, 4, 5);
insert into demo partition (partid=2)
select stack(5, 100, 200, 300, 400, 500)
insert into demo partition (partid=3)
select stack(5, -1, -2, -3, -4, -5)
insert into demo ...
...
I'd like to get the result of
select * from partition_demo where partid = 1 limit 1
union all
select * from partition_demo where partid = 2 limit 1
union all
select * from partition_demo where partid = 3 limit 1
union all
...
without writing every single clause, and without deserializing all of the data in each partition (which seems to happen using RANK OVER
).