19

I have a partitioned table - with 201 partitions. I need to find latest partition in this table and use it to post process my data. The query to find list of all partitions is :

use db;
show partitions table_name; 

I need a query to find the latest of these partitions. The partitions are in format

ingest_date=2016-03-09

I tried using max() which gave me a wrong result. I do not want to traverse through entire table by doing

select max(ingest_date) from db.table_name; 

This would give me the expected output.. but kill the whole point of having partitions in the 1st place.

Is there a more efficient query to get the latest partition for HIve table ?

zero323
  • 322,348
  • 103
  • 959
  • 935
underwood
  • 845
  • 2
  • 11
  • 22
  • Possible duplicate of [Hive: how to show all partitions of a table?](http://stackoverflow.com/questions/15616290/hive-how-to-show-all-partitions-of-a-table) – zero323 Mar 18 '16 at 23:25
  • Alas, Hive does not expose the metadata as virtual tables - nothing like a `SYS` or `INFORMATION_SCHEMA` database. The only efficient way to list the partitions of a Hive table is to use the **Metastore java API**. Or to reverse-engineer the way partitions are organized **in HDFS** _(quite straightforward actually)_ then list the directories, hoping HDFS and the Metastore are 100% in sync. – Samson Scharfrichter Mar 19 '16 at 10:06
  • Or, if you can query the Metastore database directly *(probably in MySQL or PostgreSQL)*, reverse-engineer the data model and run any SQL you like. – Samson Scharfrichter Mar 19 '16 at 10:09

4 Answers4

12

You can use "show partitions":

hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1 | cut -d'=' -f2

This will give you "2016-03-09" as output.

pkgajulapalli
  • 1,066
  • 3
  • 20
  • 44
  • 1
    thanks for the response... but a quick question around this.... is there a chance, as the no.of partitions of a hive table grows... example say a lakh... would that command still work or execute with in the short times ? – Raja Nov 29 '17 at 16:56
  • Raja, it will still work. I would not assume a return order though, so would advise a smarter script to find the latest – MikeKulls Mar 15 '18 at 04:41
  • **ERROR** even eliminating the `set` directive from hive command. *Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.TezException: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User a0087517 does not have permission to submit application_1570940848794_129274 to queue default at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)* – Peter Krauss Oct 24 '19 at 13:41
10

If you want to avoid running the "show partitions" in hive shell as suggested above, you can apply a filter to your max() query. That will avoid doing a fulltable scan and results should be fairly quick!

select max(ingest_date) from db.table_name where ingest_date>date_add(current_date,-3) will only scan 2-3 partitions.

Kash
  • 351
  • 6
  • 11
  • 2
    I found that ```select max(ingest_date) from db.table_name``` was very useful in my problem as it only returns one value from the all the possible partitions within a hive table. – Bluebird Sep 27 '19 at 19:39
  • 1
    only `select max(x) from table;` is a full table scan as there is no information to tell hive where to look. so it will scan every partition for the max of each then take the max those. so yes it gives the correct answer but very inefficiently. the point of using partitions is to avoid a full table scan. this answer uses `where ingest_date = date_add(current_date,-3)` where ingest_date is a partition key so that only the latest few partitions as searched. – simbo1905 May 03 '20 at 11:38
2

if you know your table location in hdfs. This is the most quick way without even opening the hive shell.

You can check you table location in hdfs using command;

show create table <table_name>

then

hdfs dfs -ls <table_path>| sort -k6,7 | tail -1

It will show latest partition location in hdfs

Strick
  • 1,512
  • 9
  • 15
0

It looks like there is no way to query for the last partition via Hive (or beeline) CLI that checks only metadata (as one should expect).

For the sake of completeness, the alternative I would propose to the bash parsing answer is the one directly querying the metastore, which can be easily extended to more complex functions of the ingest_date rather than just taking the max. For instance, for a MySQL metastore I've used:

SELECT MAX(PARTITIONS.PART_NAME) FROM
DBS
INNER JOIN
TBLS ON DBS.DB_ID = TBLS.DB_ID
INNER JOIN
PARTITIONS ON TBLS.TBL_ID = PARTITIONS.TBL_ID
PARTITIONS DBS.NAME = 'db'
PARTITIONS TBLS.TBL_NAME = 'my_table'

Then the output will be in format partition_name=partition_value.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
josoler
  • 1,393
  • 9
  • 15