Goal

I'm trying to check whether a partially-specified partition exists in a Hive table.

Details

I have a table with two partition keys, source and date. Before a task can execute, I need to check whether any partition exists for a given date, regardless of source (source is not specified).

Attempts

I can do this easily with luigi's built-in hive partition target and the default client:

>>> import luigi.hive as hive
>>> c = hive.HivePartitionTarget('data',{"date":"2016-03-31"})
>>> c.exists()
True
>>> c = hive.HivePartitionTarget('data',{"date":"2016-03-32"})
>>> c.exists()
False

But the default client is really, really slow because it spins up a command-line instance of hive and runs a query. So I tried to swap the default client out for the thrift one, and this happened:

>>> d = hive.HivePartitionTarget('data',{"date":"2016-03-31"}, client=hive.MetastoreClient())
>>> d.exists()
False

It appears that the two clients interpret partially-specified partitions differently.
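
To make the suspected difference concrete, here is a minimal sketch. It assumes the metastore reports partitions as name strings of the form source=a/date=2016-03-31 (the shape I would expect get_partition_names to return) and that the thrift-based check compares the whole spec exactly. The helper names and the sample data are hypothetical, not luigi's.

```python
def parse_partition_name(name):
    """Turn 'source=a/date=2016-03-31' into {'source': 'a', 'date': '2016-03-31'}."""
    return dict(part.split("=", 1) for part in name.split("/"))


def matches_partial_spec(partition_name, partial_spec):
    """True if every key/value pair in partial_spec appears in the partition."""
    full_spec = parse_partition_name(partition_name)
    return all(full_spec.get(key) == value for key, value in partial_spec.items())


def any_partition_matches(partition_names, partial_spec):
    """True if at least one partition name satisfies the partial spec."""
    return any(matches_partial_spec(name, partial_spec) for name in partition_names)


# Hypothetical sample data standing in for the metastore's partition names.
names = ["source=a/date=2016-03-31", "source=b/date=2016-03-30"]
wanted = {"date": "2016-03-31"}

# Exact comparison: the partial dict never equals a fully parsed partition.
exact_hit = wanted in [parse_partition_name(n) for n in names]

# Subset comparison: the partial dict is contained in the first partition.
partial_hit = any_partition_matches(names, wanted)
```

Under those assumptions, an exact comparison can never succeed for a spec that omits source, which would explain the False result above.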

I have already written my own client that inherits from MetastoreClient and adds some additional functions I needed in the past, so I don't mind adding a partially-specified partition check of my own design. And it looks like the client has the functions I need:

>>> from pprint import pprint
>>> import luigi.hive as hive
>>> client = hive.HiveThriftContext().__enter__()
>>> pprint([command for command in dir(client) if 'partition' in command])
[ # Note: I deleted the irrelevant commands, this was a really long list
 'get_partition',
 'get_partition_by_name',
 'get_partition_names',
 'get_partition_names_ps',
 'get_partition_with_auth',
 'get_partitions',
 'get_partitions_by_filter',
 'get_partitions_by_names',
 'get_partitions_ps',
 'get_partitions_ps_with_auth',
 'get_partitions_with_auth',
 # Even more commands snipped here
 ]

It looks like the command get_partitions_by_filter might do exactly what I want, but I can't find any documentation for it anywhere aside from auto-generated lists of the types it expects. And I've run into similar problems with the simpler functions: when I fully specify partitions that I know exist, I can't get get_partition or get_partition_by_name to find them. I am sure this is because I am not providing arguments in the right format, but I don't know what the correct format is, and my patience for guessing has run out.
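
For reference, this is a rough sketch of the kind of check I have in mind. It rests on assumptions taken from the Hive metastore Thrift interface rather than from anything I have verified through luigi: that the call is get_partitions_by_filter(db_name, tbl_name, filter, max_parts), and that the filter is a string such as date = "2016-03-31" with clauses joined by "and". The function names build_partition_filter and partial_partition_exists are hypothetical.

```python
def build_partition_filter(partial_spec):
    """Build a filter string such as: date = "2016-03-31" and source = "a".

    Assumes string-typed partition keys and values without double quotes;
    no escaping is attempted here.
    """
    clauses = ['{} = "{}"'.format(key, value)
               for key, value in sorted(partial_spec.items())]
    return " and ".join(clauses)


def partial_partition_exists(client, database, table, partial_spec):
    """Return True if at least one partition matches the partial spec.

    `client` is any object exposing the assumed Thrift method
    get_partitions_by_filter(db_name, tbl_name, filter, max_parts).
    """
    filter_string = build_partition_filter(partial_spec)
    # max_parts=1: only existence matters, so one match is enough.
    partitions = client.get_partitions_by_filter(database, table, filter_string, 1)
    return len(partitions) > 0


# Intended (untested) use with luigi's thrift context:
#     with hive.HiveThriftContext() as client:
#         partial_partition_exists(client, 'default', 'data', {'date': '2016-03-31'})
```

If the same conventions hold for the simpler calls, get_partition would take the partition values as a list in partition-key order (for example ['a', '2016-03-31']), and get_partition_by_name would take a single name string (for example 'source=a/date=2016-03-31'), but that is also a guess on my part.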

What is the syntax for HiveThriftContext's get_partitions_by_filter command?

Follow up question: How did you figure this out?

Noah
  • I don't mess with Python, but the JavaDoc for method `HiveMetaStoreClient.getPartition(String, String, List)` is there: http://hive.apache.org/javadocs/r1.1.1/api/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.html#getPartition(java.lang.String,%20java.lang.String,%20java.util.List) – Samson Scharfrichter Jun 02 '16 at 17:26
  • That is exactly what I needed. Thank you! – Noah Jun 02 '16 at 18:03
