
I am able to read text files in HDFS into an Apache Crunch pipeline, but now I need to read Hive partitions. The problem is that, as per our design, I am not supposed to access the files directly, so I need some way to access the partitions using something like HCatalog.

Jijo Mathew

1 Answer


You can use the org.apache.hadoop.hive.metastore API or the HCat API. Here is a simple example of using hive.metastore. You would have to make the call to one or the other before starting your Pipeline, unless you want to join to some Hive partition in the mapper/reducer.

import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

HiveConf hiveConf = new HiveConf();
HiveMetaStoreClient hiveClient = new HiveMetaStoreClient(hiveConf);
List<Partition> partitions = hiveClient.listPartitions("default", "my_hive_table", (short) 1000);
for (Partition partition : partitions) {
    System.out.println("HDFS data location of the partition: " + partition.getSd().getLocation());
}

The only other thing you will need is to export the hive conf dir:

export HIVE_CONF_DIR=/home/mmichalski/hive/conf
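Once you have the partition locations from the metastore, you could feed them into your Crunch pipeline as ordinary text sources. Below is a minimal sketch under assumptions not in the original answer: the database/table names are hypothetical, and MRPipeline with From.textFile is used just to illustrate reading each partition directory; adapt it to your own job setup.

import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class ReadHivePartitions {
    public static void main(String[] args) throws Exception {
        HiveConf hiveConf = new HiveConf();
        HiveMetaStoreClient hiveClient = new HiveMetaStoreClient(hiveConf);

        // Hypothetical database/table names; replace with your own.
        List<Partition> partitions = hiveClient.listPartitions("default", "my_hive_table", (short) 1000);

        Pipeline pipeline = new MRPipeline(ReadHivePartitions.class, hiveConf);
        for (Partition partition : partitions) {
            // Read the partition's HDFS directory as a text source.
            PCollection<String> lines = pipeline.read(From.textFile(partition.getSd().getLocation()));
            // ... process or union 'lines' here ...
        }
        pipeline.done();
        hiveClient.close();
    }
}

This keeps the metastore lookup outside the pipeline, so the job only ever sees HDFS paths and never touches the files by name in your own code.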
Marcin