Invalidate metadata/refresh Impala from Spark code

Question

I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.

Currently this invalidation is done after my Spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.

What would be the most efficient approach?

Oozie is just too slow (30 sec overhead? no thanks)
An SSH action to an (edge) node seems like a valid solution but feels "hackish"
I don't see a way to do this from the HiveContext in Spark either.

About Spark `HiveContext` : it enables a job to interact with the Hive **Metastore**, in client/server mode. But it is completely unaware of what other jobs are doing against the Metastore at the same time -- i.e. other Spark jobs, Pig jobs, Impala queries, Hive CLI queries, HiveServer2 queries, Hue browsing sessions... — Samson Scharfrichter, Jul 06 '16 at 10:33

score 11 · Accepted Answer · answered Jul 06 '16 at 11:15

REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to run them; they trigger a refresh of the Impala-specific metadata cache. In your case you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE to rebuild the list of all partitions and all their files from scratch.
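To make the difference between the two commands concrete, here is a minimal sketch of a helper that builds the statement text. The function names, the identifier check, and the optional `PARTITION` clause on `REFRESH` (supported only in newer Impala releases, as far as I know) are assumptions of this sketch, not something from the answer itself.

```python
import re

# Conservative pattern for unquoted identifiers (an assumption of this sketch).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    """Reject anything that is not a plain identifier, to avoid SQL injection."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def _literal(value):
    """Render a partition value: integers as-is, everything else single-quoted."""
    if isinstance(value, int):
        return str(value)
    return "'" + str(value).replace("'", "\\'") + "'"


def refresh_statement(db, table, partition=None):
    """Cheap option: reload the file list of a table, or of one partition."""
    sql = f"REFRESH {_check(db)}.{_check(table)}"
    if partition:
        spec = ", ".join(f"{_check(k)}={_literal(v)}" for k, v in partition.items())
        sql += f" PARTITION ({spec})"
    return sql


def invalidate_statement(db, table):
    """Expensive option: discard and rebuild all cached metadata for the table."""
    return f"INVALIDATE METADATA {_check(db)}.{_check(table)}"
```

For a job that only appends files to existing partitions, `refresh_statement("somedb", "sometable")` should be enough; `invalidate_statement` is the fallback when the table itself was created or changed outside Impala.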

You could use the Spark SQLContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:

download the latest Cloudera JDBC driver for Impala
install it on the server where you run your Spark job
list all the JARs in your *.*.extraClassPath properties
develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way
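The steps above can be sketched from PySpark as well, by reaching `java.sql.DriverManager` through Spark's Py4J gateway rather than writing Scala. This is only a sketch under several assumptions: the Impala JDBC driver JAR is on the driver classpath and registers itself, `spark.sparkContext._jvm` (a private PySpark attribute) is available, and the `jdbc:impala://...:21050` URL and host name are placeholders to adapt to your driver and cluster.

```python
def run_impala_statement(jvm, jdbc_url, statement):
    """Open a JDBC session through the given JVM view and run one statement.

    `jvm` is expected to expose `java.sql.DriverManager`, as Spark's Py4J
    gateway does; the statement can be any command, e.g. REFRESH.
    """
    conn = jvm.java.sql.DriverManager.getConnection(jdbc_url)
    try:
        stmt = conn.createStatement()
        try:
            stmt.execute(statement)
        finally:
            stmt.close()
    finally:
        # Always release the session, even if the statement failed.
        conn.close()


# Hypothetical usage at the end of a PySpark job (host and port are placeholders):
#
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   run_impala_statement(
#       spark.sparkContext._jvm,
#       "jdbc:impala://impalad-host:21050",
#       "REFRESH somedb.sometable",
#   )
```

Passing the JVM handle in as a parameter keeps the function independent of how the Spark session is created, and makes it easy to exercise with a stub before pointing it at a real daemon.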

Hopefully Google will find some examples of JDBC/Scala code such as this one

score 1 · Answer 2 · answered Oct 16 '19 at 10:14

Seems this has been fixed by Impala 3.3.0 (cf. Section "Metadata Performance Improvements" here):

Automatic invalidation of metadata

With automatic metadata management enabled, you no longer have to issue INVALIDATE / REFRESH in a number of conditions.

In Impala 3.3, the following additional event in Hive Metastore can trigger automatic INVALIDATE / REFRESH of Metadata:

INSERT into tables and partitions from Impala or from Spark on the same or multiple cluster configuration

score 0 · Answer 3 · edited Apr 16 '19 at 10:06

None of the steps above are required. You can build the impala-shell command below and run it to issue an INVALIDATE METADATA query against the Impala table.

impala_node_ip_address = "XX.XX.XX.XX"
# -i: Impala daemon to connect to; -k: use Kerberos; -q: run this query and exit
impala_query = 'impala-shell -i "' + impala_node_ip_address + '" -k -q "invalidate metadata DBNAME.TableName"'
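The snippet above only builds the command string. One possible way to actually run it from Python is sketched below, using an argument list with `subprocess.run` so no shell quoting is needed. The function names are made up for this sketch, and it assumes `impala-shell` is installed and on the `PATH` of the machine running the driver. Note that this still launches an external process rather than talking to Impala from inside the job.

```python
import subprocess


def impala_shell_command(impalad_host, statement, kerberos=True):
    """Build the impala-shell argument list for a single statement."""
    cmd = ["impala-shell", "-i", impalad_host]
    if kerberos:
        cmd.append("-k")  # authenticate with Kerberos, as in the original command
    cmd += ["-q", statement]
    return cmd


def run_impala_shell(impalad_host, statement, kerberos=True):
    """Run the statement; raises CalledProcessError if impala-shell fails."""
    return subprocess.run(
        impala_shell_command(impalad_host, statement, kerberos),
        check=True,
    )
```

For example, `run_impala_shell("XX.XX.XX.XX", "invalidate metadata DBNAME.TableName")` would run the same command as the string above.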

answered Apr 16 '19 at 09:42 by Shashank hande · edited Apr 16 '19 at 10:06 by Ramprasath Selvam
As stated in the question, I would like to do it from my code, not as an external script. There is/was no other option than going the JDBC route. – Havnar Apr 17 '19 at 09:49
