
I am using Dataproc Metastore as the metastore service on GCP. How can I interact with it to fetch the list of databases and tables? Is it possible to do this without running a Dataproc cluster?

Edit - I have to fetch the metadata without running a Dataproc cluster. Since I am using the Dataproc Metastore service to store the metadata, I need to fetch it directly from that service.

David Rabinowitz
Rahul Sharma
  • With Dataproc cluster, you can ssh into the master node, then run `hive`, then run `show databases` and `show tables`. – Dagang Jun 10 '21 at 16:23
  • @Dagang thank you for your response, but as I mentioned in my question, I need to do this without running a Dataproc cluster. Since I am using the Dataproc Metastore service to store the metadata, is it possible to fetch metadata from it without running a Dataproc cluster? – Rahul Sharma Jun 11 '21 at 09:37
  • Try using their [REST API](https://cloud.google.com/dataproc-metastore/docs/reference/rest?hl=en)? Looking at the docs, it's not obvious which endpoint you need; my guess would be `projects.locations.services.metadataImports.{list,get}`. Since you seem to have a running instance, you should be able to explore the contents yourself and find what you need. – Hitobat Jun 11 '21 at 10:38
  • Actually, reading the docs more, I think the Google APIs only control creating/destroying the service. So first you must find the IP/port of your specific metastore instance. Then, per the Apache Hive docs, the metastore is a Thrift service with this [interface](https://github.com/apache/hive/blob/master/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift). So you must connect using generated Thrift files (e.g. [Java classes](https://github.com/apache/hive/tree/master/standalone-metastore/metastore-common/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api)). – Hitobat Jun 11 '21 at 10:51
  • If you want to interact with the Metastore service, you'll have to use the Thrift API exposed by the Dataproc Metastore service. As Hitobat said, you can use the Java classes as long as you have IP reachability. There is also an integration with Data Catalog, so you can explore the schema etc., but no actual data is shown there. Besides the Java classes, you can interact via Beeline, Spark SQL, or the Hive CLI, but the easiest way by far is to interact via Dataproc. – cjmoberg Jun 16 '21 at 17:53

1 Answer


The Dataproc Metastore API is used to manage the Dataproc Metastore service instance itself (get, create, update, and so on), not to read the metadata stored in it. To read the metadata, as mentioned in one of the comments, you can use the service's Thrift URI (in the console, you will find it under the Configuration tab of the metastore service).

Once you have a Thrift client connected to that URI, you can fetch databases and tables. Although the Thrift API can also create databases and tables, the typical use case is to configure a big data processing engine or framework such as Spark or Hive to use the metastore, rather than to interact with the metastore directly.
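
As a rough illustration of the Thrift approach, here is a minimal Python sketch. It makes several assumptions that are not part of the answer above: the `thrift` Python package is installed, Python bindings have been generated locally from Hive's `hive_metastore.thrift` with `thrift --gen py` (which is expected to produce a `hive_metastore` package), and the machine running the script can reach the metastore's IP address, as the comments point out. The function names `parse_thrift_uri`, `list_metadata` and `fetch_metadata` are hypothetical helpers; `get_all_databases` and `get_all_tables` are calls defined in the Hive Thrift interface.

```python
from urllib.parse import urlparse

DEFAULT_PORT = 9083  # conventional Hive Metastore Thrift port, used as an assumed fallback


def parse_thrift_uri(uri):
    """Split a 'thrift://host:port' endpoint URI into (host, port)."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift" or not parsed.hostname:
        raise ValueError(f"not a thrift:// URI: {uri!r}")
    return parsed.hostname, parsed.port or DEFAULT_PORT


def list_metadata(client):
    """Return {database: [table, ...]} via two calls from hive_metastore.thrift."""
    return {db: list(client.get_all_tables(db)) for db in client.get_all_databases()}


def fetch_metadata(uri):
    """Connect to the metastore endpoint and list its databases and tables."""
    # Third-party imports are deferred so the helpers above work without them.
    from thrift.protocol import TBinaryProtocol
    from thrift.transport import TSocket, TTransport
    # Assumption: bindings generated locally with `thrift --gen py hive_metastore.thrift`.
    from hive_metastore import ThriftHiveMetastore

    host, port = parse_thrift_uri(uri)
    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    client = ThriftHiveMetastore.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    try:
        return list_metadata(client)
    finally:
        transport.close()
```

Calling `fetch_metadata("thrift://10.0.0.5:9083")` (the address is a placeholder for the URI shown in the console) should return a dictionary mapping each database name to its table names, without any Dataproc cluster involved.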

Karthik