
We are using MySQL (Cloud SQL) as the metadata repository for Dataproc. It doesn't store any information about GCS files that are not part of Hive external tables.

Can anyone suggest the best way to store all the file/data details in one catalog in Google Cloud?

Jesse Scherer
user3858193
  • Are you trying to store the metadata of Hive tables in Google Cloud? – Pradeep Bhadani Jan 31 '20 at 17:12
  • I am storing Hive/Spark metadata in Cloud SQL, but I am not able to store the metadata of the GCS files that are not part of a Hive external table. – user3858193 Jan 31 '20 at 19:06
  • Is there a specific reason for not creating Hive external tables on these GCS files? You can always build the metadata yourself with a query and store it in a relational database, but creating a Hive external table is easy. – Pradeep Bhadani Jan 31 '20 at 19:09
  • The files that are used for transformations have external tables created, and those are used in Dataproc processing. But the files we receive from upstream, which just get loaded into BigQuery, don't have external tables created. There is a lag between when the files arrive and when the load happens. I wanted something like Glue. That metadata would be used across all the datasets we receive or generate. – user3858193 Feb 02 '20 at 13:06

2 Answers


The Google Cloud Data Catalog beta doesn't work with GCS or the Hive Metastore. See this doc:

Tagging Cloud Storage assets (for example, buckets and objects) is unavailable in the Data Catalog beta release.

It does work with BigQuery, though; see this quickstart example.

Dagang
  • This answer appears out of date. There are now these docs about sync'ing Data Catalog and Dataproc on GCP https://cloud.google.com/dataproc-metastore/docs/data-catalog-sync – rjurney May 26 '22 at 15:14

dvorzhak,

Data Catalog has since become generally available (GA): Data Catalog GA

The docs for Filesets have also been updated: Data Catalog Filesets
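To give a rough idea of what a fileset entry looks like, here is a minimal sketch that only builds the URL and JSON body for a Data Catalog `entries.create` REST call; it does not send anything. The project, location, entry group, entry ID, and bucket names are all hypothetical placeholders, the helper name `build_fileset_entry_request` is made up for this example, and the entry group is assumed to already exist. Actually sending the request would also need an OAuth access token, which is left out here.

```python
import json
from urllib.parse import urlencode

# Base URL of the Data Catalog REST API (v1).
API_ROOT = "https://datacatalog.googleapis.com/v1"


def build_fileset_entry_request(project, location, entry_group,
                                entry_id, display_name, file_patterns):
    """Return (url, body) for creating a FILESET entry.

    This is an illustrative helper, not part of any Google library.
    """
    # Fileset patterns point at Cloud Storage, so reject anything else early.
    for pattern in file_patterns:
        if not pattern.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage pattern: {pattern!r}")

    # Entries live under an entry group, which is assumed to exist already.
    parent = (f"projects/{project}/locations/{location}"
              f"/entryGroups/{entry_group}")
    url = f"{API_ROOT}/{parent}/entries?{urlencode({'entryId': entry_id})}"

    body = {
        "displayName": display_name,
        "type": "FILESET",
        "gcsFilesetSpec": {"filePatterns": list(file_patterns)},
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical values: replace with your own project and bucket.
    url, body = build_fileset_entry_request(
        project="my-project",
        location="us-central1",
        entry_group="upstream_files",
        entry_id="daily_orders",
        display_name="Daily orders",
        file_patterns=["gs://my-bucket/orders/*.csv"],
    )
    print(url)
    print(json.dumps(body, indent=2))
```

One entry with a wildcard pattern like this can describe a whole set of upstream files, which may be enough for files that only get loaded into BigQuery and never get a Hive external table.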

Also, if you want to create Data Catalog assets for each of your Cloud Storage objects, you can use this open source script: datacatalog-util, which has an option to create Entries for your files.

Finally, there's an open source connector script if you want to ingest Hive databases/tables into Data Catalog.

mesmacosta