
I'm new to Delta Lake and considering using it for one of my projects, with S3 or GCS as the file storage. I would like to understand how data cataloging works. Does open source Delta Lake automatically create and maintain a data catalog when we create Delta tables, or do we need to call any APIs to register the table metadata with a catalog? Any pointers to documentation would be helpful.


1 Answer


Delta Lake just writes files into some storage (cloud storage, HDFS, etc.). By itself it doesn't provide any data cataloging functionality, but it can be used with a Hive Metastore (via Spark), AWS Glue, etc. The answer really depends on what stack you're using for your project.
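
For the Hive Metastore route via Spark, a minimal sketch is below. The bucket path, database, and table names are placeholders, and it assumes a Spark session with Delta Lake on the classpath, Hive support, and the relevant cloud storage connector already configured:

```python
# Sketch: create Delta files in cloud storage and register them in the
# metastore Spark is configured with (e.g. a Hive Metastore).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-catalog-example")
    # Standard Delta Lake session settings from the Delta Lake quickstart
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Tables get registered in whatever metastore Hive support points at
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")  # placeholder database

df = spark.range(0, 5)

# Option 1: write and register the table in one step (external table,
# hypothetical GCS path)
(df.write.format("delta")
   .option("path", "gs://my-bucket/tables/events")
   .saveAsTable("analytics.events"))

# Option 2: write the Delta files first, then register them with DDL
df.write.format("delta").save("gs://my-bucket/tables/events_raw")
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.events_raw
  USING DELTA
  LOCATION 'gs://my-bucket/tables/events_raw'
""")
```

Either way, the catalog entry is just a pointer to the Delta table's location; the transaction log in the storage path remains the source of truth for the table's schema and data files.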

Alex Ott
  • Let's say I'm using GCP and I write to a Delta table with GCS as the storage. In that case, what options are available to register the Delta table with a Hive catalog? Any direction or pointers to documentation would be greatly appreciated (I tried to find some but could not locate any). – user16798185 Jun 21 '23 at 04:16
  • I don't know about GCP's catalog offering; I'm not an expert in that area of GCP. – Alex Ott Jun 21 '23 at 20:51
  • What about AWS? I read about saveAsTable on Delta Lake and other options, and I understand how the catalog syncs after the initial setup. However, I would like to understand more about choosing/setting up a catalog for the first time. – user16798185 Jun 22 '23 at 04:15
  • Databricks comes with a built-in Hive Metastore, although Unity Catalog is recommended as it's closer to a real cataloging solution. On AWS that could be AWS Glue as well. – Alex Ott Jun 22 '23 at 06:36
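
On the AWS/Glue point in the comments, a hedged sketch of what this can look like on EMR is below. The `hive.metastore.client.factory.class` setting is how EMR points the Hive client at the Glue Data Catalog; the bucket and table names are placeholders, and the exact wiring may differ on other platforms:

```python
# Sketch: Spark on EMR using the AWS Glue Data Catalog as its metastore,
# so saveAsTable registers the Delta table in Glue.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-glue-example")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # EMR-specific: route Hive Metastore calls to the Glue Data Catalog
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")  # placeholder database

# The Glue Catalog entry points at the Delta files on S3 (hypothetical path)
(spark.range(0, 5)
      .write.format("delta")
      .option("path", "s3://my-bucket/tables/events")
      .saveAsTable("analytics.events"))
```

Once registered, other Glue-aware tools can discover the table by name, while the Delta transaction log on S3 remains authoritative for the data itself.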