
I just linked an Azure storage account (ADLS Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories. Inside the folders are .csv files.

I have connected an Azure service principal with Storage Blob Data Contributor access to the storage account inside Databricks, so I can read and write to the storage account.

I am trying to figure out the best way to convert the existing storage account into a Delta Lake: tables registered in the metastore, plus the files converted to Parquet (Delta tables).

What is the easiest way to do that?

My naive approach as a beginner might be

  1. Read the folder using spark.read.format("csv").load("{container}@{storage}..../directory")

  2. Write to a new folder with a similar name (so if the folder is directory, write it to directory_parquet) using df.write.format("delta").save("{container}@{storage}.../directory_parquet")

And then I'm not sure about the last steps. This would create a new folder with a new set of files, and I do get Parquet files, but it wouldn't be a table in Databricks that shows up in the Hive metastore.

Alternatively, I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it writes into the Databricks file system instead, although it does show up in the Hive metastore.

  3. Delete the existing data files if desired (or have them duplicated)

Preferably this can be done in a Databricks notebook using Python, or Scala/SQL if necessary (a sketch of steps 1–2 follows below).
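For concreteness, here is a minimal PySpark sketch of steps 1–2, assuming hypothetical names (raw for the container, mystorageacct for the storage account) and the abfss:// URI scheme in place of the truncated paths above:

    # Hypothetical container and storage-account names; substitute the real abfss:// paths.
    src = "abfss://raw@mystorageacct.dfs.core.windows.net/directory"
    dst = "abfss://raw@mystorageacct.dfs.core.windows.net/directory_parquet"

    # Step 1: read the CSV folder (header/schema options depend on the actual files)
    df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load(src))

    # Step 2: write the same data back out in Delta format to a sibling folder
    df.write.format("delta").mode("overwrite").save(dst)

At this point the Delta files exist in the container, but nothing is registered in the metastore, which is exactly the gap described above.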

*As a possible fallback, if the effort to do this properly is monumental: just convert to Parquet and register table information for each subfolder in the Hive metastore, in the form database = containerName, tableName = subdirectoryName (sketched below).

The folder structure is pretty flat at the moment, only rootContainer/subfolder deep.
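A rough sketch of that per-subfolder fallback, assuming the same hypothetical names as above and that the container and folder names are valid Hive identifiers; dbutils.fs.ls enumerates the first-level subfolders:

    container = "raw"            # hypothetical container name
    account = "mystorageacct"    # hypothetical storage account name
    root = f"abfss://{container}@{account}.dfs.core.windows.net/"

    # One database per container
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {container}")

    # One external Delta table per first-level subfolder
    for entry in dbutils.fs.ls(root):
        if not entry.isDir():
            continue
        folder = entry.name.rstrip("/")   # assumed to be a valid table name
        df = spark.read.format("csv").option("header", "true").load(entry.path)
        (df.write.format("delta")
           .mode("overwrite")
           .option("path", f"{root}{folder}_delta")   # external location next to the CSVs
           .saveAsTable(f"{container}.{folder}"))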

brian_ds

1 Answer


Perhaps an external table is what you're looking for:

df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename") 

This post has more info on external tables vs managed tables.
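Applied to the layout in the question, that would look roughly like the following (hypothetical abfss:// paths; the data files stay in the storage account while the table is registered in the Hive metastore):

    df = (spark.read.format("csv")
          .option("header", "true")
          .load("abfss://raw@mystorageacct.dfs.core.windows.net/directory"))

    (df.write.format("delta")
       .option("path", "abfss://raw@mystorageacct.dfs.core.windows.net/directory_delta")
       .saveAsTable("directory"))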

Powers
  • This gets me basically there, thanks. One question: when I run it as .option("path", ...).saveAsTable() I get the files into storage and the metadata into Hive as required. But I noticed that if I go into the storage account and delete a Parquet file there, it doesn't reflect in the Databricks notebook. (Process: add rows, add more rows, delete the first file; Databricks thinks both files are still there, even after a VACUUM command.) I know that is bad form, but because I am working in a sandbox, I want to understand all the behaviors. – brian_ds Sep 28 '22 at 18:17
  • So I assume the best practice is to never delete data inside a Delta table directly from the storage account? But because I'm not the data engineer, I can't prohibit that in case a file is corrupted or needs to be modified. So how do you handle a scenario like the above, where a file gets deleted outside Databricks? – brian_ds Sep 28 '22 at 18:25
  • @brian_ds - thanks for the responses. If my answer answers your question, can you please mark the question as answered? If you have additional follow-up questions, can you please start a new question thread? Thank you. – Powers Sep 28 '22 at 19:09