Using Microsoft's Open Datasets, I'd like to create external tables in my Databricks environment that are available for consumption in the Databricks SQL environment and by external BI tools, all pointing at this Parquet source.

I'm a bit confused about the right approach. Here's what I've tried.

Approach 1: I tried to create a mount point (/mnt/taxiData) to the open/public Azure store, from which I'd use the normal CREATE TABLE dw.table USING PARQUET LOCATION '/mnt/taxiData', using the following Python code. However, I get the error: Storage Key is not a valid base64 encoded string.

Note: This Azure store is open and public. There is no key, and no secret is required.

storageAccountName = "azureopendatastorage"
storageAccountAccessKey = r""   # the store is public, so there is no key
blobContainerName = "nyctlc"

# This call fails with: Storage Key is not a valid base64 encoded string
dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = "/mnt/taxiData",
    extra_configs = {'fs.azure.account.key.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
)

Approach 2: Use CREATE TABLE dw.table USING PARQUET LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/', since the wasbs location is open/public.

Result: I get no error, and an external table with the correct schema is created, but it contains no data.

Approach 3: If I create a DataFrame (df = spark.read.parquet('wasbs://..), it works fine and I get a DataFrame with data. However, if I then add .write.saveAsTable('dw.table'), it copies the data from the open store over to my /dbfs store, which is not what I want.
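For reference, a minimal sketch of Approach 3 (using the full wasbs path from Approach 2; dw.table is just the example table name used above):

# Reading the public Parquet source directly into a DataFrame works fine
df = spark.read.parquet("wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/")
df.show(5)

# ...but saveAsTable() creates a managed table, copying the data into /dbfs
df.write.saveAsTable("dw.table")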

Best approach & fix?

  • Approach #1: My bad, I should have used the following key provider config: 'fs.azure.sas.' + blobContainerName + '.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey (this worked) – ExoV1 Feb 03 '22 at 13:21
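Based on that comment, the corrected mount call would look roughly like this (the SAS key provider config with an empty value, since the container is public):

storageAccountName = "azureopendatastorage"
storageAccountAccessKey = ""  # public container, no SAS token needed
blobContainerName = "nyctlc"

# Use the SAS key provider config instead of the account-key config
dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = "/mnt/taxiData",
    extra_configs = {'fs.azure.sas.' + blobContainerName + '.' + storageAccountName + '.blob.core.windows.net': storageAccountAccessKey}
)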

1 Answer

For Approach 1, I think the check in dbutils.fs.mount is too strict; it makes sense to report this as an error to Azure support.

For Approach 2, it's not enough to create the table; you also need to discover its partitions (Parquet isn't Delta, where partitions are discovered automatically). You can do that with the MSCK REPAIR TABLE SQL command, like this:

MSCK REPAIR TABLE dw.table

and after it's finished you'll see the data when you query the table.

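Putting it together, a sketch of the full flow from a notebook (using the wasbs path and the dw.table name from the question) could look like this:

# Create the external table over the public Parquet location
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.table
    USING PARQUET
    LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/'
""")

# Discover the partitions -- without this step the table appears empty
spark.sql("MSCK REPAIR TABLE dw.table")

# Verify that the data is now visible
spark.sql("SELECT COUNT(*) FROM dw.table").show()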

Alex Ott
  • Thank you. That worked. QQ: Will I need to run the MSCK REPAIR script again tomorrow (as a job) as Microsoft adds new data, or is this a 'one and done' process? – ExoV1 Jan 27 '22 at 18:48
  • For Parquet you need to run it periodically when new partitions are added to the table. Unfortunately, that's a limitation of Hive/Parquet. – Alex Ott Jan 27 '22 at 19:03
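For the periodic refresh mentioned above, a minimal sketch would be a single notebook cell scheduled as a Databricks job that just re-runs the repair:

# Re-discover any partitions added to the public dataset since the last run
spark.sql("MSCK REPAIR TABLE dw.table")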