Using Microsoft's Open Datasets (here), I'd like to create (external) tables in my Databricks environment, available for consumption in the Databricks SQL environment and by external BI tools, that point at this parquet source.
I'm a bit confused about the right approach. Here's what I've tried.
Approach 1: I tried to create a mount point (/mnt/taxiData) to the open/public Azure store, from which I'd use the normal CREATE TABLE dw.table USING PARQUET LOCATION '/mnt/taxiData', using the Python code below. However, I get an ERROR: Storage Key is not a valid base64 encoded string.
Note: This Azure store is open and public. There is no key, and no dbutils.secrets.get call is required.
# The account key is left empty because this is an open, public store;
# that empty key is what triggers the base64 error.
storageAccountName = "azureopendatastorage"
storageAccountAccessKey = r""
blobContainerName = "nyctlc"

dbutils.fs.mount(
    source = "wasbs://{}@{}.blob.core.windows.net".format(blobContainerName, storageAccountName),
    mount_point = "/mnt/taxiData",
    extra_configs = {"fs.azure.account.key." + storageAccountName + ".blob.core.windows.net": storageAccountAccessKey}
)
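For reference, the Azure Open Datasets samples I've seen don't mount at all; they read the public container directly after registering an empty SAS token for it, so the WASB driver never asks for an account key. A minimal sketch of that pattern, reusing the names from the snippet above (the relative path 'yellow' is my assumption):

# Read the public container directly with an empty SAS token instead of mounting.
blob_account_name = "azureopendatastorage"
blob_container_name = "nyctlc"
blob_relative_path = "yellow"
blob_sas_token = r""  # the dataset is public, so the SAS token is empty

wasbs_path = "wasbs://{}@{}.blob.core.windows.net/{}".format(
    blob_container_name, blob_account_name, blob_relative_path)

# Register the (empty) SAS token for this container so Spark doesn't look for an account key.
spark.conf.set(
    "fs.azure.sas.{}.{}.blob.core.windows.net".format(blob_container_name, blob_account_name),
    blob_sas_token)

df = spark.read.parquet(wasbs_path)
df.printSchema()

Is there an equivalent trick for dbutils.fs.mount, or is mounting simply the wrong tool for a public store?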
Approach 2: Use CREATE TABLE dw.table USING PARQUET LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/', since the wasbs location is open/public.
Result: I get no error, but I end up with an external table that has the correct schema and no data.
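My guess is that this happens because the yellow data is laid out in puYear=/puMonth= partition folders, so the table is created with the right schema but with no partition metadata registered. If that's right, something like the sketch below (hypothetical table name dw.nyc_yellow; the partition layout is my assumption) should populate it, but I'd like confirmation that this is the intended fix:

# Create the external table over the public path, then recover the partitions.
spark.sql("""
    CREATE TABLE IF NOT EXISTS dw.nyc_yellow
    USING PARQUET
    LOCATION 'wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/'
""")

# If the table is registered as partitioned (puYear/puMonth), the partition
# metadata has to be recovered before any rows show up.
spark.sql("MSCK REPAIR TABLE dw.nyc_yellow")
spark.sql("SELECT COUNT(*) FROM dw.nyc_yellow").show()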
Approach 3: If I create a DataFrame (df = spark.read.parquet('wasbs://..')), it works fine; I get a DataFrame with data. However, if I then add .write.saveAsTable('dw.table'), it copies the data from the open store over to my /dbfs store, which is not what I want.
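Since the direct path read works, one no-copy alternative I'm considering is a persistent view over a path-based query, so Databricks SQL and BI tools can query the data without it ever landing in /dbfs. A sketch (the view name dw.nyc_yellow_v is just a placeholder):

# Wrap the direct parquet read in a persistent view; no data is copied.
spark.sql("""
    CREATE OR REPLACE VIEW dw.nyc_yellow_v AS
    SELECT * FROM parquet.`wasbs://nyctlc@azureopendatastorage.blob.core.windows.net/yellow/`
""")

I'm not sure, though, how well a view over a wasbs path behaves from a Databricks SQL warehouse or an external BI tool.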
What is the best approach here, and how do I fix it?