I am trying to connect and authenticate to an existing Delta Table in Azure Data Lake Storage Gen 2 using the Delta-rs Python API. I found the Delta-rs library from this StackOverflow question: Delta Lake independent of Apache Spark?

However, the documentation for Delta-rs (https://delta-io.github.io/delta-rs/python/usage.html and https://docs.rs/object_store/latest/object_store/azure/enum.AzureConfigKey.html#variant.SasKey) is quite vague regarding the authentication and connection process to Azure Data Lake Storage Gen 2. I am having trouble finding a clear example that demonstrates the required steps.

Can someone provide a step-by-step guide or example on how to connect and authenticate to an Azure Data Lake Storage Gen 2 Delta table using the Delta-rs Python API?

user__42

1 Answer

You can use the following Python code to interact with a Delta table on Azure Data Lake Storage (ADLS) Gen2 using a SAS token for authentication. This code reads a CSV file from an ADLS container, appends its contents to a Delta table, and prints some metadata.

First, make sure you have the required libraries installed:

pip install deltalake pandas numpy

Then, use this Python script:

import deltalake as dl
from deltalake.writer import write_deltalake
import pandas as pd
import numpy as np

# Define your SAS token, storage account name, container name, and file path
sas_token = "<please_generate_sas_token_using_a_stored_access_policy>"
storage_account_name = "mystorage"
container_name = "test-container"
csv_file = "test_delta/test_csv_data/products1.csv"
delta_path = "test_delta/light_delta_lake"
 
# CSV URL with the SAS token appended as a query string
csv_url = f"https://{storage_account_name}.dfs.core.windows.net/{container_name}/{csv_file}?{sas_token}"
 
# Choose the protocol (abfs or abfss)
# https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri
protocol = "abfss"  # Use "abfs" for non-secure connections
 
# Construct the URL for the specified folder
delta_url = f"{protocol}://{container_name}@{storage_account_name}.dfs.core.windows.net/{delta_path}"
 
# Pass the SAS token as a storage option (it can also be set via an environment variable)
storage_options = {"SAS_TOKEN": f"{sas_token}"}
 
print(csv_url.replace(sas_token, "<SECRET>"))
print(' ')
print(str(storage_options).replace(sas_token, "<SECRET>"))
print(delta_url.replace(sas_token, "<SECRET>"))

# Read the Delta table from the storage account 
dt = dl.DeltaTable(delta_url, storage_options=storage_options)
 
# Print the schema and file URIs of the Delta table
print(dt.schema())
print(dt.file_uris())
 
# Print the history of the Delta table as a DataFrame
print(pd.DataFrame(dt.history()))
 
# Read the CSV file, derive an integer "stars" column, drop unused columns,
# and cast "rating_count" to int32
data = (
    pd.read_csv(csv_url)
    .assign(stars=lambda df: df['rating'].astype(np.int32))
    .drop(['description', 'ingredients'], axis=1)
    .astype({'rating_count': np.int32})
)
print(data.head())
 
# Append the DataFrame to the Delta table
write_deltalake(table_or_uri=dt, data=data, mode="append")
 
# Print the updated file URIs and history of the Delta table
print(dt.file_uris())
print(pd.DataFrame(dt.history()))