
I am trying to use the Azure Machine Learning Python SDK v2 to run a project. Step one is ingesting a Delta table. According to the docs, this is a standard part of the SDK: you create an in-memory MLTable artefact, then interact with it and save it to a Data Asset.

However, even the most minimal example fails. Running this:

from mltable import from_delta_lake

gives me an error telling me that from_delta_lake does not exist. I can tab-complete mltable and see the other from_* methods mentioned in the docs, but not the Delta Lake one. I am on mltable 0.1.0b4, though I have also tried 0.1.0b1, 0.1.0b2 and 0.1.0b3, all with similar results. I am using the standard Azure Machine Learning Workspace environment, Python 3.10, SDK v2.

Has anyone else encountered this? Do the Delta Lake methods appear for you? Thanks.

Andrew

1 Answer

from_delta_lake was made available from version 1.0.0 onwards. Therefore, you should update to the latest version, using:

pip install -U mltable
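After upgrading, you can confirm the installed version meets the 1.0.0 minimum before importing. A small stdlib-only sketch (the helper names here are my own, not part of mltable):

```python
from importlib.metadata import PackageNotFoundError, version

def version_tuple(ver: str) -> tuple:
    """Turn '1.5.0' into (1, 5, 0); non-numeric pre-release parts like '0b4' are dropped."""
    return tuple(int(part) for part in ver.split(".") if part.isdigit())

def has_min_version(package: str, minimum: tuple) -> bool:
    """Return True if `package` is installed at or above `minimum`."""
    try:
        return version_tuple(version(package)) >= minimum
    except PackageNotFoundError:
        return False

# After `pip install -U mltable`, this should report True:
# has_min_version("mltable", (1, 0, 0))
```

Note that the beta versions mentioned in the question (e.g. 0.1.0b4) compare below (1, 0, 0), which is why `from_delta_lake` is missing there.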

Using the mltable Python SDK, you can read Delta files into Pandas using:

import mltable

# This example uses the abfss protocol, but you can also use a long-form
# azureml URI, for example:
# azureml://subscriptions/<subid>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<datastore_name>/paths/<path>

uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/<path>"

tbl = mltable.from_delta_lake(uri, timestamp_as_of="2023-10-01T00:00:00Z")
df = tbl.to_pandas_dataframe()

If you use long-form AzureML Datastore URIs (azureml://), you can copy and paste them by navigating to the Data browsing UI in AzureML Studio, as shown in the tip below:

[Screenshot: copy-and-paste an AzureML Datastore URI]
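If you prefer to build these long-form URIs in code rather than copy-pasting, a hypothetical helper (not part of any SDK) that assembles the URI from its components might look like:

```python
def azureml_datastore_uri(subscription_id: str, resource_group: str,
                          workspace: str, datastore: str, path: str) -> str:
    """Assemble a long-form AzureML datastore URI from its components."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace}"
        f"/datastores/{datastore}"
        f"/paths/{path}"
    )

# Placeholders as in the comment above:
uri = azureml_datastore_uri("<subid>", "<rg_name>", "<ws_name>",
                            "<datastore_name>", "<path>")
```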

You can also create an MLTable file that defines the transformation:

type: mltable

# Paths are relative to the location of the MLTable file and should *not* be absolute paths.
# The path below - ./ - assumes the MLTable file will be stored in the same folder
# containing the delta logs, parquet files, etc.

paths:
  - folder: ./ 

transformations:
  - read_delta_lake:
      timestamp_as_of: '2022-08-26T00:00:00Z'

You can add more transforms to the MLTable file (e.g. take a sample, keep columns, etc.). You should store it in the same folder as the data on cloud storage:

/
└── my-data
    ├── _change_data
    ├── _delta_index
    ├── _delta_log
    ├── MLTable    << MLTable file co-located with data
    ├── part-0000-xxx.parquet
    └── part-0001-xxx.parquet

This makes the MLTable a self-contained artifact: everything that is needed is stored in that one folder, regardless of whether the folder is on your local drive, in your cloud store, or on a public HTTP server. A consumer can simply load the table from the folder and materialize it into Pandas using:

import mltable

# Here the URI points to the *folder* on cloud storage that contains the MLTable file
uri = "abfss://<filesystem>@<account_name>.dfs.core.windows.net/my-data"
tbl = mltable.load(uri)
tbl.to_pandas_dataframe()

You can then create a data asset in AzureML using the Python SDK (or CLI or Studio UI):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path="abfss://<filesystem>@<account_name>.dfs.core.windows.net/my-data",
    type=AssetTypes.MLTABLE,
    description="<description>",
    name="<name>",
    version='<version>'
)

ml_client.data.create_or_update(my_data)

An AzureML data asset is analogous to bookmarks (favourites) in your web browser: rather than having to remember long URIs (storage locations) for your most-used data, you can create a data asset and then access it by a friendly name.

Sam Kemp