
I am using the Python version of the Polars library to read a Parquet file with a large number of rows. Here is the link to the library - https://github.com/pola-rs/polars

I am trying to read a Parquet file from an Azure storage account using the read_parquet method. I can see there is a storage_options argument which can be used to specify how to connect to the data storage. Here is the definition of the read_parquet method -

def read_parquet(
    source: str | Path | BinaryIO | BytesIO | bytes,
    columns: list[int] | list[str] | None = None,
    n_rows: int | None = None,
    use_pyarrow: bool = False,
    memory_map: bool = True,
    storage_options: dict[str, object] | None = None,
    parallel: ParallelStrategy = "auto",
    row_count_name: str | None = None,
    row_count_offset: int = 0,
    low_memory: bool = False,
    pyarrow_options: dict[str, object] | None = None,
) -> DataFrame:

Can anyone let me know what values I need to provide as part of storage_options to connect to the Azure storage account if I am using a system-assigned managed identity? Unfortunately I could not find any example for this. Most of the examples use connection strings and access keys, and for security reasons I cannot use them.

edit: I just came to know that the storage_options are passed to another library called fsspec. But I have no idea about it.

Niladri
  • This is something that is handled by `fsspec` not by Polars. Maybe these links help you: https://github.com/fsspec/adlfs/issues/226 & https://github.com/fsspec/adlfs ? – jvz Oct 22 '22 at 15:45

2 Answers


This code should work:

import pandas as pd
storage_options = {'account_name': '<account>', 'sas_token': '<token>'}
df = pd.read_parquet('abfs://<container>@<account>.dfs.core.windows.net/<parquet path>', storage_options=storage_options)
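For completeness, since the question is about Polars: the same storage_options dict should work with pl.read_parquet as well, because Polars forwards storage_options to fsspec/adlfs for remote paths. A minimal sketch, assuming adlfs is installed and the placeholders are filled in:

import polars as pl

# Same fsspec/adlfs options as the pandas example above;
# the account, container, token and path placeholders must be filled in.
storage_options = {'account_name': '<account>', 'sas_token': '<token>'}
df = pl.read_parquet('abfs://<container>@<account>.dfs.core.windows.net/<parquet path>', storage_options=storage_options)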
  • Thanks, but I got my solution. I am not using a SAS token; I am using managed identity. I will post my answer. – Niladri Jan 31 '23 at 16:51

I finally figured out the solution. Anyone who is looking to use managed identity to connect to an Azure Data Lake Storage Gen2 account can follow the steps below. As someone mentioned in the comments, Polars uses the fsspec and adlfs Python libraries to connect to remote files in Azure cloud storage. To connect using managed identity we can use the below code -

import polars as pl

storage_options = {'account_name': ACCOUNT_NAME, 'anon': False}
df = pl.read_parquet(source=<remote-file-path>, columns=<list of columns>, storage_options=storage_options)

This will try to use DefaultAzureCredential from the azure.identity library to connect to the storage account. If you already have managed identity enabled for your Azure resource with the proper RBAC permissions, you should be able to connect.
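If you prefer an explicit credential object instead of the DefaultAzureCredential chain, the adlfs README (linked below) also documents a credential option that accepts an azure.identity credential. A minimal sketch, assuming adlfs and azure-identity are installed; ManagedIdentityCredential and the placeholder path here are illustrative, not part of the Polars API:

import polars as pl
from azure.identity import ManagedIdentityCredential

# 'credential' is an adlfs option, not a Polars one;
# Polars forwards storage_options to fsspec/adlfs unchanged.
credential = ManagedIdentityCredential()  # system-assigned managed identity
storage_options = {'account_name': ACCOUNT_NAME, 'credential': credential}
df = pl.read_parquet('abfs://<container>/<parquet path>', storage_options=storage_options)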

Documentation : https://github.com/fsspec/adlfs#setting-credentials

Niladri
Wanted to add that you should also actually install `adlfs` and `fsspec`, as without these dependencies the code above won't trigger the right subroutines. Polars doesn't depend on these libraries (it labels them as "heavy/optional third party libs"), but if you have them, you get these extra features. – Oliver W. Jun 21 '23 at 22:13
  • @OliverW. correct – Niladri Jul 07 '23 at 16:33
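
Following up on the comment above: a quick way to fail fast if the optional dependencies are missing (a hedged sketch; the module names are the standard import names for these packages):

import importlib.util

# fsspec and adlfs are optional for Polars, so check them up front
# rather than hitting a less obvious error during the remote read.
for mod in ('fsspec', 'adlfs'):
    if importlib.util.find_spec(mod) is None:
        raise ImportError(f"missing optional dependency '{mod}'; install it to read from Azure storage")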