
I have created a Tabular Dataset using the Azure ML Python API. The data in question is a set of Parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2 and spread across multiple partitions. When I trigger the "Generate Profile" operation for the dataset, it throws the following error while handling an empty Parquet file, and profile generation then stops.

User program failed with ExecutionError: 
Error Code: ScriptExecution.StreamAccess.Validation
Validation Error Code: NotSupported
Validation Target: ParquetFile
Failed Step: 77866d0a-8243-4d3d-8bc6-599d466488dd
Error Message: ScriptExecutionException was caused by StreamAccessException.
  Failed to read Parquet file at: <my_blob_path>/20211217.parquet
    Current parquet file is not supported.
      Exception of type 'Thrift.Protocol.TProtocolException' was thrown.
| session_id=6be4db0b-bdc1-4dd6-b8a6-6e9466f7bc54

By empty Parquet file, I mean that if I read the individual file using pandas (pd.read_parquet), the result is an empty DataFrame (df.empty == True).
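The emptiness check described above could be sketched as follows, as a possible way to find the offending files before building the dataset. This is a minimal sketch under stated assumptions, not part of the Azure ML API: the helper names `parquet_row_count` and `split_empty_files` are hypothetical, the paths are assumed to be readable from the local filesystem (for example downloaded or mounted copies of the ADLS files), and a file counts as "empty" when its footer metadata reports zero rows.

```python
from typing import Callable, Iterable, List, Tuple


def parquet_row_count(path: str) -> int:
    """Return the row count stored in a Parquet file's footer metadata."""
    # Imported lazily so split_empty_files can be used with another counter
    # when pyarrow is not installed. Assumes `path` is locally readable.
    import pyarrow.parquet as pq

    return pq.ParquetFile(path).metadata.num_rows


def split_empty_files(
    paths: Iterable[str],
    row_count: Callable[[str], int] = parquet_row_count,
) -> Tuple[List[str], List[str]]:
    """Split paths into (non_empty, empty) according to their row counts."""
    non_empty: List[str] = []
    empty: List[str] = []
    for path in paths:
        if row_count(path) == 0:
            empty.append(path)
        else:
            non_empty.append(path)
    return non_empty, empty
```

Reading only the footer avoids loading each file into a DataFrame, which may matter with around 10K files. The non-empty paths could then be used to define the Tabular Dataset on package versions that still have the bug.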

Any suggestions for avoiding this error would be appreciated.

Update: The issue has been fixed in the following versions:

  • azureml-dataprep : 3.0.1
  • azureml-core : 1.40.0
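
To check whether an environment already has the fixed versions listed above, something like the following could be used. It is a rough sketch: the helper names are hypothetical, the version parsing only handles plain dotted numeric versions (pre-release suffixes are ignored), and it assumes that any version at or above the ones listed contains the fix.

```python
from importlib import metadata
from itertools import takewhile

# Minimum versions reported above as containing the fix.
FIXED_IN = {"azureml-dataprep": "3.0.1", "azureml-core": "1.40.0"}


def version_tuple(version: str) -> tuple:
    """Convert a dotted version such as '1.40.0' to (1, 40, 0)."""
    numbers = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first non-numeric component
        numbers.append(int(digits))
    return tuple(numbers)


def has_fix(installed: dict) -> bool:
    """True if every package in FIXED_IN is installed at or above its minimum."""
    return all(
        name in installed
        and version_tuple(installed[name]) >= version_tuple(minimum)
        for name, minimum in FIXED_IN.items()
    )


def installed_versions() -> dict:
    """Look up the installed versions of the packages in FIXED_IN."""
    found = {}
    for name in FIXED_IN:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # package not installed in this environment
    return found
```

Calling `has_fix(installed_versions())` in the environment that runs the profile job indicates whether an upgrade is likely needed. Versions are compared as integer tuples so that 1.40.0 is treated as newer than 1.9.0.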
Arnab Biswas

2 Answers


Thanks for reporting it. This is a bug in the handling of Parquet files that have columns but an empty row set. It has already been fixed, and the fix will be included in the next release.

  • I can confirm that this issue is fixed in the latest versions: azureml-dataprep (3.0.1), azureml-core (1.40.0). Thanks for all your work! – Arnab Biswas Mar 30 '22 at 12:32

Error Code: ScriptExecution.StreamAccess.Validation

The error above occurs because you are not able to access ADLS.

You can create an Azure app identity and assign it read access to ADLS. Then register ADLS as a datastore in the workspace using the client ID and secret of the app identity. After these steps, your code will be able to access the datastore.

Reference: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-network-security-overview#configure-a-datastore-to-use-managed-identity

Abhishek K
  • That's not correct. I am able to generate profile on a Tabular Dataset created on a subset of the parquet files (located in the same ADLS). This subset doesn't have any empty parquet file. – Arnab Biswas Feb 10 '22 at 12:51