
I have created a TabularDataset using the Azure ML Python API. The data in question is a set of Parquet files (~10K files of about 330 KB each) residing in Azure Data Lake Gen 2, spread across multiple partitions. When I try to load the dataset with TabularDataset.to_pandas_dataframe(), the call hangs indefinitely if the dataset includes empty Parquet files. If the dataset doesn't include those empty files, TabularDataset.to_pandas_dataframe() completes within a few minutes.

By an empty Parquet file, I mean that if I read the individual file using pandas (pd.read_parquet()), the result is an empty DataFrame (df.empty == True).
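To make that definition concrete, here is a minimal sketch of how the empty files could be separated from the rest using the same df.empty check. The helper name split_empty_parquet and its read parameter are hypothetical (they are not part of pandas or Azure ML), and reading ~10K files one by one this way may be slow, so treat it as a diagnostic aid rather than a recommended fix.

```python
import pandas as pd


def split_empty_parquet(paths, read=pd.read_parquet):
    """Split paths into (non_empty, empty) lists using the df.empty check.

    `split_empty_parquet` and `read` are illustrative names, not library APIs.
    `read` defaults to pd.read_parquet, which needs a Parquet engine
    (pyarrow or fastparquet) to be installed.
    """
    non_empty, empty = [], []
    for path in paths:
        df = read(path)
        # A file with column metadata but zero rows yields df.empty == True.
        if df.empty:
            empty.append(path)
        else:
            non_empty.append(path)
    return non_empty, empty
```

With the two lists in hand, a dataset could in principle be defined over the non-empty paths only, which matches the observation that loading completes when the empty files are excluded.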

I discovered the root cause while working on another issue mentioned [here][1].

My question is: how can I make TabularDataset.to_pandas_dataframe() work even when the dataset includes empty Parquet files?

Update: The issue has been fixed in the following versions:

  • azureml-dataprep : 3.0.1
  • azureml-core : 1.40.0
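A quick way to see whether an environment already has the fixed versions is to compare the installed package versions against the two listed above. The sketch below uses only the standard library; the names FIXED_VERSIONS, parse_version and outdated_packages are hypothetical helpers, and the simple parser assumes plain numeric versions such as 1.40.0 (it ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions reported as fixed in the update above.
FIXED_VERSIONS = {"azureml-dataprep": "3.0.1", "azureml-core": "1.40.0"}


def parse_version(text):
    """Turn '1.40.0' into (1, 40, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def outdated_packages(installed=None):
    """Return {package: found_version_or_None} for packages older than the fix.

    `installed` is an optional {name: version} mapping; when omitted, the
    versions are looked up in the current Python environment.
    """
    outdated = {}
    for name, minimum in FIXED_VERSIONS.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = version(name)
            except PackageNotFoundError:
                found = None  # package is not installed at all
        if found is None or parse_version(found) < parse_version(minimum):
            outdated[name] = found
    return outdated
```

Calling outdated_packages() with no arguments inspects the current environment; an empty dictionary means both packages are at or above the fixed versions. Versions are compared as integer tuples rather than strings so that 1.9.0 correctly sorts before 1.40.0.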
Arnab Biswas

1 Answer


Thanks for reporting it. This is a bug in the handling of Parquet files that have columns but an empty row set. It has already been fixed and will be included in the next release.

I could not repro the hang on multiple files, though, so it would be nice if you could provide more info on that.

  • Out of the 10K Parquet files, only a few (~100) are empty. If I read an individual empty file using pandas, the DataFrame has column information but no rows. I have provided sample Parquet files (empty as well as normal) to the support engineers. – Arnab Biswas Mar 05 '22 at 12:38
  • I think that's what I used to repro :) We would not keep column info when no rows are present; however, reading multiple files where some are empty would work. Reading such an empty file on its own would also work and return an empty pandas DataFrame. – Andrei Liakhovich Mar 07 '22 at 03:52
  • I can confirm that this issue is fixed in the latest versions: azureml-dataprep (3.0.1), azureml-core (1.40.0). Thanks for all your work! – Arnab Biswas Mar 30 '22 at 12:29