I have a machine learning model in production whose predictions are used to build an Azure ML File Dataset. The Dataset is composed of 94 files with a total size of 8,618 MiB. I'm using a compute instance of type `STANDARD_E4S_V3` and trying to load the Dataset with the following Python code:


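from azureml.core import Workspace, Dataset

# Connect to the workspace described by the local config.json
ws = Workspace.from_config()
# Look up the registered dataset by name
dataset = Dataset.get_by_name(ws, name='features_for_predictions_modelo_ativacao')
# Load the whole dataset into memory as a pandas DataFrame
df = dataset.to_pandas_dataframe()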

Almost 10 minutes have passed and the dataset has still not been loaded into a Python variable. Is this happening because my DataFrame is too large, or because my compute instance is not powerful enough?

  • Are you able to see your data in Azure ML Studio -> Data -> Data Assets -> Explore? – Ram Jan 23 '23 at 10:00
  • Are you seeing any error in the UI? – Ram Jan 23 '23 at 10:01
  • No, it's just taking forever to load. If I set `num_rows = 10000`, which is the size of the table, and run `df = dataset.take(num_rows).to_pandas_dataframe()`, I can see the dataset, but since the jobs don't do that, they keep running forever. – Gabriel Padilha Jan 26 '23 at 14:25

1 Answer

You can preview the files from the Explore tab under Data Assets in the Azure ML Studio UI.

Ram
  • I can preview the files using this tab, but I can't use the Azure Python SDK to load the dataset into a pandas DataFrame or use it in any job (it keeps running forever). If I set `num_rows = 10000`, which is the size of the table, and run `df = dataset.take(num_rows).to_pandas_dataframe()`, I can see the dataset, but since the jobs don't do that, they keep running forever. – Gabriel Padilha Jan 26 '23 at 14:26
  • 1
    You can increase the compute resources of the instance by using a larger compute instance, it may not be necessary to load the entire dataset into memory. If you're only working with a subset of the data or if you're only performing certain operations on the data, you may be able to work with the dataset directly without loading it into a pandas dataframe. – Ram Jan 27 '23 at 07:15
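
One way to act on the last comment is to read the dataset in fixed-size row batches rather than in a single `to_pandas_dataframe()` call. The following is only a minimal sketch: it assumes the registered dataset behaves like an azureml `TabularDataset` (the type that exposes `skip`, `take` and `to_pandas_dataframe`), and the names `iter_batches`, `example_usage`, `total_rows` and `batch_size` are hypothetical helpers introduced for illustration. Whether batching actually avoids the hang depends on how the 94 files are stored, which the thread does not say.

```python
from typing import Any, Iterator


def iter_batches(dataset: Any, total_rows: int, batch_size: int) -> Iterator[Any]:
    """Yield successive slices of `dataset`, each with at most `batch_size` rows.

    Assumption: `dataset` exposes skip(n), take(n) and to_pandas_dataframe(),
    as an azureml TabularDataset does.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for offset in range(0, total_rows, batch_size):
        # Only the rows of this slice are converted to a pandas DataFrame.
        yield dataset.skip(offset).take(batch_size).to_pandas_dataframe()


def example_usage() -> None:
    """Hypothetical usage; needs azureml-core and a workspace config.json."""
    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    dataset = Dataset.get_by_name(ws, name="features_for_predictions_modelo_ativacao")
    # 10000 is the table size quoted in the question; 2000 is an arbitrary batch size.
    for i, batch in enumerate(iter_batches(dataset, total_rows=10000, batch_size=2000)):
        print(i, batch.shape)
```

Each batch can be processed and discarded before the next one is read, so peak memory is bounded by `batch_size` rather than by the full table. Note that `skip` may still have to scan the preceding rows, so this trades memory for repeated reads.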