Error in the MS Azure autoML preparation - wrong file format / encoding?

Question

I am trying to run Azure automated machine learning (AutoML), following this GitHub example:

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/classification-bank-marketing

I changed the code there to feed it with my data, but I am getting the following error when executing the autoML run:

automl.client.core.common.exceptions.DataprepException: Could not execute the specified transform.

The traceback points to: File "/azureml-envs/azureml_e9e27206cd19de471f4e5c7a1171037e/lib/python3.6/site-packages/azureml/automl/core/dataprep_utilities.py", line 50, in try_retrieve_pandas_dataframe_adb

At first I thought something was wrong with my data, but then I ran the following experiment with the original csv file:

1. First execution: as in the GitHub example, building the dataflow directly from the http link.
2. Second execution: building the dataflow from the same csv, but downloaded to my share.

In the second case I got the same error as with my own data. This would mean that the Azure AutoML run / dataflow / preparation process accepts only a specific file format, which was changed when the file was saved to my drive. I am not sure whether this is about encoding or something else. Could you please advise?

########################################
#Case 1, Error returned

data = r"\\dwdf219\...\bankmarketing_train.csv"  # UNC path on a Windows file share (raw string avoids escape issues)
dflow = dprep.auto_read_file(data)
dflow.get_profile()
X_train = dflow.drop_columns(columns=['y'])
y_train = dflow.keep_columns(columns=['y'], validate_column_exists=True)
dflow.head()

# Train
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 5,
    "n_cross_validations": 2,
    "primary_metric": 'AUC_weighted',
    "preprocess": True,
    "max_concurrent_iterations": 5,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             path = project_folder,
                             run_configuration=conda_run_config,
                             X = X_train,
                             y = y_train,
                             **automl_settings
                            )     

remote_run = experiment.submit(automl_config, show_output = True)


########################################
#Case 2, all works fine

data = "https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/bankmarketing_train.csv"
dflow = dprep.auto_read_file(data)
dflow.get_profile()
X_train = dflow.drop_columns(columns=['y'])
y_train = dflow.keep_columns(columns=['y'], validate_column_exists=True)
dflow.head()

# Train ...
###################################

score 0 · Answer 1 · answered Jul 20 '19 at 00:48

For a remote run, the path passed to dprep is resolved on the remote compute target, so the file must be accessible from that (Linux) machine.

The Linux remote understands https URLs and datastore references, but it cannot read a Windows-style file share (\\dwdf219\...\bankmarketing_train.csv in this case).
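To catch this before submitting a run, a small pre-flight check on the path string can help. The following is only a rough sketch: the function name looks_remote_readable is made up for illustration and is not part of the Azure ML SDK, and it only tells http(s) URLs apart from everything else (UNC shares, drive-letter paths, relative paths). Datastore references are objects rather than strings, so they are outside its scope.

```python
from urllib.parse import urlparse


def looks_remote_readable(path: str) -> bool:
    """Heuristic check (hypothetical helper, not an SDK function).

    Returns True only for http/https URLs, which a Linux remote can fetch.
    Returns False for UNC shares (\\\\server\\share\\...), Windows drive-letter
    paths and plain local paths, none of which exist on the remote machine.
    """
    # A UNC path starts with two backslashes; reject it right away.
    if path.startswith("\\\\"):
        return False
    # Anything else is accepted only if it parses as an http(s) URL.
    scheme = urlparse(path).scheme.lower()
    return scheme in ("http", "https")
```

If the check returns False for the path you are about to hand to dprep.auto_read_file in a remote run, that is a hint to upload the file to a datastore first.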

One solution is to pass the data through a datastore.

You can upload the file to the workspace's default datastore with:

ds = ws.get_default_datastore()
ds.upload(src_dir='./myfolder', target_path='mypath', overwrite=True, show_progress=True)

and then pass a datastore reference to auto_read_file:

dflow = dprep.auto_read_file(path=ds.path('mypath/bankmarketing_train.csv'))

The sample notebook auto-ml-remote-amlcompute.ipynb shows this approach.
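Putting the upload and the read together, a minimal sketch might look like the helper below. The function name dataflow_via_datastore and its parameters are my own for illustration. It uses only the calls already shown in this answer (get_default_datastore, upload, path, auto_read_file), and it takes the workspace and the dprep module as arguments so that the helper itself has no hard dependency on the SDK.

```python
def dataflow_via_datastore(ws, dprep, local_dir, file_name, target_path="mypath"):
    """Upload local_dir to the default datastore and build a dataflow from one file.

    ws         -- an already loaded Azure ML workspace object
    dprep      -- the azureml.dataprep module (passed in, not imported here)
    local_dir  -- local folder that contains the csv
    file_name  -- name of the csv inside local_dir
    target_path -- folder inside the datastore (hypothetical default)
    """
    ds = ws.get_default_datastore()
    # Copy the local folder to the datastore so the remote can reach it.
    ds.upload(src_dir=local_dir, target_path=target_path,
              overwrite=True, show_progress=True)
    # Datastore paths use forward slashes regardless of the local OS.
    remote_path = f"{target_path}/{file_name}"
    # Hand dprep a datastore reference instead of a local or UNC path.
    return dprep.auto_read_file(path=ds.path(remote_path))
```

With the SDK installed you would call it roughly as dflow = dataflow_via_datastore(ws, dprep, './myfolder', 'bankmarketing_train.csv'), where ws is your workspace and dprep is azureml.dataprep, and then build X_train and y_train from dflow exactly as in the question.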
