I am trying to build a simple demand forecasting model using Azure AutoML in Synapse Notebook using Spark and SQL Context.

After aggregating the item quantity by date and item id, this is what my data looks like in the event_file_processed.parquet file:

[image of the aggregated data]

The date range is from 2020-08-13 to 2021-02-08.

I am following this documentation by MS: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast

Here's how I have divided my train_data and test_data parquet files:

%%sql
CREATE OR REPLACE TEMPORARY VIEW train_data
AS SELECT 
*
FROM
event_file_processed
WHERE
the_date <= '2020-12-20'
ORDER BY
the_date ASC

%%sql
CREATE OR REPLACE TEMPORARY VIEW test_data
AS SELECT 
*
FROM
event_file_processed
WHERE
the_date > '2020-12-20'
ORDER BY
the_date ASC

%%pyspark
train_data = spark.sql("SELECT * FROM train_data")
train_data.write.parquet("train_data.parquet")

test_data = spark.sql("SELECT * FROM test_data")
test_data.write.parquet("test_data.parquet")

Below are my AutoML settings and run submission:

from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(time_column_name='the_date', 
                                               forecast_horizon=44,
                                               time_series_id_column_names=["items_id"],
                                               freq='W',
                                               target_lags='auto',
                                               target_aggregation_function = 'sum',
                                               target_rolling_window_size = 3,
                                               short_series_handling_configuration = 'auto'
                                               )

train_data = spark.read.parquet("train_data.parquet")
train_data.createOrReplaceTempView("train_data")

label = "total_item_qty"

from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging

automl_config = AutoMLConfig(task='forecasting',
                             primary_metric='normalized_root_mean_squared_error',
                             experiment_timeout_minutes=15,
                             enable_early_stopping=True,
                             training_data=train_data,
                             label_column_name=label,
                             n_cross_validations=3,
                             enable_ensembling=False,
                             verbosity=logging.INFO,
                             forecasting_parameters = forecasting_parameters)

from azureml.core import Workspace, Datastore

# Enter your workspace subscription, resource group, name, and region.
subscription_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" #you should be owner or contributor
resource_group = "XXXXXXXXXXX" #you should be owner or contributor
workspace_name = "XXXXXXXXXXX" #your workspace name

ws = Workspace(workspace_name = workspace_name,
               subscription_id = subscription_id,
               resource_group = resource_group)
               
experiment = Experiment(ws, "AML-demand-forecasting-synapse")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
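
As a quick sanity check on these settings, the length of the training window can be compared with the requested horizon. This is a minimal sketch that uses only the dates and values quoted above. It assumes that with freq='W' the forecast_horizon of 44 is counted in weeks, and it does not claim that this is the cause of the error.

```python
from datetime import date

# Dates taken from the post: training rows satisfy the_date <= '2020-12-20'.
train_start = date(2020, 8, 13)
train_end = date(2020, 12, 20)

forecast_horizon = 44    # value passed to ForecastingParameters
n_cross_validations = 3  # value passed to AutoMLConfig

# Number of complete weeks covered by the training window.
train_weeks = (train_end - train_start).days // 7

print(f"training window: {train_weeks} whole weeks")
print(f"requested horizon: {forecast_horizon} weeks")
if forecast_horizon >= train_weeks:
    print("horizon is longer than the training window")
```

If the horizon was meant in days rather than weeks, a daily frequency or a shorter weekly horizon may be closer to what is intended.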

I am stuck on the error below:

Error:

DataException: DataException:
Message: An invalid value for argument [y] was provided.
InnerException: InvalidValueException: InvalidValueException:
Message: Assertion Failed. Argument y is null. Target: y. Reference Code: b7440909-05a8-4220-b927-9fcb43fbf939
InnerException: None
ErrorResponse

I have checked that there are no null or rogue values in total_item_qty, and the schema types for all three columns are correct.
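
A check of this kind can be sketched as follows. This is a minimal, hypothetical helper (find_bad_targets is not part of any library) that works on plain Python dicts, for example rows obtained with [r.asDict() for r in train_data.collect()] on a small DataFrame. It flags NaN as well as None, since Spark's isNull() does not report NaN values.

```python
import math
from decimal import Decimal

def find_bad_targets(rows, target="total_item_qty"):
    """Return (index, value) pairs whose target is missing, NaN, infinite, or non-numeric."""
    bad = []
    for i, row in enumerate(rows):
        value = row.get(target)
        if value is None:
            bad.append((i, value))
        elif isinstance(value, bool) or not isinstance(value, (int, float, Decimal)):
            # Strings and other types are treated as rogue values.
            bad.append((i, value))
        else:
            as_float = float(value)
            if math.isnan(as_float) or math.isinf(as_float):
                bad.append((i, value))
    return bad

# Illustrative rows only, not data from the post.
sample_rows = [
    {"the_date": "2020-08-13", "items_id": 1, "total_item_qty": 5},
    {"the_date": "2020-08-14", "items_id": 1, "total_item_qty": None},
    {"the_date": "2020-08-15", "items_id": 1, "total_item_qty": float("nan")},
    {"the_date": "2020-08-16", "items_id": 2, "total_item_qty": "7"},
]
bad_rows = find_bad_targets(sample_rows)
print([i for i, _ in bad_rows])
```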

Any suggestions would be much appreciated.

Thanks, Shantanu Jain

1 Answer

I am assuming you are not using the notebooks that the Synapse UI generates. If you use the wizard in Synapse, it generates a PySpark notebook that you can run and tweak. That experience is described here: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-automl

There are two issues:

  1. Since you are running from Synapse, you probably intend to run AutoML on Spark compute. In that case, you need to pass a Spark context to the AutoMLConfig constructor: spark_context=sc

  2. You seem to be passing a Spark DataFrame to AutoML as the training data. In the Spark scenario, AutoML currently supports only AML Dataset (TabularDataset) inputs. You can convert the DataFrame like this:

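from azureml.core import Datastore
from azureml.data.dataset_factory import TabularDatasetFactory

df = spark.sql("SELECT * FROM default.nyc_taxi_train")
datastore = Datastore.get_default(ws)
# Register the Spark DataFrame as a TabularDataset; experiment_name is your experiment's name
dataset = TabularDatasetFactory.register_spark_dataframe(df, datastore, name = experiment_name + "-dataset")
automl_config = AutoMLConfig(spark_context = sc, training_data = dataset,....)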
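
Putting both changes together with the settings from the question might look like the sketch below. It assumes an Azure Synapse notebook where spark and sc already exist, that the azureml SDK is installed, and that ws, label, and forecasting_parameters are defined as in the question. build_automl_config and its dataset_name argument are hypothetical names introduced only for illustration.

```python
def build_automl_config(spark, sc, ws, label, forecasting_parameters,
                        dataset_name="AML-demand-forecasting-synapse-dataset"):
    """Sketch: register the training view as a TabularDataset and build an AutoMLConfig."""
    # Imported inside the function so the definition itself does not require the SDK.
    from azureml.core import Datastore
    from azureml.data.dataset_factory import TabularDatasetFactory
    from azureml.train.automl import AutoMLConfig

    # Read the training rows as a Spark DataFrame (the view name comes from the question).
    df = spark.sql("SELECT * FROM train_data")

    # Convert the Spark DataFrame into a registered TabularDataset.
    datastore = Datastore.get_default(ws)
    dataset = TabularDatasetFactory.register_spark_dataframe(
        df, datastore, name=dataset_name
    )

    # Pass the dataset and the Spark context; most other settings are copied from the question.
    return AutoMLConfig(
        task="forecasting",
        primary_metric="normalized_root_mean_squared_error",
        experiment_timeout_minutes=15,
        enable_early_stopping=True,
        training_data=dataset,
        label_column_name=label,
        n_cross_validations=3,
        enable_ensembling=False,
        forecasting_parameters=forecasting_parameters,
        spark_context=sc,
    )
```

In the notebook this would be called as automl_config = build_automl_config(spark, sc, ws, label, forecasting_parameters) before experiment.submit(automl_config, show_output=True).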

Also curious to learn more about your use case and how you intend to use AutoML in Synapse. Please let me know if you would be interested to connect on that topic.

Thanks, Nellie (from the Azure Synapse Team)
