I am trying to build a simple demand forecasting model with Azure AutoML in a Synapse notebook, using Spark and the SQL context.
After aggregating the item quantity by date and item ID, my data is stored in the event_file_processed.parquet file with three columns: the_date, items_id, and total_item_qty.
The date range is from 2020-08-13 to 2021-02-08.
I am following this Microsoft documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast
Here is how I split the data into the train_data and test_data parquet files:
%%sql
CREATE OR REPLACE TEMPORARY VIEW train_data AS
SELECT *
FROM event_file_processed
WHERE the_date <= '2020-12-20'
ORDER BY the_date ASC
%%sql
CREATE OR REPLACE TEMPORARY VIEW test_data AS
SELECT *
FROM event_file_processed
WHERE the_date > '2020-12-20'
ORDER BY the_date ASC
%%pyspark
train_data = spark.sql("SELECT * FROM train_data")
train_data.write.parquet("train_data.parquet")
test_data = spark.sql("SELECT * FROM test_data")
test_data.write.parquet("test_data.parquet")
Below are my AutoML settings and run submission:
from azureml.automl.core.forecasting_parameters import ForecastingParameters

forecasting_parameters = ForecastingParameters(
    time_column_name='the_date',
    forecast_horizon=44,
    time_series_id_column_names=["items_id"],
    freq='W',
    target_lags='auto',
    target_aggregation_function='sum',
    target_rolling_window_size=3,
    short_series_handling_configuration='auto'
)
train_data = spark.read.parquet("train_data.parquet")
train_data.createOrReplaceTempView("train_data")
label = "total_item_qty"
from azureml.core.workspace import Workspace
from azureml.core.experiment import Experiment
from azureml.train.automl import AutoMLConfig
import logging
automl_config = AutoMLConfig(
    task='forecasting',
    primary_metric='normalized_root_mean_squared_error',
    experiment_timeout_minutes=15,
    enable_early_stopping=True,
    training_data=train_data,
    label_column_name=label,
    n_cross_validations=3,
    enable_ensembling=False,
    verbosity=logging.INFO,
    forecasting_parameters=forecasting_parameters
)
from azureml.core import Workspace, Datastore
# Enter your workspace subscription, resource group, name, and region.
subscription_id = "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"  # you should be owner or contributor
resource_group = "XXXXXXXXXXX"  # you should be owner or contributor
workspace_name = "XXXXXXXXXXX"  # your workspace name
ws = Workspace(workspace_name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)
experiment = Experiment(ws, "AML-demand-forecasting-synapse")
local_run = experiment.submit(automl_config, show_output=True)
best_run, fitted_model = local_run.get_output()
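For context on the settings above, here is a minimal sketch of how the date ranges compare with the forecast horizon when counted in weekly periods. It assumes a period is a whole 7-day span; the helper `whole_weeks` and the variable names are hypothetical and only for illustration. The actual number of periods AutoML derives with freq='W' may differ slightly, since weekly bins are anchored to a weekday.

```python
from datetime import date


def whole_weeks(start: date, end: date) -> int:
    """Return the number of complete 7-day spans between two dates."""
    return (end - start).days // 7


# Training window: start of the data up to the split date used above.
train_weeks = whole_weeks(date(2020, 8, 13), date(2020, 12, 20))

# Full data range, train and test together.
total_weeks = whole_weeks(date(2020, 8, 13), date(2021, 2, 8))

# Value passed to ForecastingParameters in this post.
forecast_horizon = 44

print("weeks of training data:", train_weeks)
print("weeks in the full range:", total_weeks)
print("forecast horizon (weeks):", forecast_horizon)
```

With these dates the training window is noticeably shorter than the requested horizon, which may or may not be related to the error below.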
I am stuck on the following error:
DataException: DataException:
Message: An invalid value for argument [y] was provided.
InnerException: InvalidValueException: InvalidValueException:
Message: Assertion Failed. Argument y is null. Target: y. Reference Code: b7440909-05a8-4220-b927-9fcb43fbf939
InnerException: None
ErrorResponse
I have checked that there are no null or rogue values in total_item_qty, and the schema types for all three columns are correct.
Any suggestions would be much appreciated.
Thanks, Shantanu Jain