I'm working with Azure ML for the first time, so please excuse any newbie mistakes!

My training pipeline takes a dataset generated by an ADF data flow, which uses the Pivot transformation to turn rows into columns (the source dataset is a list of projects and their corresponding technologies).

e.g.

Project   Technology
project1  tech1
project1  tech2
project2  tech1
project2  tech3
project3  tech4

The ADF data flow transforms the data to:

Project   tech1  tech2  tech3  tech4
project1  true   true   false  false
project2  true   false  true   false
project3  false  false  false  true
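
For readers who want to reproduce the shape of this data outside ADF, here is a minimal sketch of the same pivot in plain Python. It is only an illustration of the transformation, not the ADF data flow itself; the function name `pivot_to_flags` is made up for this example.

```python
from collections import defaultdict

# Sample rows from the question: (project, technology) pairs.
rows = [
    ("project1", "tech1"),
    ("project1", "tech2"),
    ("project2", "tech1"),
    ("project2", "tech3"),
    ("project3", "tech4"),
]

def pivot_to_flags(rows):
    """Turn (project, technology) rows into one record per project
    with a boolean column for every technology seen in the input."""
    techs = sorted({tech for _, tech in rows})
    used = defaultdict(set)
    for project, tech in rows:
        used[project].add(tech)
    return [
        {"Project": project, **{t: (t in used[project]) for t in techs}}
        for project in sorted(used)
    ]

table = pivot_to_flags(rows)
for record in table:
    print(record)
```

Note that the set of columns depends entirely on which technologies appear in the input, which is why the trained model's schema grows with the technology list.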

Extra columns are added and the transformed data is then written to ADLS Gen2, from where it's ingested into Azure ML. I then created a training pipeline in Azure ML which runs a linear regression model on the data, scoring my label column.

[Image: Training pipeline]

From here I was able to create a real-time inference pipeline with a web service input and output.

[Image: Inference pipeline]

I was able to deploy the endpoint and test it using the test tool on the Endpoint detail page. My issue is that when I remove features from the input JSON (e.g. passing only tech1 and tech2 as booleans), I hit the error: "Input Data Error. Input data are inconsistent with schema"

This makes sense, since the inference pipeline obviously expects features that match the training data. Since the UI calling the ML endpoint won't necessarily know all the available features (i.e. technologies), I need to find a way to add any missing columns dynamically. The list of technologies is long, so they can't be added manually. I think the solution is to join the web service payload to my source dataset, adding any missing columns (features) to the payload.
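
To make concrete what I mean by "adding missing columns", here is a rough sketch in plain Python of the padding step. It assumes the full feature list is available from somewhere (for example, saved at training time); `ALL_FEATURES` and `fill_missing_features` are hypothetical names for illustration, not part of any Azure ML API, and I don't know whether the equivalent can be done inside the designer inference pipeline.

```python
# Hypothetical: the complete feature list the model was trained on.
# In practice this would be loaded from wherever the training schema is stored.
ALL_FEATURES = ["tech1", "tech2", "tech3", "tech4"]

def fill_missing_features(payload, all_features=ALL_FEATURES, default=False):
    """Return a copy of the payload containing every expected feature,
    using `default` for any feature the caller did not supply."""
    unknown = set(payload) - set(all_features)
    if unknown:
        # Fail loudly on features the model has never seen.
        raise ValueError(f"Unknown features: {sorted(unknown)}")
    return {name: payload.get(name, default) for name in all_features}

# The caller only knows about two technologies.
partial = {"tech1": True, "tech2": True}
full = fill_missing_features(partial)
print(full)
```

A step like this would have to run before the payload reaches the schema check, e.g. in whatever service sits between the UI and the endpoint.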

I tried this, but the endpoint failed to deploy with an error saying that the adf-sink datasource is unsupported.

How do I go about fixing this? Thank you!

UPDATE: 4/18/24

I've since found that a better way of tackling this is to join the rows into a single space-delimited column, which I then process using the "Extract N-gram features from Text" component.

My input dataset generated from ADF now looks like:

Project   Technology
project1  tech1 tech2
project2  tech1 tech3
project3  tech4
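
For illustration, the reshaping from one row per technology to one space-delimited string per project looks roughly like this in plain Python. This is only a sketch of the transformation (the real one runs in ADF), and `join_technologies` is a made-up name.

```python
# Sample rows from the question: (project, technology) pairs.
rows = [
    ("project1", "tech1"),
    ("project1", "tech2"),
    ("project2", "tech1"),
    ("project2", "tech3"),
    ("project3", "tech4"),
]

def join_technologies(rows):
    """Collapse (project, technology) rows into one row per project,
    with the technologies joined into a single space-delimited string."""
    grouped = {}
    for project, tech in rows:
        grouped.setdefault(project, []).append(tech)
    return [(project, " ".join(techs)) for project, techs in grouped.items()]

joined = join_technologies(rows)
for project, technologies in joined:
    print(project, technologies)
```

With this shape the input schema is a single text column, so the caller no longer needs to know the full list of technologies.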

The next problem I hit was that my inference pipeline always returns an empty dataset, but I have started a separate thread for that here.

Simon