I have downloaded a trained model from Azure Machine Learning. It was trained with Automated ML, using the Time Series forecasting preset.

When I run predictions, I get this warning:

NumericalizeTransformer: Column AircraftModel contains categories not present at fit: {('42',)}. These categories will be set to NA prior to encoding.
  .format(col, new_cats))
Column Operator contains categories not present at fit: {('US Airlines',)}. These categories will be set to NA prior to encoding.
  .format(col, new_cats))

My code for running the forecast is this:

import json

import joblib
import pandas as pd

def load_model():
    # Load the pickled Automated ML model into a module-level variable.
    global model
    model_path = 'model.pkl'
    model = joblib.load(model_path)

def run_forecast(data):
    try:
        # Split the target column off from the feature columns.
        y_query = data.pop('y_query').values
        #y_query.fill(np.nan)
        result = model.forecast(data, y_query)
    except Exception as e:
        result = str(e)
        return json.dumps({"error": result})

    # The first element of the result holds the forecast values.
    forecast_as_list = result[0].tolist()

    return forecast_as_list

input_sample = pd.DataFrame(data=[{'AircraftId': 'ATR-0001', 'FromDate': '2016-09-01T00:00:00.000Z', 'AircraftModel': '42', 'Operator': 'US Airlines', 'Country': 'Denmark', 'MonthOfYear': 9, 'y_query': 1.0}])

load_model()

# Pass the sample DataFrame, not the built-in input function.
forecast = run_forecast(input_sample)

I do get a result back, but it is quite bad, and I suspect the categories that are set to NA are the culprit.

Should I manually do some pre-processing before running inference on the model?

MartinHN

1 Answer

It looks like the data you're trying to score has categorical levels that were not seen during training (in the AircraftModel and Operator columns). Can you please check your training data and see whether the missing levels ('42' and 'US Airlines') are present there?

If they are not, Automated ML is unlikely to produce a good forecast for those rows: as the warning says, unseen categories are set to NA before encoding, so the model has no information about them.
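
One way to run that check is sketched below with plain pandas. This is only an illustration: the helper name find_unseen_categories and the training values are made up for the example, and the real training set would have to be loaded from wherever it is stored. It compares the levels of each categorical column in the scoring data against the levels in the training data.

```python
import pandas as pd


def find_unseen_categories(train_df, score_df, categorical_columns):
    """Return {column: set of values in score_df that never occur in train_df}.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    """
    unseen = {}
    for col in categorical_columns:
        # Compare as strings so that '42' and 42 count as the same level.
        train_levels = set(train_df[col].dropna().astype(str))
        score_levels = set(score_df[col].dropna().astype(str))
        missing = score_levels - train_levels
        if missing:
            unseen[col] = missing
    return unseen


# Made-up training data; replace with the dataset used for the AutoML run.
train_df = pd.DataFrame({
    "AircraftModel": ["72", "72", "600"],
    "Operator": ["Nordic Air", "Nordic Air", "Baltic Wings"],
})

# The scoring row from the question (categorical columns only).
score_df = pd.DataFrame({
    "AircraftModel": ["42"],
    "Operator": ["US Airlines"],
})

unseen = find_unseen_categories(train_df, score_df, ["AircraftModel", "Operator"])
print(unseen)
```

An empty dictionary means every level in the scoring data was also present in the training data. Any column that does show up corresponds to a category the warning says will be set to NA.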

  • Yeah, but given this is AutomatedML I would assume data is split into training, test and validation sets in a balanced way to avoid such a thing? – MartinHN Feb 13 '20 at 10:31