
I am facing this error while trying to transform data with my scikit-learn model.

The model is built as follows:

import os

import joblib
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

feature_columns_names = [
    'transaction_id', 'created_at', 'amount', 'device_model', 'device_mode',
    'transaction_sum', 'daily_amt_ratio', 'monthly_amt_ratio'
]
label_column = "is_fraud"
non_scaled_cols = ['created_at', 'device_model', 'device_mode', 'transaction_id', 'is_fraud']
numeric_features = [col for col in feature_columns_names if col not in non_scaled_cols]
categorical_features = ['device_model','device_mode']

numeric_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0), 
    StandardScaler())

categorical_transformer = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="unknown"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    remainder="drop")

preprocessor.fit(data)

joblib.dump(preprocessor, os.path.join(args.model_dir, "model.joblib"))
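
For reference, a minimal round-trip check in the training environment looks like this (a sketch only; it assumes the same `data` DataFrame used above, and `data.head()` is just an illustrative sample):

# Sanity check: reload the dumped preprocessor and transform a small sample
# in the same environment it was fitted in (illustrative only).
import sklearn
print("sklearn version at training time:", sklearn.__version__)

reloaded = joblib.load(os.path.join(args.model_dir, "model.joblib"))
print(reloaded.transform(data.head()))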

Here is my code for loading and using the model to transform my data:

import os

import joblib
import pandas as pd

feature_columns_dtype = {
    'transaction_id': 'object',
    'created_at': 'object',
    'amount': 'float64',
    'device_model': 'object',
    'device_mode': 'object',
    'transaction_sum': 'float64',
    'daily_amt_ratio': 'float64',
    'monthly_amt_ratio': 'float64',
}

label_column_dtype = {"is_fraud": "int64"}  


def merge_two_dicts(x, y):
    z = x.copy()  # start with x's keys and values
    z.update(y)  # modifies z with y's keys and values & returns None
    return z

df = pd.read_csv('s3://data/dataset_sample.csv',
                 header=None, 
                 names=feature_columns_names + [label_column],
                 dtype=merge_two_dicts(feature_columns_dtype, label_column_dtype))

if len(df.columns) == len(feature_columns_names) + 1:
    # This is a labelled example, includes the fraud label
    df.columns = feature_columns_names + [label_column]
elif len(df.columns) == len(feature_columns_names):
    # This is an unlabelled example.
    df.columns = feature_columns_names

model = joblib.load(os.path.join(model_dir, "model.joblib"))

model.transform(df)

The model loads correctly, as does the data, but the last line (calling transform on df) produces the error:

AttributeError: 'ColumnTransformer' object has no attribute '_feature_names_in'

I have made sure that the scikit-learn version I am using is the same as the model's version, that feature names are provided, and that the input data is passed correctly. Any clue what could be causing the error?
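
For reference, this is the kind of check I mean by making sure the versions match (illustrative only; run the same lines in both the training and the inference environment and compare the output):

# Print interpreter and library versions in both environments and diff the output.
import sys
import sklearn
import joblib

print("python :", sys.version)
print("sklearn:", sklearn.__version__)
print("joblib :", joblib.__version__)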

MSS
  • Can you share the scikit-learn version you're using? There seems to be some mismatch in the sklearn/python version. – durga_sury Apr 03 '23 at 15:03
  • scikit-learn==1.0.2, I ensured there is no mismatch .. also it worked when I removed the Imputer from both pipelines and made a one-step scaler/one-hot-encoder preprocessor; I just can't get it to work with a two-step pipeline, which is weird. – MSS Apr 04 '23 at 07:24
  • This looks to be a SKLearn issue and not a SageMaker one. @MSS are you able to get this to work outside of SageMaker locally? If so, what version of SKLearn are you using? – Marc Karp Apr 04 '23 at 16:44

0 Answers