
Here is basic code for training a model in TPOT:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

At the end, it scores the model on the test set without explicitly applying the transformations that were performed on the training set. A few questions here:

  1. Does the "tpot" model object automatically apply any scaling or other transformations when .score or .predict is called on new out-of-sample data?
  2. If not, what's the proper way to transform the test set before calling .score or .predict on it?

Please educate me if I'm completely misunderstanding this. Thank you.

aleinikov
  • With AutoML you don't need to worry about preprocessing; it is handled for you. Just call the `fit()` method, and then you can use the AutoML object to compute predictions. I can also recommend checking out MLJAR AutoML https://github.com/mljar/mljar-supervised – it has ML explanations and automatic documentation – pplonski Apr 23 '21 at 11:39

1 Answer


Does the "tpot" model object automatically apply any scaling or other transformations when .score or .predict is called on new out-of-sample data?

That depends on the final pipeline that TPOT chose. If the final pipeline includes any data scaling or transformation steps, then TPOT correctly applies those same operations inside the predict and score functions as well.

This is because, under the hood, TPOT is optimizing scikit-learn Pipeline objects.
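You can see this behavior with plain scikit-learn. The sketch below (using an arbitrary StandardScaler + LogisticRegression pipeline as a stand-in for whatever pipeline TPOT might produce) shows that calling .score on a fitted Pipeline runs the new data through every transformer step before it reaches the final estimator:

```python
# A fitted scikit-learn Pipeline applies its transformer steps
# automatically in .predict() and .score() -- this is the mechanism
# TPOT relies on.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=42)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)  # the scaler is fit on the training data only

# No manual scaling of X_test needed -- the pipeline scales it internally
# before passing it to the classifier.
print(pipe.score(X_test, y_test))
```

A pipeline TPOT returns works the same way, since it is itself a scikit-learn Pipeline.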

That said, if there are specific transformations to your data that you want to guarantee happen with your data, then you have a couple options:

  1. You can split your data into training and test, learn the transformation (e.g., StandardScaler) on the training set, then also apply it to your test set. You would do both of these operations before ever passing the data to TPOT.

  2. You can make use of TPOT's template functionality, which allows you to specify constraints on what the analysis pipeline should look like.
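Option 1 can be sketched with scikit-learn alone: fit the transformation on the training set only, apply it to both sets, and then hand the pre-transformed arrays to TPOT exactly as in the original snippet (the TPOT calls are shown as comments here since running TPOT is slow):

```python
# Option 1: learn the scaling on the training set, reuse it on the test
# set, and only then pass the data to TPOT.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, train_size=0.75, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics

# From here you would run TPOT on the pre-scaled data, e.g.:
#   tpot = TPOTClassifier(generations=5, population_size=50, random_state=42)
#   tpot.fit(X_train_scaled, y_train)
#   print(tpot.score(X_test_scaled, y_test))
print(X_test_scaled.shape)
```

For option 2, TPOT's `template` parameter accepts a string such as `'Transformer-Classifier'` that constrains the pipeline structure; check the TPOT documentation for the exact template strings your version supports.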

Randy Olson
  • So can I use tpot.predict() on brand new data which wasn't in train_test_split without having to transform it with all of the steps in the pipeline minus the estimator? Am I understanding this correctly: the predict method automatically performs these steps? Sorry if I'm asking the same thing a different way, just want to make sure I have this down. – aleinikov Apr 22 '21 at 16:47
  • Yes that's correct. Any transformations that TPOT learned to do after you passed the data into TPOT, it will apply in the `predict` function as well. – Randy Olson Apr 23 '21 at 15:16