
I am exploring the possibility of adding Featuretools to my pipeline so that I can create new features from my DataFrame.

Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, e.g. STD(column), I feel it is susceptible to data leakage. Their FAQ gives an example approach to tackle this, but it is not suitable for the Pipeline structure I am using.

Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem to be compatible with Pipelines. The idea would be to use each fold's training data to construct the features and then transform that fold's test data, K times; at the end, the whole data would be used for construction during the refit=True stage of GridSearchCV. If you have an example that contradicts this, you are very welcome to share it.

Idea 1: I can switch to a manual CV structure that is not embedded in a pipeline. Inside it, I can use the training data to construct the new features and transform the test data with them, K times. At the end, all of the data can be used to construct the final model.

This is the safest option, though it has time and complexity disadvantages. A sketch of what I mean follows.
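Roughly, the loop I have in mind would look like this (a sketch, not tested: `df`, `y`, and the `id` index column are placeholders for my own data, and the `add_dataframe`/`target_dataframe_name` calls follow the Featuretools 1.x API, where older versions use `entity_from_dataframe`/`target_entity` instead):

```python
import featuretools as ft
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# df: a single-table DataFrame with an "id" column; y: a target Series
# indexed by the same "id" values. Both are placeholders for my data.

def build_entityset(data):
    # A fresh EntitySet per fold, so feature definitions only ever
    # see the rows passed in here.
    es = ft.EntitySet(id="data")
    es = es.add_dataframe(dataframe_name="records", dataframe=data, index="id")
    return es

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    # Construct the feature definitions from the training fold only...
    es_train = build_entityset(df.iloc[train_idx])
    X_train, feature_defs = ft.dfs(entityset=es_train,
                                   target_dataframe_name="records")

    # ...then apply those same definitions to the held-out fold.
    es_test = build_entityset(df.iloc[test_idx])
    X_test = ft.calculate_feature_matrix(features=feature_defs,
                                         entityset=es_test)

    model = RandomForestClassifier(random_state=0)
    model.fit(X_train, y.loc[X_train.index])
    scores.append(accuracy_score(y.loc[X_test.index], model.predict(X_test)))

# After CV, the feature definitions and the model would be rebuilt on all data.
```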

Idea 2: Use it on the whole data and ignore the possibility of leakage. Of course I am not in favor of this. But when I look at the project's GitHub page, all the examples combine the train and test data and create the features from the whole dataset, then proceed with the train/test split for modeling.

https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb

If the developers of the project think that way, I could actually give it a chance with the whole data.
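In code, the whole-data approach from those examples would amount to something like this (same placeholder `df`/`y` and 1.x API as in the sketch above):

```python
import featuretools as ft
from sklearn.model_selection import train_test_split

es = ft.EntitySet(id="data")
es = es.add_dataframe(dataframe_name="records", dataframe=df, index="id")

# Features are defined on ALL rows first, then the split happens,
# so any aggregation has already seen the eventual test rows.
X, feature_defs = ft.dfs(entityset=es, target_dataframe_name="records")
X_train, X_test, y_train, y_test = train_test_split(
    X, y.loc[X.index], test_size=0.2, random_state=0)
```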

What do you think? I would love to hear about your experiences with Featuretools.

1 Answer

This question is a bit tricky to answer. There is a Featuretools DFSTransformer that can be used in pipelines. Unfortunately I don't think this can be used with GridSearchCV, because a Featuretools EntitySet does not currently implement any of the required methods to split the EntitySet automatically during the CV process. Here is the current DFSTransformer for reference: https://github.com/alteryx/featuretools-sklearn-transformer
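For reference, basic usage of the transformer looks roughly like this (a sketch based on the repository's README rather than something I have run here; the `target_dataframe_name` argument is from the 1.x transformer, older releases instead take `entityset=`/`target_entity=` in the constructor, and the fit/transform input format has changed between versions, so check the linked repo for your version):

```python
import featuretools as ft
from featuretools.wrappers import DFSTransformer  # featuretools-sklearn-transformer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline

# Demo EntitySet shipped with Featuretools, used here as stand-in data.
es = ft.demo.load_mock_customer(return_entityset=True)

pipeline = Pipeline(steps=[
    # DFS runs as a pipeline step, but the transformer operates on a whole
    # EntitySet; scikit-learn has no way to slice an EntitySet into CV
    # folds, which is why GridSearchCV does not fit here.
    ("ft", DFSTransformer(target_dataframe_name="customers")),
    ("clf", ExtraTreesClassifier(n_estimators=100)),
])
```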

Given that, I think either your Idea 1 or Idea 2 is the way to go here. I definitely would not suggest that you "ignore the leakage possibility", but depending on your data and the defined relationships, there may be some cases where you can create the feature matrix using all the data before you split into test/train sets. In other cases, you may need to manually do the splits and make sure you are not leaking information through the aggregations. For example, if you are trying to predict if an individual transaction is fraudulent for a customer, you would not want to include transactions that happen after the current row in any customer aggregations as that could be leaking future information into the current row. Using a cutoff time can help with situations like that, but leakage can happen if that is not done properly.
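For the transaction example, a cutoff time setup might look like this (a minimal sketch on the demo EntitySet; your own schema's dataframe and column names would replace `transactions`/`transaction_time`, and `target_dataframe_name` is again the 1.x argument name):

```python
import featuretools as ft
import pandas as pd

es = ft.demo.load_mock_customer(return_entityset=True)

# One cutoff time per transaction: its own timestamp. Rows with
# timestamps after a row's cutoff are excluded from that row's
# aggregations, so no future information leaks in.
cutoff_times = pd.DataFrame({
    "transaction_id": es["transactions"]["transaction_id"],
    "time": es["transactions"]["transaction_time"],
})

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="transactions",
    agg_primitives=["mean", "std"],
    cutoff_time=cutoff_times,
)
```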

It is a little tough to provide specific guidance without knowing the details of your problem, but I believe you are on the right track with your thought process.

Nate Parsons