How to join the result of ft.dfs to a test set?

Question

I know Featuretools has the ft.calculate_feature_matrix method, but it calculates the features from the test data itself. What I need is to compute the features on the training data and then join them to the test data, rather than recomputing the same features on the test data. For example, given this training data:

id sex score
1 f 100
2 f 200
3 m 10
4 m 20

After dfs, I get:

id sex score sex.mean(score)
1 f 100 150
2 f 200 150
3 m 10 15
4 m 20 15

I want to get this on the test set:

id sex score sex.mean(score)
5 f 30 150
6 f 40 150
7 m 50 15
8 m 60 15

not this:

id sex score sex.mean(score)
5 f 30 35
6 f 40 35
7 m 50 55
8 m 60 55

How can I achieve this? Thank you.

@SashaTsukanov `Test set shouldn't operate on values from train` That is not correct and has been discussed several times. For example: https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i — GuSuku, Aug 20 '19 at 04:44

Max Kanter · Answer 1 · 2018-10-10T14:16:14.543

Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time", and any data after that time is filtered out of the calculation. If we restructure your data and add in some time information, Featuretools can accomplish what you want.

First, let me create a DataFrame of people

import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})

which looks like this

   id sex
0   1   f
1   2   f
2   3   m
3   4   m
4   5   f
5   6   f
6   7   m
7   8   m

Then, let's create a separate DataFrame of scores where we annotate each score with the time it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for the training data and time 1 for the test data.

scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

which looks like this

   id  person_id  score  time
0   1          1    100     0
1   2          2    200     0
2   3          3     10     0
3   4          4     20     0
4   5          5     30     1
5   6          6     40     1
6   7          7     50     1
7   8          8     60     1
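
If your data has real timestamps rather than integer times, the same layout applies. Here is a minimal sketch of the scores table with a datetime column; the dates, the scores_dt name, and the cutoff value are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical dates: the first four scores are training data,
# the last four are test data recorded a month later.
scores_dt = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "time": pd.to_datetime(["2018-01-01"] * 4 + ["2018-02-01"] * 4),
    "score": [100, 200, 10, 20, 30, 40, 50, 60],
})

# Everything at or before this (assumed) cutoff counts as training data.
cutoff = pd.Timestamp("2018-01-01")
is_train = scores_dt["time"] <= cutoff
```

With a column like this, the cutoff would be a timestamp instead of the integer 0 used in the rest of this answer.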

Now, let's create an EntitySet in Featuretools, specifying the "time index" of the scores entity:

import featuretools as ft

es = ft.EntitySet('example')

es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')

es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index= "time")

# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")

# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)


es

Here is our entity set

Entityset: example
  Entities:
    scores [Rows: 8, Columns: 4]
    sexes [Rows: 2, Columns: 1]
    people [Rows: 8, Columns: 2]
  Relationships:
    scores.person_id -> people.id
    people.sex -> sexes.sex

Next, let's calculate the feature of interest. Notice that we use the cutoff_time argument to specify the last time at which data is allowed to be used for the calculation. This ensures none of our testing data is made available during the calculation.

from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)

The output is now

    sexes.MEAN(scores.score)
id
1                        150
2                        150
3                         15
4                         15
5                        150
6                        150
7                         15
8                         15
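
As a sanity check, the same numbers can be reproduced in plain pandas. This is only a sketch of what the cutoff does for this one feature, not how Featuretools computes it internally; the variable names (visible, sex_of_person, result) are my own.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

cutoff_time = 0

# Keep only the scores observed at or before the cutoff (the training rows).
visible = scores[scores["time"] <= cutoff_time]

# Look up each visible score's sex and average the scores per sex.
sex_of_person = people.set_index("id")["sex"]
mean_by_sex = visible.groupby(visible["person_id"].map(sex_of_person))["score"].mean()

# Broadcast the training-only means to every person, train and test alike.
result = sex_of_person.map(mean_by_sex).rename("sexes.MEAN(scores.score)")
print(result)
```

People 5 to 8 get 150 and 15 because their own scores fall after the cutoff and are never seen.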

This functionality is powerful because we can handle time in a more fine-grained manner than a single train/test split.
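
To illustrate what "more fine-grained" can mean, here is a plain pandas sketch where each row has its own cutoff time, so the feature only uses scores available at that row's time. The cutoffs table and the mean_for helper are hypothetical and only mimic the idea; this does not call Featuretools.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

# Hypothetical per-person cutoffs: people 1 and 3 are evaluated at time 0,
# people 5 and 7 at time 1.
cutoffs = pd.DataFrame({"id": [1, 3, 5, 7], "time": [0, 0, 1, 1]})

sex_of_person = people.set_index("id")["sex"]
scores_with_sex = scores.assign(sex=scores["person_id"].map(sex_of_person))

def mean_for(person_id, cutoff):
    """Mean score of this person's sex, using only scores at or before cutoff."""
    sex = sex_of_person[person_id]
    mask = (scores_with_sex["sex"] == sex) & (scores_with_sex["time"] <= cutoff)
    return scores_with_sex.loc[mask, "score"].mean()

cutoffs["mean_by_sex"] = [mean_for(i, t)
                          for i, t in zip(cutoffs["id"], cutoffs["time"])]
print(cutoffs)
```

Rows evaluated at time 0 see only the training scores, while rows evaluated at time 1 also see the later scores, so the same feature takes different values depending on when it is computed.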

For information on how time indexes work in Featuretools read the Handling Time page in the documentation.

EDIT

If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs:

feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list

This returns feature definitions that you can pass to ft.calculate_feature_matrix, together with a cutoff_time as above:

[<Feature: sex>,
 <Feature: MAX(scores.score)>,
 <Feature: STD(scores.time)>,
 <Feature: STD(scores.score)>,
 <Feature: COUNT(scores)>,
 <Feature: MAX(scores.time)>,
 <Feature: sexes.STD(scores.score)>,
 <Feature: sexes.COUNT(people)>,
 <Feature: sexes.STD(scores.time)>,
 <Feature: sexes.MAX(scores.score)>,
 <Feature: sexes.MAX(scores.time)>,
 <Feature: sexes.COUNT(scores)>]
