3

I know featuretools has the ft.calculate_feature_matrix method, but it calculates the features from the test data itself. What I need is to compute a feature on the training data and then join it onto the test data, rather than recomputing the same feature on the test data. For example, given this training data:

id sex score
1 f 100
2 f 200
3 m 10
4 m 20

After DFS, I get:

id sex score sex.mean(score)
1 f 100 150
2 f 200 150
3 m 10 15
4 m 20 15

I want to get this on the test set:

id sex score sex.mean(score)
5 f 30 150
6 f 40 150
7 m 50 15
8 m 60 15

rather than this:

id sex score sex.mean(score)
5 f 30 35
6 f 40 35
7 m 50 55
8 m 60 55

How can I achieve this? Thank you.

Sasha Tsukanov
Z.jiasen
    @SashaTsukanov `Test set shouldn't operate on values from train` That is not correct and has been discussed several times. For example: https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i – GuSuku Aug 20 '19 at 04:44

1 Answer

2

Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time"; any data that occurs after that time is filtered out of the calculation. If we restructure your data and add some time information, Featuretools can accomplish what you want.

First, let me create a DataFrame of people

import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})

which looks like this

   id sex
0   1   f
1   2   f
2   3   m
3   4   m
4   5   f
5   6   f
6   7   m
7   8   m

Then, let's create a separate DataFrame of scores where we annotate each score with the time it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for the training data and time 1 for the test data.

scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

which looks like this

   id  person_id  score  time
0   1          1    100     0
1   2          2    200     0
2   3          3     10     0
3   4          4     20     0
4   5          5     30     1
5   6          6     40     1
6   7          7     50     1
7   8          8     60     1

Now, let's create an EntitySet in Featuretools specifying the "time index" in the scores entity

import featuretools as ft

es = ft.EntitySet('example')

es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')

es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index="time")

# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")

# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)


es

Here is our entity set

Entityset: example
  Entities:
    scores [Rows: 8, Columns: 4]
    sexes [Rows: 2, Columns: 1]
    people [Rows: 8, Columns: 2]
  Relationships:
    scores.person_id -> people.id
    people.sex -> sexes.sex

Next, let's calculate the feature of interest. Notice that we use the cutoff_time argument to specify the last time at which data may be used for the calculation. This ensures that none of our test data is available during the calculation.

from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)

The output is now

    sexes.MEAN(scores.score)
id
1                        150
2                        150
3                         15
4                         15
5                        150
6                        150
7                         15
8                         15
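
To make the effect of the cutoff concrete, here is a minimal sketch in plain pandas that reproduces the same numbers by hand. It only illustrates the idea and is not how Featuretools computes the feature internally; the names cutoff_time, train_scores, mean_by_sex and result are chosen for this illustration.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ["f", "f", "m", "m", "f", "f", "m", "m"]})
scores = pd.DataFrame({"person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

cutoff_time = 0

# keep only the scores observed at or before the cutoff (the training rows)
train_scores = scores[scores["time"] <= cutoff_time]

# attach each person's sex, then average the training scores per sex
train_scores = train_scores.merge(people, left_on="person_id", right_on="id")
mean_by_sex = train_scores.groupby("sex")["score"].mean()

# look up the training-only mean for every person, train and test alike
result = people.set_index("id")["sex"].map(mean_by_sex)
print(result)
```

Because the means are computed from the time 0 rows only, people 5 to 8 receive the training means rather than means of their own scores.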

This functionality is powerful because we can handle time in a more fine-grained manner than a single train/test split.
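
As a rough illustration of what a finer-grained treatment of time could look like, the sketch below gives each person their own cutoff, so every row only sees scores recorded at or before its own time. It is plain Python rather than Featuretools code, and the helper mean_score_as_of and the per-person cutoffs are hypothetical names and values made up for this example.

```python
from statistics import mean

# (person_id, sex, time, score) rows, same values as the example above
rows = [(1, "f", 0, 100), (2, "f", 0, 200), (3, "m", 0, 10), (4, "m", 0, 20),
        (5, "f", 1, 30), (6, "f", 1, 40), (7, "m", 1, 50), (8, "m", 1, 60)]


def mean_score_as_of(sex, cutoff):
    """Mean score for one sex, using only rows with time <= cutoff."""
    return mean(score for _, s, t, score in rows if s == sex and t <= cutoff)


# hypothetical per-person cutoffs: each person's own observation time
cutoffs = {pid: t for pid, _, t, _ in rows}
sex_of = {pid: s for pid, s, _, _ in rows}

# one feature value per person, each computed as of that person's cutoff
per_row = {pid: mean_score_as_of(sex_of[pid], cutoffs[pid]) for pid in cutoffs}
print(per_row)
```

With these cutoffs, people 1 to 4 still see only the time 0 scores, while people 5 to 8 also see the time 1 scores, so their values differ from the single-cutoff result above.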

For information on how time indexes work in Featuretools, read the Handling Time page in the documentation.

EDIT

If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs

feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list

This returns feature definitions that you can pass to ft.calculate_feature_matrix:

[<Feature: sex>,
 <Feature: MAX(scores.score)>,
 <Feature: STD(scores.time)>,
 <Feature: STD(scores.score)>,
 <Feature: COUNT(scores)>,
 <Feature: MAX(scores.time)>,
 <Feature: sexes.STD(scores.score)>,
 <Feature: sexes.COUNT(people)>,
 <Feature: sexes.STD(scores.time)>,
 <Feature: sexes.MAX(scores.score)>,
 <Feature: sexes.MAX(scores.time)>,
 <Feature: sexes.COUNT(scores)>]
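
For comparison, if all you need is a handful of per-category statistics fitted on the training rows and joined onto the test rows, the same idea can be sketched directly in pandas. This is an alternative to DFS, not what ft.dfs does internally; the names train, test, stats and test_features, and the column naming scheme, are made up for the example.

```python
import pandas as pd

train = pd.DataFrame({"id": [1, 2, 3, 4], "sex": ["f", "f", "m", "m"],
                      "score": [100, 200, 10, 20]})
test = pd.DataFrame({"id": [5, 6, 7, 8], "sex": ["f", "f", "m", "m"],
                     "score": [30, 40, 50, 60]})

# fit the per-sex statistics on the training rows only
stats = train.groupby("sex")["score"].agg(["count", "std", "max"])
stats.columns = [f"sex.{name.upper()}(score)" for name in stats.columns]

# join the training statistics onto the test rows by sex
test_features = test.join(stats, on="sex")
print(test_features)
```

Adding another statistic means adding a name to the agg list, and another category column means repeating the groupby and join with that column.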

Read more about DFS in this write-up

Max Kanter
  • Thank you, I got it working by following your guidance. – Z.jiasen Oct 09 '18 at 08:58
  • Also, if I need MAX, STD, and so on computed this way, for more than just the 'score' column and more than just the 'sex' category, does that mean I must create many entities, relationships, and ft.Feature(Mean(es["xxx"]["x"], es["yyy"]), es["zzz"]) definitions? – Z.jiasen Oct 09 '18 at 09:01
  • added to the end of my answer to show how to call Deep Feature Synthesis to automatically generate features for you. – Max Kanter Oct 10 '18 at 14:17