Featuretools works best with data that has been annotated directly with time information to handle cases like this. Then, when calculating your features, you specify a "cutoff time" after which data is filtered out. If we restructure your data and add some time information, Featuretools can accomplish what you want.
First, let me create a DataFrame of people
import pandas as pd
people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ['f', 'f', 'm', 'm', 'f', 'f', 'm', 'm']})
which looks like this
id sex
0 1 f
1 2 f
2 3 m
3 4 m
4 5 f
5 6 f
6 7 m
7 8 m
Then, let's create a separate DataFrame of scores where we annotate each score with the time it occurred. This can be either a datetime or an integer. For simplicity in this example, I'll use time 0 for the training data and time 1 for the test data.
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})
which looks like this
id person_id score time
0 1 1 100 0
1 2 2 200 0
2 3 3 10 0
3 4 4 20 0
4 5 5 30 1
5 6 6 40 1
6 7 7 50 1
7 8 8 60 1
Now, let's create an EntitySet in Featuretools specifying the "time index" in the scores entity
import featuretools as ft
es = ft.EntitySet('example')
es.entity_from_dataframe(dataframe=people,
                         entity_id='people',
                         index='id')
es.entity_from_dataframe(dataframe=scores,
                         entity_id='scores',
                         index='id',
                         time_index="time")
# create a sexes entity
es.normalize_entity(base_entity_id="people", new_entity_id="sexes", index="sex")
# add relationship for scores to person
scores_relationship = ft.Relationship(es["people"]["id"],
                                      es["scores"]["person_id"])
es = es.add_relationship(scores_relationship)
es
Here is our entity set
Entityset: example
Entities:
scores [Rows: 8, Columns: 4]
sexes [Rows: 2, Columns: 1]
people [Rows: 8, Columns: 2]
Relationships:
scores.person_id -> people.id
people.sex -> sexes.sex
Next, let's calculate the feature of interest. Notice that we use the cutoff_time argument to specify the last time at which data is allowed to be used for the calculation. This ensures none of our testing data is made available during the calculation.
from featuretools.primitives import Mean
mean_by_sex = ft.Feature(Mean(es["scores"]["score"], es["sexes"]), es["people"])
ft.calculate_feature_matrix(entityset=es, features=[mean_by_sex], cutoff_time=0)
The output is now
sexes.MEAN(scores.score)
id
1 150
2 150
3 15
4 15
5 150
6 150
7 15
8 15
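As a sanity check on those numbers, here is a minimal pandas-only sketch of what the cutoff does conceptually. It does not call Featuretools at all; it assumes the cutoff is inclusive (scores with time <= 0 are kept), which matches the output above. The names visible, mean_by_sex and check are just for illustration.

```python
import pandas as pd

people = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "sex": ["f", "f", "m", "m", "f", "f", "m", "m"]})
scores = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "person_id": [1, 2, 3, 4, 5, 6, 7, 8],
                       "time": [0, 0, 0, 0, 1, 1, 1, 1],
                       "score": [100, 200, 10, 20, 30, 40, 50, 60]})

cutoff_time = 0  # assumption: inclusive cutoff, as in the example above

# Keep only scores recorded at or before the cutoff (the training rows).
visible = scores[scores["time"] <= cutoff_time]

# Attach each visible score to its person's sex, then average per sex.
visible = visible.merge(people, left_on="person_id", right_on="id",
                        suffixes=("_score", "_person"))
mean_by_sex = visible.groupby("sex")["score"].mean()

# Broadcast the per-sex mean back to every person, including the test rows.
check = people.set_index("id")["sex"].map(mean_by_sex)
print(check)
```

People 5 through 8 still get a value even though their own scores fall after the cutoff, because the mean is computed only from the training scores of people with the same sex.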
This functionality is powerful because we can handle time in a more fine-grained manner than a single train / test split.
For information on how time indexes work in Featuretools, read the Handling Time page in the documentation.
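To illustrate what "more fine-grained" can mean, here is a rough plain-Python sketch where each person gets their own cutoff instead of one global split. It does not use Featuretools; the helper mean_score_by_sex and the cutoffs dictionary are hypothetical names made up for this example, and the cutoff is again assumed to be inclusive.

```python
from statistics import mean

sex_of = {1: "f", 2: "f", 3: "m", 4: "m", 5: "f", 6: "f", 7: "m", 8: "m"}

# Each score is (person_id, time, score), same data as above.
scores = [(1, 0, 100), (2, 0, 200), (3, 0, 10), (4, 0, 20),
          (5, 1, 30), (6, 1, 40), (7, 1, 50), (8, 1, 60)]


def mean_score_by_sex(person_id, cutoff):
    """Mean score for person_id's sex, using only scores with time <= cutoff."""
    sex = sex_of[person_id]
    visible = [s for pid, t, s in scores if t <= cutoff and sex_of[pid] == sex]
    return mean(visible) if visible else None


# Hypothetical per-person cutoffs: person 1 is evaluated at time 0,
# person 5 at time 1, so person 5 may see the later scores as well.
cutoffs = {1: 0, 5: 1}
result = {pid: mean_score_by_sex(pid, t) for pid, t in cutoffs.items()}
print(result)
```

Person 1 only sees the two female scores from time 0, while person 5 sees all four female scores, so the same feature takes a different value depending on when it is computed.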
EDIT
If you want to automatically define many features, you can use Deep Feature Synthesis by calling ft.dfs
feature_list = ft.dfs(target_entity="people",
                      entityset=es,
                      agg_primitives=["count", "std", "max"],
                      features_only=True)
feature_list
This returns feature definitions that you can pass to ft.calculate_feature_matrix
[<Feature: sex>,
<Feature: MAX(scores.score)>,
<Feature: STD(scores.time)>,
<Feature: STD(scores.score)>,
<Feature: COUNT(scores)>,
<Feature: MAX(scores.time)>,
<Feature: sexes.STD(scores.score)>,
<Feature: sexes.COUNT(people)>,
<Feature: sexes.STD(scores.time)>,
<Feature: sexes.MAX(scores.score)>,
<Feature: sexes.MAX(scores.time)>,
<Feature: sexes.COUNT(scores)>]
Read more about DFS in this write-up