If I understand your question correctly, what you are trying to do is aggregate player stats from the player_stats
table to the team_stats
table. One way to do this would be to create a relationship between those two tables and use various aggregation primitives.
That relationship can be defined by the combination of (team_id, game_id)
in the two tables. Unfortunately, Featuretools does not support defining relationships using multiple columns in one table, so you probably need to create a new column that can be used to define the relationship.
Here is one way you could do this:
import pandas as pd
import featuretools as ft
# Define team stats
team_stats = pd.DataFrame({
"id": [0, 1, 2, 3],
"team_id": [100, 200, 100, 200],
"game_id": [0, 0, 1, 1],
})
# Define player stats
player_stats = pd.DataFrame({
"id": [0, 1, 2, 3, 4, 5, 6, 7],
"team_id": [100, 100, 200, 200, 100, 100, 200, 200],
"game_id": [0, 0, 0, 0, 1, 1, 1, 1],
"player_id": [0, 1, 2, 3, 0, 1, 2, 3],
"goals": [1, 2, 0, 1, 2, 0, 1, 2],
"minutes_played": [30, 45, 10, 50, 10, 40, 25, 45],
})
# Create new index columns for use in relationship
team_stats["created_idx"] = team_stats["team_id"].astype("string") + "-" + team_stats["game_id"].astype("string")
player_stats["created_idx"] = player_stats["team_id"].astype("string") + "-" + player_stats["game_id"].astype("string")
# Drop these columns from the players table since we no longer need them in this example.
# This prevents Featuretools from generating features from these numeric columns
player_stats = player_stats.drop(columns=["team_id", "game_id"])
# Create the EntitySet and add dataframes
es = ft.EntitySet()
es.add_dataframe(dataframe=team_stats, dataframe_name="teams", index="created_idx")
es.add_dataframe(dataframe=player_stats, dataframe_name="players", index="id")
# Add the relationship
es.add_relationship("teams", "created_idx", "players", "created_idx")
# Run DFS using the `Sum` aggregation primitive
# Use ignore columns to prevent generation of features from player_id
fm, feautres = ft.dfs(entityset=es,
target_dataframe_name="teams",
agg_primitives=["sum"],
ignore_columns={"players": ["player_id"]})
This will generate a dataframe like this:
id team_id game_id SUM(players.goals) SUM(players.minutes_played) SUM(players.player_id)
created_idx
100-0 0 100 0 3.0 75.0 1.0
200-0 1 200 0 1.0 60.0 5.0
100-1 2 100 1 2.0 50.0 1.0
200-1 3 200 1 3.0 70.0 5.0