I am trying to count the number of first-place finishes each runner has had prior to their current race, but a ValueError occurs when I run the code.
I'm loading a series of races and runners into Featuretools.
    Entity: races
      Variables:
        id (dtype: index)
        race_date (dtype: datetime)
        racecard_date (dtype: datetime)
        runner_id (dtype: id)
        lane_number (dtype: numeric)
        place (dtype: numeric)

    Entity: runners
      Variables:
        id (dtype: index)
        name (dtype: text)
For the races, some fields are known before the race starts (e.g. lane_number) and some fields are only known after the race finishes (e.g. place). The cutoff_times DataFrame, passed as the cutoff_time parameter, is built from the id and race_date columns to separate these two sets of fields.
    cutoff_times = (df_races
                    .reset_index()
                    .filter(['id', 'race_date']))
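For illustration, here is a minimal sketch of what cutoff_times ends up containing, assuming a small made-up df_races indexed by id (the rows and values are hypothetical, not my real data):

```python
import pandas as pd

# Hypothetical stand-in for df_races, indexed by race id.
df_races = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "race_date": pd.to_datetime(["2020-01-01", "2020-01-08", "2020-01-15"]),
        "lane_number": [4, 2, 4],
        "place": [1, 2, 1],
    }
).set_index("id")

# Same expression as above: one row per race, holding the race id
# and the time at which features for that race may be computed.
cutoff_times = (df_races
                .reset_index()
                .filter(['id', 'race_date']))

print(cutoff_times)
```

Each row pairs a race id with its race_date, which is the time used as the cutoff for that race.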
Setting a seed feature that defines a first place finish:
    first_place = (ft.Feature(es['races']['place']) == 1)
Running:
    df_features, feature_names = ft.dfs(
        entityset=es,
        target_entity='races',
        cutoff_time=cutoff_times,
        cutoff_time_in_index=True,
        agg_primitives=['num_true'],
        seed_features=[first_place],
        max_depth=3,
    )
Results in:
    ValueError: Cannot convert non-finite values (NA or inf) to integer
I think this is because, for each computed race, ft.Feature(es['races']['place']) equals NaN for its most recent value, since place is only accessible after the race. Since we are then comparing NaN == 1, the code fails.
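A quick check in plain pandas suggests the comparison itself may not be the step that raises. With made-up values, NaN == 1 simply evaluates to False, while casting a column that contains NaN to an integer dtype produces the same message as above. This is only a sketch: it does not show where, or whether, Featuretools performs such a cast internally.

```python
import numpy as np
import pandas as pd

# Hypothetical place values; NaN stands for a race whose result is
# hidden by the cutoff time.
place = pd.Series([1.0, 3.0, np.nan], name="place")

# Comparing against 1 does not raise: NaN == 1 is just False.
is_first = place == 1
print(is_first.tolist())

# Casting the float column (with NaN) to int is what raises ValueError.
try:
    place.astype(int)
    cast_error = None
except ValueError as exc:
    cast_error = str(exc)

print(cast_error)
```

If that is what happens here, the failure would come from an integer conversion of a value that is NaN at the cutoff time rather than from the equality test.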
Is there any way to make this work, short of manually appending a helper column called first_place to df_races before loading the Pandas DataFrame into Featuretools? I would prefer to have everything done in Featuretools.
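For reference, the manual workaround I would like to avoid looks roughly like this in plain pandas. It is a sketch on made-up data: prior_first_places is a hypothetical column name, and races that share a race_date are not treated specially.

```python
import pandas as pd

# Hypothetical races table, indexed by race id.
df_races = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "race_date": pd.to_datetime(
            ["2020-01-01", "2020-01-08", "2020-01-08", "2020-01-15", "2020-01-15"]
        ),
        "runner_id": [10, 10, 20, 10, 20],
        "place": [1, 2, 1, 1, 3],
    }
).set_index("id")

# Process races in chronological order so the running count is meaningful.
df_races = df_races.sort_values("race_date", kind="stable")

# Helper column: did the runner finish first in this race?
wins = df_races["place"].eq(1).astype(int)
df_races["first_place"] = wins.astype(bool)

# Running count of wins per runner, minus the current race, so each row
# only counts first places from earlier races.
df_races["prior_first_places"] = (
    wins.groupby(df_races["runner_id"]).cumsum() - wins
)

print(df_races[["runner_id", "race_date", "place", "prior_first_places"]])
```

Subtracting the current race's own result keeps the post-race place value from leaking into that race's feature, which is the job I was hoping cutoff_time would do for me.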