How to write seed_features that work with cutoff_time

Question

I am trying to count the number of first-place finishes each runner has had before their current race. However, a ValueError occurs when I run the code.

I'm loading a series of races and runners into Featuretools.

Entity: races
  Variables:
    id (dtype: index)
    race_date (dtype: datetime)
    racecard_date (dtype: datetime)
    runner_id (dtype: id)
    lane_number (dtype: numeric)
    place (dtype: numeric)

Entity: runners
  Variables:
    id (dtype: index)
    name (dtype: text)

For the races, some fields are known before the race starts (e.g. lane_number) and some are only known after the race finishes (e.g. place). To separate these two sets of fields, cutoff_times is built from the id and race_date columns.

cutoff_times = (df_races
                .reset_index()
                .filter(['id', 'race_date']))

Setting a seed feature that defines a first place finish:

import featuretools as ft
first_place = ft.Feature(es['races']['place']) == 1

Running:

df_features, feature_names = ft.dfs(
    entityset=es,
    target_entity='races',
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,
    agg_primitives=['num_true'],
    seed_features=[first_place],
    max_depth=3,
)

Results in:

ValueError: Cannot convert non-finite values (NA or inf) to integer

I think this is because, for each computed race, ft.Feature(es['races']['place']) equals NaN for its most recent value, since place is only accessible after the race. Since we are comparing NaN == 1, the code fails.
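To narrow this down, here is a minimal pandas-only sketch with made-up values. It suggests the comparison itself is probably not what raises: in plain pandas, NaN == 1 just evaluates to False. The same error message does appear when a float column containing NaN is cast to an integer dtype. Whether Featuretools performs such a cast internally for this feature is an assumption on my part, not something confirmed here.

```python
import numpy as np
import pandas as pd

# Hypothetical finishing positions; NaN stands in for a result hidden by the cutoff time.
place = pd.Series([1.0, 3.0, np.nan])

# Comparing with NaN does not raise; the missing entry simply compares as False.
is_first = place == 1

# Casting a float column that contains NaN to an integer dtype does raise a ValueError.
try:
    place.astype("int64")
    cast_error = None
except ValueError as exc:
    cast_error = str(exc)

print(is_first.tolist())
print(cast_error)
```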

Is there any way to make this work, short of manually appending a helper column called first_place to df_races before loading the Pandas DataFrame into Featuretools? I would prefer to have everything done in Featuretools.
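For reference, the manual workaround I would like to avoid looks roughly like this. The sample rows are hypothetical and only the column names come from the schema above.

```python
import pandas as pd

# Hypothetical sample standing in for df_races; column names follow the schema above.
df_races = pd.DataFrame({
    "id": [1, 2, 3],
    "race_date": pd.to_datetime(["2019-01-01", "2019-02-01", "2019-03-01"]),
    "runner_id": [10, 10, 11],
    "place": [1, 2, None],  # None becomes NaN: result not yet known
})

# Missing places compare as False, so the helper column is a plain boolean.
df_races["first_place"] = df_races["place"].eq(1)

print(df_races[["id", "place", "first_place"]])
```

The helper column would then be loaded as an ordinary boolean variable and aggregated with num_true, at the cost of doing part of the feature engineering outside Featuretools.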

this should be something that FT can handle. just to confirm, is the issue simply that FT errors when it does the `NaN == 1` calculation? if so, that is a bug and we (the core developers) will fix it. — Max Kanter, Aug 22 '19 at 19:54
@MaxKanter Yes, exactly! Since `cutoff_times` effectively hides the current value as `NaN` as each row is processed, `ft.Feature(es['races']['place'])` will always contain a `NaN` and will fail when compared to `1`. Maybe a `dropna` or a `fillna` can be applied before the comparison. — Timothy, Aug 26 '19 at 02:37
got it. we have a PR up on GitHub that should fix. will try to get it in the next release. If you'd like, you can install that branch directly and test. If it works, it'd be helpful for you to let us know on GitHub in the comments. Here's the PR: https://github.com/Featuretools/featuretools/pull/504 — Max Kanter, Aug 26 '19 at 12:59
@MaxKanter Thanks for fixing the problem and the amazingly fast turnaround. I've commented on the PR that the issue is now resolved. Keep up the good work on Featuretools! — Timothy, Aug 26 '19 at 14:18
