Restricting feature generation to a particular entity in FeatureTools

Question

I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) to include only a certain entity. Based on the docs I should be using include_entities:

List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).

Simple case

Here's some example code:

import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options=None):
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count",GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options,
        max_depth=4,
        features_only=True
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)

This produces:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions) > 0>]

Suppose I'm interested in the sessions and transactions counts, and in whether the sessions count was larger than 0. Based on the docs I'd go for include_entities here:

run_dfs(esd1, primitive_options={
          "greater_than_scalar":{
              "include_entities":['sessions']}
        })

The output from this, however, is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>]

Both GreaterThanScalar features are gone now. If I use ignore_entities instead I get:

run_dfs(esd1, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
            }
        })

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions) > 0>]

So it works, but I'm not sure why ignore_entities gives the result I need and include_entities does not. Am I missing something?

More complex case

Although I sort of got the simple case to work, what I really want is a bit more complicated. I'd like to get a boolean feature that tells me whether there were more than zero sessions on a particular device.

To do this:

esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)

yielding:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions) > 0>,
 <Feature: COUNT(sessions) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(sessions WHERE device = desktop) > 0>,
 <Feature: COUNT(sessions WHERE device = tablet) > 0>,
 <Feature: COUNT(sessions WHERE device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]

The features I need are the 4th to 6th counting from the bottom. If I try to restrict dfs to the sessions entity and the device variable:

run_dfs(esd2, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions"],
                "include_variables":{"sessions":["device"]}
            }
        })

the result is:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>]

No GreaterThanScalar features.

Is there a way to make dfs give me just the three GreaterThanScalar features I want here?

Update: Third case

Is there a way to limit what gets counted under where? For example:

esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()

run_dfs(esd3, primitive_options={
            "greater_than_scalar":{
                "ignore_entities":["transactions","sessions"],
            },
            "count":{
                "ignore_variables":{"transactions":['session_id']}
            }
        })

gives:

[<Feature: zip_code>,
 <Feature: COUNT(sessions)>,
 <Feature: COUNT(transactions)>,
 <Feature: COUNT(sessions WHERE device = desktop)>,
 <Feature: COUNT(sessions WHERE device = tablet)>,
 <Feature: COUNT(sessions WHERE device = mobile)>,
 <Feature: COUNT(transactions WHERE sessions.device = mobile)>,
 <Feature: COUNT(transactions WHERE products.brand = B)>,
 <Feature: COUNT(transactions WHERE sessions.device = tablet)>,
 <Feature: COUNT(transactions WHERE products.brand = A)>,
 <Feature: COUNT(transactions WHERE sessions.device = desktop)>]

Is it possible to limit the COUNT(transactions WHERE ...) features to only products? I'd still want to keep the COUNT(sessions ...) features.

score 3 · Accepted Answer · answered Jun 24 '20 at 15:13


Adding 'session_id' from the 'sessions' entity to the include_variables option will generate the features you're looking for:

primitive_options={
    "greater_than_scalar":{
         "ignore_entities":["transactions"],
         "include_variables":{"sessions":["session_id", "device"]}}}

The Count primitive uses the entity index as its base, as well as any where columns. If you only include the where column in the GreaterThanScalar primitive options, dfs ends up ignoring all the Count features for GreaterThanScalar because they all use an implicitly ignored column (the entity index). In this case, the desired Count features use the 'sessions' entity, so adding the 'sessions' entity index ('session_id') to the include_variables option allows the desired features to be generated.

Also, in the first example using include_entities, the GreaterThanScalar features are lost because the 'customers' entity (the target entity) isn't included. The Count features are all aggregation features in the 'customers' entity; they represent the count of something for each customer. In order to use the Count features, the GreaterThanScalar primitive needs to be allowed to use both the 'customers' entity, where the Count features are located, and the entity that the desired Count feature is based on ('sessions' in this case).
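
Following that explanation, the include_entities variant of the first example would presumably need to list both entities. This is a sketch only: the name include_opts is made up for illustration, and I haven't run it against version 0.16, so the resulting feature list is not shown.

```python
# Hypothetical options dict (the name `include_opts` is illustrative only).
# Per the explanation above, GreaterThanScalar is allowed to use the target
# entity ('customers') as well as the entity the Count feature is based on.
include_opts = {
    "greater_than_scalar": {
        "include_entities": ["customers", "sessions"],
    }
}

# With the question's helper this would be passed as:
#     run_dfs(esd1, primitive_options=include_opts)
```

Since 'transactions' is not listed, COUNT(transactions) > 0 should be left out in the same way it is with ignore_entities.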

answered Jun 24 '20 at 15:13

Frances Hartwell


Thank you for the detailed explanation, Frances! This is a bit more subtle than the documentation lets on. If I add `session_id` to `include_variables` I still get `<Feature: COUNT(sessions) > 0>` as a feature. Any way to exclude that? – numentar Jun 24 '20 at 18:00
I added a third case to the original question about limiting what gets included in `where` features. – numentar Jun 24 '20 at 19:17

Currently, there isn't a way to require that specific variables be present, so I don't believe there's an easy way to prune `<Feature: COUNT(sessions) > 0>` using `primitive_options`. However, `dfs` has the optional parameters `drop_contains` and `drop_exact` that might suit your needs. – Frances Hartwell Jun 25 '20 at 11:22

Similarly for your third case, the logic is a bit too complicated. One thing you could try, however, is to use two separate instantiated `count` primitives each with a different set of options to limit what variables/entities they can/can't act on. For example, one could ignore the 'transactions' entity entirely while the other includes the 'transactions' entity along with whatever additional restrictions you want to place on the primitive. – Frances Hartwell Jun 25 '20 at 11:39
Separate `count` primitives is a very interesting approach. I'll look into that. And thanks again for your help Frances. I realize that not all cases can or need to be handled via `primitive_options`. Filtering by `drop_...` or through postprocessing is also an option. Just wanted to make sure that I'm not missing anything. – numentar Jun 25 '20 at 19:59
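
For the postprocessing route mentioned in the last comment, a minimal sketch could filter on feature names after dfs has run. The helper name select_device_flags and the regular expression are assumptions made for illustration, not part of FeatureTools. The sketch works on plain name strings so it runs on its own; with real feature objects the names would come from each feature's get_name() method.

```python
import re

# Hypothetical helper: keep only the per-device "more than zero sessions" flags.
DEVICE_FLAG = re.compile(r"^COUNT\(sessions WHERE device = \w+\) > 0$")


def select_device_flags(feature_names):
    """Return the names that look like COUNT(sessions WHERE device = X) > 0."""
    return [name for name in feature_names if DEVICE_FLAG.match(name)]


# Names copied from the feature list in the question's second case.
names = [
    "zip_code",
    "COUNT(sessions)",
    "COUNT(sessions) > 0",
    "COUNT(sessions WHERE device = desktop)",
    "COUNT(sessions WHERE device = desktop) > 0",
    "COUNT(sessions WHERE device = tablet) > 0",
    "COUNT(sessions WHERE device = mobile) > 0",
    "COUNT(transactions WHERE sessions.device = mobile) > 0",
]

selected = select_device_flags(names)
print(selected)
```

With a list of feature definitions the same idea would be something like [f for f in feature_defs if DEVICE_FLAG.match(f.get_name())], which keeps the three device flags and drops everything else, including COUNT(sessions) > 0.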
