I'm trying to understand how to specify primitive_options in FeatureTools (version 0.16) to include only a certain entity. Based on the docs, I should be using include_entities:
List of entities to be included when creating features for the primitive(s). All other entities will be ignored (list[str]).
Simple case
Here's some example code:
import pprint

import featuretools as ft
from featuretools.primitives import GreaterThanScalar

esd1 = ft.demo.load_mock_customer(return_entityset=True)

def run_dfs(esd, primitive_options=None):
    # Print the feature definitions DFS generates under the given primitive options.
    feature_defs = ft.dfs(
        entityset=esd,
        target_entity="customers",
        agg_primitives=["count"],
        where_primitives=["count", GreaterThanScalar(value=0)],
        trans_primitives=[GreaterThanScalar(value=0)],
        primitive_options=primitive_options or {},
        max_depth=4,
        features_only=True,
    )
    pprint.pprint(feature_defs)

run_dfs(esd1)
This produces:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions) > 0>]
Suppose I'm interested in the sessions and transactions counts, and in whether the sessions count was greater than 0. Based on the docs, I'd go for include_entities here:
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "include_entities": ["sessions"],
    }
})
The output from this, however, is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>]
Both GreaterThanScalar features are gone now. If I use ignore_entities instead, I get:
run_dfs(esd1, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
    }
})
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions) > 0>]
So it works, but I'm not sure why ignore_entities gives the result I need and include_entities does not. Am I missing something?
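To compare two runs without eyeballing the printed lists, the feature names can be diffed. This is only a rough sketch: diff_feature_names is a hypothetical helper of my own, not part of FeatureTools, and it works on plain name strings (which I assume can be collected with f.get_name() on each feature returned by ft.dfs(..., features_only=True)). The names below are copied from the outputs above.

```python
def diff_feature_names(baseline, restricted):
    """Return (removed, added) feature names between two runs, keeping order."""
    baseline_set = set(baseline)
    restricted_set = set(restricted)
    removed = [n for n in baseline if n not in restricted_set]
    added = [n for n in restricted if n not in baseline_set]
    return removed, added


# Feature names from the unrestricted run and the include_entities run.
baseline = [
    "zip_code",
    "COUNT(sessions)",
    "COUNT(transactions)",
    "COUNT(sessions) > 0",
    "COUNT(transactions) > 0",
]
with_include = ["zip_code", "COUNT(sessions)", "COUNT(transactions)"]

removed, added = diff_feature_names(baseline, with_include)
print(removed)  # ['COUNT(sessions) > 0', 'COUNT(transactions) > 0']
```

This only shows which features an option removed; it does not explain why include_entities and ignore_entities behave differently.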
More complex case
Although I sort of got the simple case to work, what I really want is a bit more complicated. I'd like to get a boolean feature that tells me whether there were more than zero sessions on a particular device.
To do this, I add interesting values to sessions:
esd2 = ft.demo.load_mock_customer(return_entityset=True)
esd2['sessions'].add_interesting_values()
run_dfs(esd2)
yielding:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions) > 0>,
<Feature: COUNT(sessions) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(sessions WHERE device = desktop) > 0>,
<Feature: COUNT(sessions WHERE device = tablet) > 0>,
<Feature: COUNT(sessions WHERE device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = tablet) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = mobile) > 0>,
<Feature: COUNT(transactions WHERE sessions.device = desktop) > 0>]
The features I need are the 4th to 6th counting from the bottom. If I try to restrict dfs to the sessions entity and the device variable:
run_dfs(esd2, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions"],
        "include_variables": {"sessions": ["device"]},
    }
})
the result is:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>]
No GreaterThanScalar features.
Is there a way to make dfs give me just the three GreaterThanScalar features I want here?
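As a stopgap, I could generate everything and then filter the feature list by name. The sketch below is my own workaround rather than a primitive_options solution: select_by_name and DEVICE_FLAG are hypothetical names, and the pattern assumes the feature names look exactly like the ones printed above.

```python
import re

# Matches names such as "COUNT(sessions WHERE device = tablet) > 0".
DEVICE_FLAG = re.compile(r"^COUNT\(sessions WHERE device = \w+\) > 0$")


def select_by_name(features, pattern, name_of=str):
    """Keep only the features whose name matches the compiled pattern."""
    return [f for f in features if pattern.search(name_of(f))]


# Demonstrated on plain strings copied from the output above.
names = [
    "COUNT(sessions) > 0",
    "COUNT(sessions WHERE device = desktop)",
    "COUNT(sessions WHERE device = desktop) > 0",
    "COUNT(sessions WHERE device = tablet) > 0",
    "COUNT(sessions WHERE device = mobile) > 0",
    "COUNT(transactions WHERE sessions.device = tablet) > 0",
]
wanted = select_by_name(names, DEVICE_FLAG)
print(wanted)
```

With real feature objects, I assume the call would be select_by_name(feature_defs, DEVICE_FLAG, name_of=lambda f: f.get_name()), where feature_defs is the list returned by ft.dfs(..., features_only=True). I'd still prefer to have dfs skip the unwanted features in the first place.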
Update: Third case
Is there a way to limit what gets counted under where? For example:
esd3 = ft.demo.load_mock_customer(return_entityset=True)
esd3['sessions'].add_interesting_values()
esd3['products'].add_interesting_values()
run_dfs(esd3, primitive_options={
    "greater_than_scalar": {
        "ignore_entities": ["transactions", "sessions"],
    },
    "count": {
        "ignore_variables": {"transactions": ["session_id"]},
    }
})
gives:
[<Feature: zip_code>,
<Feature: COUNT(sessions)>,
<Feature: COUNT(transactions)>,
<Feature: COUNT(sessions WHERE device = desktop)>,
<Feature: COUNT(sessions WHERE device = tablet)>,
<Feature: COUNT(sessions WHERE device = mobile)>,
<Feature: COUNT(transactions WHERE sessions.device = mobile)>,
<Feature: COUNT(transactions WHERE products.brand = B)>,
<Feature: COUNT(transactions WHERE sessions.device = tablet)>,
<Feature: COUNT(transactions WHERE products.brand = A)>,
<Feature: COUNT(transactions WHERE sessions.device = desktop)>]
Is it possible to limit the COUNT(transactions WHERE ...) features to only products? I'd still want to keep the COUNT sessions ... features.
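To make the target concrete, this is the result I'm after, expressed as a post-filter on the names printed above. It is only a sketch of the desired output, not a way of configuring dfs: drop_where_features is a hypothetical helper, and it relies on the assumption that the where clause always appears as "COUNT(<entity> WHERE <other entity>." in the feature name.

```python
def drop_where_features(names, counted_entity, where_entity):
    """Drop COUNT features on counted_entity whose WHERE clause uses where_entity."""
    prefix = "COUNT({} WHERE {}.".format(counted_entity, where_entity)
    return [n for n in names if not n.startswith(prefix)]


# A subset of the names from the output above.
names = [
    "COUNT(sessions)",
    "COUNT(sessions WHERE device = desktop)",
    "COUNT(transactions WHERE sessions.device = mobile)",
    "COUNT(transactions WHERE products.brand = B)",
    "COUNT(transactions WHERE products.brand = A)",
]
kept = drop_where_features(names, "transactions", "sessions")
print(kept)
```

Here the COUNT(sessions ...) features and the products-based transaction counts survive, and only the transaction counts filtered on sessions.device are dropped.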