
I am trying to use the featuretools library to create new features on a simple dataset; however, whenever I try to use a bigger max_depth, nothing happens. Here is my code so far:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', max_depth=3,
                                      trans_primitives=['add_numeric', 'multiply_numeric'])

When I look at the features created, I get the basic combinations f1*f2 and f1+f2, but I would like more complex engineered features like f2*(f1+f2) or f1+(f2+f1). I thought increasing max_depth would do this, but apparently it does not.
How could I do this, if it is possible at all?

2 Answers


I have managed to answer my own question, so I'll post it here.
You can create deeper features by running "Deep Feature Synthesis" on already generated features. Here is an example:

# imports
import featuretools as ft

# creating the EntitySet
es = ft.EntitySet()
es.entity_from_dataframe(entity_id='data', dataframe=data, make_index=True, index='index')

# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data',
                                      trans_primitives=['add_numeric','multiply_numeric'])

# creating an EntitySet from the new features
deep_es = ft.EntitySet()
deep_es.entity_from_dataframe(entity_id='data', index='index', dataframe=feature_matrix)

# Run deep feature synthesis with transformation primitives
deep_feature_matrix, deep_feature_defs = ft.dfs(entityset=deep_es, target_entity='data',
                                                trans_primitives=['add_numeric', 'multiply_numeric'])

Now, looking at the columns of deep_feature_matrix, here is what we see (assuming a dataset with two features):
"f1", "f2", "f1+f2", "f1*f2", "f1+f1*f2", "f1+f1+f2", "f1*f2+f1+f2", "f1*f2+f2", "f1+f2+f2", "f1*f1*f2", "f1*f1+f2", "f1*f2*f1+f2", "f1*f2*f2", "f1+f2*f2"

I have also made a function that automatically does this (includes a full docstring):

def auto_feature_engineering(X, y, selection_percent=0.1, selection_strategy="best", num_depth_steps=2, transformatives=['divide_numeric', 'multiply_numeric']):
    """
    Automatically perform deep feature engineering and 
    feature selection.

    Parameters
    ----------
    X : pd.DataFrame
        Data to perform automatic feature engineering on.
    y : pd.DataFrame
        Target variable used to compute the correlations of all
        features at each depth step for feature selection.
        y is not needed if selection_percent=1.
    selection_percent : float, optional
        Defines what percent of all the new features to
        keep for the next depth step.
    selection_strategy : {'best', 'random'}, optional
        Strategy used for feature selection. If 'best',
        the best features are selected for the next depth
        step; if 'random', features are selected at random.
    num_depth_steps : integer, optional
        The number of depth steps. Every depth step, the model
        generates brand new features from the features made in 
        the last step, then selects a percent of these new
        features.
    transformatives : list, optional
        List of all possible transformations of the data to use
        when feature engineering. You can find the full list
        of possible transformations, as well as what each one
        does, using the following code:
        `ft.primitives.list_primitives()[ft.primitives.list_primitives()["type"]=="transform"]`
        (make sure to `import featuretools as ft` first).

    Returns
    -------
    pd.DataFrame
        a dataframe of the brand new features.
    """
    import featuretools as ft
    import pandas as pd
    from sklearn.feature_selection import mutual_info_classif
    selected_feature_df = X.copy()
    for i in range(num_depth_steps):
        
        # Perform feature engineering
        es = ft.EntitySet()
        es.entity_from_dataframe(entity_id='data', dataframe=selected_feature_df, 
                                 make_index=True, index='index')
        feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity='data', trans_primitives=transformatives)
        
        # Remove duplicate features (features with an identical correlation to the first column are treated as duplicates)
        feature_corrs = feature_matrix.corr()[list(feature_matrix.keys())[0]]
        
        existing_corrs = []
        good_keys = []
        for key in feature_corrs.to_dict().keys():
            if feature_corrs[key] not in existing_corrs:
                existing_corrs.append(feature_corrs[key])
                good_keys.append(key)
        feature_matrix = feature_matrix[good_keys]
        
        # Remove features that stack more raw features than allowed at this depth
        legal_features = list(feature_matrix.columns)
        for feature in list(feature_matrix.columns):
            raw_feature_list = []
            for j in range(len(feature.split(" "))):
                if j%2==0:
                    raw_feature_list.append(feature.split(" ")[j])
            if len(raw_feature_list) > i+2: # num_depth_steps = 1, means max_num_raw_features_in_feature = 2
                legal_features.remove(feature)
        feature_matrix = feature_matrix[legal_features]
        
        # Perform feature selection
        if int(selection_percent)!=1:
            if selection_strategy=="best":
                corrs = mutual_info_classif(feature_matrix.reset_index(drop=True), y)
                corrs = pd.Series(corrs, name="")
                selected_corrs = corrs[corrs>=corrs.quantile(1-selection_percent)]
                selected_feature_df = feature_matrix.iloc[:, list(selected_corrs.keys())].reset_index(drop=True)
            elif selection_strategy=="random":
                selected_feature_df = feature_matrix.sample(frac=(selection_percent), axis=1).reset_index(drop=True)
            else:
                raise Exception("selection_strategy can be either 'best' or 'random', got '"+str(selection_strategy)+"'.")
        else:
            selected_feature_df = feature_matrix.reset_index(drop=True)
        if num_depth_steps!=1:
            rename_dict = {}
            for col in list(selected_feature_df.columns):
                rename_dict[col] = "("+col+")"
            selected_feature_df = selected_feature_df.rename(columns=rename_dict)
    if num_depth_steps!=1:
        rename_dict = {}
        for feature_name in list(selected_feature_df.columns):
            rename_dict[feature_name] = feature_name[int(num_depth_steps-1):-int(num_depth_steps-1)]
        selected_feature_df = selected_feature_df.rename(columns=rename_dict)
    return selected_feature_df
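
As noted in the docstring, the full list of transform primitives (and what each one does) can be pulled straight from featuretools. A quick way to view it (the exact columns returned may vary between featuretools versions):

import featuretools as ft

# list all available transform primitives
primitives = ft.primitives.list_primitives()
print(primitives[primitives["type"] == "transform"])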

Here is an example of using the auto_feature_engineering function:

# Imports
>>> import seaborn as sns
>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.preprocessing import OrdinalEncoder

# Load the penguins dataset
>>> penguins = sns.load_dataset("penguins")
>>> penguins.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female 
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female

# Fill in NaN values of features using the distribution of the feature
>>> for feature in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]:
...     s = penguins[feature].value_counts(normalize=True)
...     missing = penguins[feature].isnull()
...     penguins.loc[missing, feature] = np.random.choice(s.index, size=len(penguins[missing]),p=s.values)

# Make X and y
>>> X = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
>>> y = penguins[["sex"]]

# Encode "sex" so that "Male" is 1 and "Female" is 0
>>> ord_enc = OrdinalEncoder()
>>> y = pd.DataFrame(ord_enc.fit_transform(y).astype(np.int8), columns=["sex"])

# Generate new dataset with more features
>>> penguins_with_more_features = auto_feature_engineering(X, y, selection_percent=1.)

# Correlations of the raw features
>>> find_correlations(X, y)
body_mass_g          0.422959
bill_depth_mm        0.353526
bill_length_mm       0.342109
flipper_length_mm    0.246944
Name: sex, dtype: float64

# Top 10% correlations of new features
>>> summarize_corr_series(find_top_percent(find_correlations(penguins_with_more_features, y), 0.1))
(flipper_length_mm / bill_depth_mm) / (body_mass_g):       0.7241123396175027
(bill_depth_mm * body_mass_g) / (flipper_length_mm):       0.7237223914820166
(bill_depth_mm * body_mass_g) * (bill_depth_mm):           0.7222108721971968
(bill_depth_mm * body_mass_g):                             0.7202272416625914
(bill_depth_mm * body_mass_g) * (flipper_length_mm):       0.6425813490692588
(bill_depth_mm * bill_length_mm) * (body_mass_g):          0.6398235593646668
(bill_depth_mm * flipper_length_mm) * (flipper_length_mm): 0.6360645935216128
(bill_depth_mm * flipper_length_mm):                       0.6083364815975281
(bill_depth_mm * body_mass_g) * (body_mass_g):             0.5888925994060027

In this example, we would like to predict the sex of penguins given the attributes body_mass_g, bill_depth_mm, bill_length_mm and flipper_length_mm.

You might notice the other mysterious functions I used in the example, namely find_correlations, summarize_corr_series and find_top_percent. These are other convenience functions I made to help summarize the results from auto_feature_engineering. Here is the code for them (they only carry brief comments rather than full docstrings):

def summarize_corr_series(feature_corr_series):
    # Print each feature name padded to a common width, followed by the
    # absolute value of its correlation, so the values line up in one column.
    max_feature_name_size = max(len(key) for key in feature_corr_series.index)
    for key, value in feature_corr_series.items():
        whitespace = " " * (max_feature_name_size - len(key))
        print(key + ": " + whitespace + str(abs(value)))

def find_top_percent(series, percent):
    # Keep only the entries above the (1 - percent) quantile of the series.
    return series[series > series.quantile(1 - percent)]

def find_correlations(X, y):
    # Correlate every feature in X with the target column in y and return the
    # absolute correlations, sorted in descending order (the target itself is dropped).
    combined = pd.concat([X.reset_index(drop=True), y.reset_index(drop=True)], axis=1)
    target_col = y.columns[0]
    return abs(combined.corr())[target_col].drop(target_col).sort_values(ascending=False)
  • The problem with this method is that you end up with stacked features comprising up to 4 base features rather than 3, e.g. `f1*f2*f1+f2` – amin_nejad Feb 15 '21 at 16:07
  • @amin_nejad That is true; the function I made removes them automatically, but if you're not using it, you can use some simple Python to remove features comprised of more base features than you wish, by looking at the feature name (sketched below). I will admit it's inefficient to create features only to remove them, but I'm not sure how to tell the `ft.dfs()` function not to combine certain features together. – MartinM Feb 15 '21 at 17:02
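
A minimal sketch of that name-based filter (drop_overstacked_features is a hypothetical helper, not part of featuretools; it assumes stacked features are named like "f1 + f2 * f3", with base column names at the even token positions, as in the function above):

# hypothetical helper: drop features whose name references too many base columns
def drop_overstacked_features(feature_matrix, max_raw_features=3):
    keep = []
    for name in feature_matrix.columns:
        tokens = name.split(" ")
        # every other token is a base feature name; the rest are operators
        if len(tokens[::2]) <= max_raw_features:
            keep.append(name)
    return feature_matrix[keep]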

It is really unfortunate that featuretools does not easily support this use case, since it appears to be quite common. The best way I've found to do this is to create the first-order features you want using the dfs function and then add the second-order features you want manually.

For instance, the MWE below (using the iris dataset) applies the AddNumeric primitive using dfs and then applies the DivideNumeric primitive to the newly created features, using only the original features as the second operand (while avoiding the same base feature appearing multiple times in a transformed feature).

import numpy as np
import pandas as pd
import sklearn.datasets
import featuretools as ft

iris = sklearn.datasets.load_iris()

data = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target'],
)

ignore_cols = ['target']

entity_set = ft.EntitySet(id="iris")
entity_set.entity_from_dataframe(
    entity_id="iris_main",
    dataframe=data,
    make_index=True,  # create the "index" column, since the dataframe has no such column
    index="index",
)

new_features = ft.dfs(
    entityset=entity_set,
    target_entity="iris_main",
    trans_primitives=["add_numeric"],
    features_only=True,
    primitive_options={
        "add_numeric": {
            "ignore_variables": {"iris_main": ignore_cols},
        },
    },
)

transformed_features = [i for i in new_features if isinstance(i, ft.feature_base.feature_base.TransformFeature)]
original_features = [i for i in new_features if isinstance(i, ft.feature_base.feature_base.IdentityFeature) and i.get_name() not in ignore_cols]

depth_two_features = []
for trans_feat in transformed_features:
    for orig_feat in original_features:
        if orig_feat.get_name() not in [i.get_name() for i in trans_feat.base_features]:
            feat = ft.Feature([trans_feat, orig_feat], primitive=ft.primitives.DivideNumeric)
            depth_two_features.append(feat)
            
data = ft.calculate_feature_matrix(
    features= original_features + transformed_features + depth_two_features,
    entityset=entity_set,
    verbose=True,
)
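
As a quick sanity check (a small follow-up snippet using the variables defined above), you can preview a few of the stacked feature names before computing the full matrix:

# preview a handful of the manually built depth-two feature names
for feat in depth_two_features[:5]:
    print(feat.get_name())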

The benefit of this approach is that it gives you more fine-grained control to customise the stacking however you want, and it avoids the computational cost of creating unnecessary features you don't want.

  • Thank you for the answer! This is definitely the more efficient approach (and a more direct answer to the question I asked in the first place). +1 from me. It is a shame that there's no easier way to do it, though... – MartinM Mar 11 '21 at 08:41
  • No worries! Yes, definitely a shame, and it could be documented better too. – amin_nejad Mar 12 '21 at 13:44