UPDATED QUESTION

I am trying to build a hybrid recommender system in Python based on the lightFM library. The input data contains users, actions (the things I am trying to recommend), and binary ratings. I also have user features that are updated with every entry (e.g. traffic activity, age in days, and a few more) in a typical cross-sectional data layout, as well as item features that are fairly static. The two dummy dataframes below represent the problem:

import pandas as pd

# create dummy dataset
df_user_inter = pd.DataFrame(
    {
        'deviceid': ['u1','u1','u2','u2', 'u3', 'u3', 'u3', 'u4', 'u4', 'u4', 'u4'], 
        'action_id': ['action1', 'action3', 'action1', 'action2', 'action3', 'action3', 'action1', 'action1', 'action2', 'action3', 'action2'], 
        'rating': [0,1, 1,0, 1,1,0, 0,0,1,0],
        'user_feature_1': [15,25, 28,7,21,18,5,8,2.5,12,0.5],
        'user_feature_2': [True, False, False, False, True, True, False, False, False, True, True],
        'user_feature_3' : [0,5,4,5,1,2,1,3,2,3,0]
    }
)
print(df_user_inter)
>>>
    deviceid    action_id   rating  user_feature_1  user_feature_2  user_feature_3
0     u1         action1      0         15.0            True               0
1     u1         action3      1         25.0            False              5
2     u2         action1      1         28.0            False              4
3     u2         action2      0          7.0            False              5
4     u3         action3      1         21.0             True              1
5     u3         action3      1         18.0             True              2
6     u3         action1      0          5.0            False              1
7     u4         action1      0          8.0            False              3
8     u4         action2      0          2.5            False              2
9     u4         action3      1         12.0             True              3
10    u4         action2      0          0.5             True              0

# create dummy item features
df_action = pd.DataFrame(
    {
        'action_id': ['action1', 'action2', 'action3'], 
        'action_feature_1': [0.15, 0, 0.25],
        'action_feature_2': ['A', 'B', 'C'],
        'action_feature_3': [True, False, False]
    }
)
print(df_action)
>>>
    action_id   action_feature_1    action_feature_2    action_feature_3
0    action1         0.15                  A                  True
1    action2         0.00                  B                 False
2    action3         0.25                  C                 False

I then use the Dataset() object from the lightFM module to build digestible features for users, items, and their interactions. There are two format options for building user/item features:

  • (user/item id, [features])
  • or (user/item id, {feature: feature weight})
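To make the two shapes concrete, here is a minimal sketch for a single item row in plain Python. The "column:value" strings used in the list form are a naming convention assumed purely for illustration, not something lightFM prescribes.

```python
# One item row from the dummy data, as a plain dict.
row = {'action_id': 'action1', 'action_feature_1': 0.15, 'action_feature_3': True}
feature_cols = ['action_feature_1', 'action_feature_3']

# Format 1: (id, [features]); each listed feature is simply marked as present.
# The "column:value" feature names are a hypothetical convention.
as_list = (row['action_id'], [f'{col}:{row[col]}' for col in feature_cols])

# Format 2: (id, {feature: weight}); numeric values are carried as weights.
as_dict = (row['action_id'], {col: float(row[col]) for col in feature_cols})

print(as_list)
print(as_dict)
```

With the list form, every distinct "column:value" string would have to be declared as its own feature, which is why it fits continuous columns poorly.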

The latter is mostly used when the dataframe mixes numerical and categorical variables, which is my case. So I use it like so in Python:

from lightfm import LightFM
from lightfm.data import Dataset

def prepare_data_inputs(df: pd.DataFrame, features_list: list, id_col: str):
    '''
    Preprocess a pandas dataframe of users/items data to prepare inputs for the Dataset object and the lightFM model.
    The data preparation handles a mixture of numerical and binary features but categorical features need to be
    dummified prior to using the function.
    
    inputs:
    -------
        - df: pandas dataframe that contains either user features (we can have multiple rows per user) or actions features.
        - features_list: list of categorical features to be considered for fitting
        - id_col: the action id or user id column name
    
    outputs:
    --------
        - user_feature_tuple: list of (user/item id, {feature: feature weight}) tuples
        example output: [('u1', {'user_feature_1': 15, 'user_feature_2': True}), ...]
    '''
    # loop over the rows of values for each row in the dataframe
    feature_value_list = []
    for values_list in df[features_list].values:

        # group data in {feature:value} format for each row
        row_feature_value_dict = {}
        for feat, val in zip(features_list, values_list):
            row_feature_value_dict[feat] = val
        
        # append the final results to a list that contains feature:value
        feature_value_list.append(row_feature_value_dict)

    # add user_id at the beginning of the tuple for lightfm input format
    user_feature_tuple = list(zip(df[id_col], feature_value_list))
    return user_feature_tuple

def get_features_list(df: pd.DataFrame, cols_to_ignore: list):
    ''' 
    Get list of features to be used for fitting a dataset object for lightFM with columns to ignore (e.g. id columns and rating).
    If the list cols_to_ignore contains features that are not in dataframe, the not-existing features are ignored.
    '''
    feature_list = [col for col in df.columns if col not in cols_to_ignore]
    return feature_list

# get unique user and item ID's for fitting a dataset object
users_idx_list = df_user_inter['deviceid'].unique()
actions_idx_list = df_user_inter['action_id'].unique()

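# id and rating columns are not features, so they are excluded from both feature lists
cols_to_ignore = ['deviceid', 'action_id', 'rating']

# get user and item feature names as lists for fitting a dataset object
user_features_list = get_features_list(df_user_inter, cols_to_ignore=cols_to_ignore)
item_features_list = get_features_list(df_action, cols_to_ignore=cols_to_ignore)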

# get user and item features as tuples for Dataset's build_item/user_features method
user_features_tuple = prepare_data_inputs(df_user_inter, user_features_list, id_col='deviceid')
item_features_tuple = prepare_data_inputs(df_action, item_features_list, id_col='action_id')

# create and fit dataset object 
dataset = Dataset()
dataset.fit(
        users=users_idx_list,
        items=actions_idx_list,
        user_features=user_features_list,
        item_features=item_features_list)

I then try to build the user/item features and the user item interactions matrix like so:

# get user item interaction tuple and build user item interaction matrix using dataset's method 
interaction_tuple = [(x[0], x[1], x[2]) for x in df_user_inter[['deviceid', 'action_id', 'rating']].values]
user_item_interactions, interaction_weights = dataset.build_interactions(interaction_tuple)

# build user and item features in the proper lightFM format
user_features = dataset.build_user_features(user_features_tuple, normalize=False)
item_features = dataset.build_item_features(item_features_tuple, normalize=False)
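As the docstring of prepare_data_inputs notes, categorical columns such as action_feature_2 have to be dummified before they go into the {feature: weight} form. A minimal sketch in plain Python, assuming a hypothetical helper dummify_row and a "column_value" naming scheme; the resulting names would also have to be the ones passed to dataset.fit(item_features=...):

```python
def dummify_row(row: dict, categorical_cols: list, categories: dict) -> dict:
    """Hypothetical helper: expand each categorical column into 0/1 indicators."""
    out = {}
    for col, val in row.items():
        if col in categorical_cols:
            # one indicator per known category of this column
            for cat in categories[col]:
                out[f'{col}_{cat}'] = 1.0 if val == cat else 0.0
        else:
            # numeric and boolean values become float weights
            out[col] = float(val)
    return out

row = {'action_feature_1': 0.25, 'action_feature_2': 'C', 'action_feature_3': False}
encoded = dummify_row(row, ['action_feature_2'], {'action_feature_2': ['A', 'B', 'C']})
print(encoded)
```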

Right, so far so good. I have now computed everything that a lightFM().fit() call can digest. The interesting part is that the build_user_features method has summed my features per user: I get one row per deviceid, with all of that user's feature values added together (e.g. [ 1. 0. 0. 0. 40. 1. 5.] for u1, where 40 = 15 + 25, 1 = True + False and 5 = 0 + 5, and the leading four entries are presumably the per-user identity indicators). Why is summing the features the default behavior of this method? And is there a better way to represent features for lightFM with cross-sectional / panel data?
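The feature part of the u1 vector can be reproduced by hand, which suggests that repeated (user, feature) entries are simply accumulated when the matrix is built. That is my reading of the output, not something I have confirmed in the library source. Below is a small plain-Python sketch of that accumulation, followed by one possible workaround that assumes row order reflects time: keep only each user's latest row so that there is exactly one feature dict per user. keep_latest is a hypothetical helper, not part of lightFM.

```python
from collections import defaultdict

# The two u1 rows from the dummy data, in (id, {feature: weight}) form.
rows = [
    ('u1', {'user_feature_1': 15, 'user_feature_2': True, 'user_feature_3': 0}),
    ('u1', {'user_feature_1': 25, 'user_feature_2': False, 'user_feature_3': 5}),
]

# Accumulate repeated (user, feature) entries, as the observed output suggests.
summed = defaultdict(lambda: defaultdict(float))
for user, feats in rows:
    for name, value in feats.items():
        summed[user][name] += float(value)
print(dict(summed['u1']))

def keep_latest(rows):
    """Hypothetical helper: keep the last feature dict seen for each id."""
    latest = {}
    for user, feats in rows:
        latest[user] = feats  # later rows overwrite earlier ones
    return list(latest.items())

print(keep_latest(rows))
```

Averaging per user instead of taking the latest row would be another option; which one makes sense depends on what the features mean.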

Thanks!

bmasri