UPDATED QUESTION
I am trying to build a hybrid recommender system in Python based on the LightFM library. The input data contains users, actions (the items I am trying to recommend) and binary ratings. I also have user features that are updated with every entry, e.g. traffic activity, age in days, and a few more, in a typical cross-sectional data layout, as well as fairly static item features. The following two dummy dataframes represent the problem:
import pandas as pd
# create dummy dataset
df_user_inter = pd.DataFrame(
    {
        'deviceid': ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u3', 'u4', 'u4', 'u4', 'u4'],
        'action_id': ['action1', 'action3', 'action1', 'action2', 'action3', 'action3', 'action1', 'action1', 'action2', 'action3', 'action2'],
        'rating': [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
        'user_feature_1': [15, 25, 28, 7, 21, 18, 5, 8, 2.5, 12, 0.5],
        'user_feature_2': [True, False, False, False, True, True, False, False, False, True, True],
        'user_feature_3': [0, 5, 4, 5, 1, 2, 1, 3, 2, 3, 0]
    }
)
print(df_user_inter)
>>>
    deviceid  action_id  rating  user_feature_1  user_feature_2  user_feature_3
 0        u1    action1       0            15.0            True               0
 1        u1    action3       1            25.0           False               5
 2        u2    action1       1            28.0           False               4
 3        u2    action2       0             7.0           False               5
 4        u3    action3       1            21.0            True               1
 5        u3    action3       1            18.0            True               2
 6        u3    action1       0             5.0           False               1
 7        u4    action1       0             8.0           False               3
 8        u4    action2       0             2.5           False               2
 9        u4    action3       1            12.0            True               3
10        u4    action2       0             0.5            True               0
# create dummy item features
df_action = pd.DataFrame(
    {
        'action_id': ['action1', 'action2', 'action3'],
        'action_feature_1': [0.15, 0, 0.25],
        'action_feature_2': ['A', 'B', 'C'],
        'action_feature_3': [True, False, False]
    }
)
print(df_action)
>>>
   action_id  action_feature_1  action_feature_2  action_feature_3
0    action1              0.15                 A              True
1    action2              0.00                 B             False
2    action3              0.25                 C             False
I then use the Dataset() object from the lightFM module to build digestible features for users, items and their interactions. There are two format options for building user/item features:
- (user/item id, [features])
- or (user/item id, {feature: feature weight})
The latter is mostly used when the dataframe contains a mixture of numerical and categorical variables, which is my case. So I use it like this in Python:
from lightfm import LightFM
from lightfm.data import Dataset
def prepare_data_inputs(df: pd.DataFrame, features_list: list, id_col: str):
    '''
    Preprocess a pandas dataframe of user/item data to prepare inputs for the Dataset object and the lightFM model.
    The data preparation handles a mixture of numerical and binary features, but categorical features need to be
    dummified prior to using the function.

    inputs:
    -------
    - df: pandas dataframe that contains either user features (there can be multiple rows per user) or action features.
    - features_list: list of feature column names to be considered for fitting
    - id_col: the action id or user id column name

    outputs:
    --------
    - user_feature_tuple: list of (user/item id, {feature: feature weight}) tuples
      example output: [('u1', {'user_feature_1': 15, 'user_feature_2': True}), ...]
    '''
    # loop over the rows of feature values in the dataframe
    feature_value_list = []
    for values_list in df[features_list].values:
        # group data in {feature: value} format for each row
        row_feature_value_dict = {}
        for feat, val in zip(features_list, values_list):
            row_feature_value_dict[feat] = val
        # append the row's {feature: value} dict to the results
        feature_value_list.append(row_feature_value_dict)
    # pair each id with its feature dict to match the lightfm input format
    user_feature_tuple = list(zip(df[id_col], feature_value_list))
    return user_feature_tuple
def get_features_list(df: pd.DataFrame, cols_to_ignore: list):
    '''
    Get the list of features used to fit a lightFM Dataset object, excluding the columns to ignore (e.g. id columns and rating).
    Entries of cols_to_ignore that are not columns of the dataframe are simply ignored.
    '''
    feature_list = [col for col in df.columns if col not in cols_to_ignore]
    return feature_list
# get unique user and item IDs for fitting a dataset object
users_idx_list = df_user_inter['deviceid'].unique()
actions_idx_list = df_user_inter['action_id'].unique()
# assumed value: the id and rating columns are excluded from the features
cols_to_ignore = ['deviceid', 'action_id', 'rating']
# get user and item feature names as lists for fitting a dataset object
user_features_list = get_features_list(df_user_inter, cols_to_ignore=cols_to_ignore)
item_features_list = get_features_list(df_action, cols_to_ignore=cols_to_ignore)
# get user and item features as tuples for Dataset's build_item/user_features method
user_features_tuple = prepare_data_inputs(df_user_inter, user_features_list, id_col='deviceid')
item_features_tuple = prepare_data_inputs(df_action, item_features_list, id_col='action_id')
# create and fit dataset object
dataset = Dataset()
dataset.fit(
    users=users_idx_list,
    items=actions_idx_list,
    user_features=user_features_list,
    item_features=item_features_list)
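For reference, here is a minimal, dependency-free sketch of what the helper above yields for the first two rows of df_user_inter. The rows are hard-coded copies of the dummy data, and to_feature_tuples is a hypothetical pure-Python stand-in for prepare_data_inputs, not part of lightFM. Note that u1 appears twice, once per row.

```python
# Hard-coded copy of the first two rows of df_user_inter (both belong to u1).
rows = [
    {'deviceid': 'u1', 'user_feature_1': 15.0, 'user_feature_2': True, 'user_feature_3': 0},
    {'deviceid': 'u1', 'user_feature_1': 25.0, 'user_feature_2': False, 'user_feature_3': 5},
]
features_list = ['user_feature_1', 'user_feature_2', 'user_feature_3']


def to_feature_tuples(rows, features_list, id_col):
    """Hypothetical stand-in for prepare_data_inputs: one (id, {feature: value}) tuple per row."""
    return [(row[id_col], {feat: row[feat] for feat in features_list}) for row in rows]


feature_tuples = to_feature_tuples(rows, features_list, id_col='deviceid')
for entry in feature_tuples:
    print(entry)
```

Since there is one tuple per dataframe row, any user with several rows is passed to the Dataset several times.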
I then try to build the user/item features and the user-item interaction matrix like so:
# get user item interaction tuple and build user item interaction matrix using dataset's method
interaction_tuple = [(x[0], x[1], x[2]) for x in df_user_inter[['deviceid', 'action_id', 'rating']].values]
user_item_interactions, interaction_weights = dataset.build_interactions(interaction_tuple)
# build user and item features in the proper lightFM format
user_features = dataset.build_user_features(user_features_tuple, normalize=False)
item_features = dataset.build_item_features(item_features_tuple, normalize=False)
Right, so far so good. I have computed everything that a lightFM().fit() call can digest. The interesting part is that the call to the build_user_features method has summed my features per user. I now get one row per deviceid, with all of its features summed together (e.g. [ 1. 0. 0. 0. 40. 1. 5.] for u1). Can anyone explain why summing the features is considered a good enough idea to be the default behavior of this method? And is there a better way to handle feature representation for lightFM in the case of cross-sectional / panel data?
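To make the behaviour concrete, the sketch below mimics the observed summation in plain Python and contrasts it with one possible alternative: keeping only the most recent row per user. Both helpers (sum_per_user, last_per_user) are hypothetical and not part of lightFM. The assumption that later rows are more recent is mine, and the leading 1. 0. 0. 0. in the row for u1, which I take to be per-user identity columns, is left out.

```python
from collections import defaultdict

features = ['user_feature_1', 'user_feature_2', 'user_feature_3']
# (deviceid, feature values) copied from the dummy df_user_inter, in row order.
rows = [
    ('u1', [15, True, 0]), ('u1', [25, False, 5]),
    ('u2', [28, False, 4]), ('u2', [7, False, 5]),
    ('u3', [21, True, 1]), ('u3', [18, True, 2]), ('u3', [5, False, 1]),
    ('u4', [8, False, 3]), ('u4', [2.5, False, 2]), ('u4', [12, True, 3]), ('u4', [0.5, True, 0]),
]


def sum_per_user(rows):
    """Add up repeated (user, feature) entries, as observed in the built matrix."""
    totals = defaultdict(lambda: [0.0] * len(features))
    for user, values in rows:
        for i, value in enumerate(values):
            totals[user][i] += float(value)
    return dict(totals)


def last_per_user(rows):
    """Keep only the last row seen for each user (assumes rows are in time order)."""
    latest = {}
    for user, values in rows:
        latest[user] = [float(v) for v in values]
    return latest


summed = sum_per_user(rows)
latest = last_per_user(rows)
print(summed['u1'])  # [40.0, 1.0, 5.0]
print(latest['u1'])  # [25.0, 0.0, 5.0]
```

The one-row-per-user values could then be passed to build_user_features as (id, {feature: weight}) tuples. Whether the latest snapshot, a mean, or some other aggregate is appropriate depends on the data.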
Thanks!