Is there a way to use sklearn.model_selection.train_test_split
to retain all unique values from a specific column (or columns) in the training set?
Let me set up an example. The most common matrix factorization problem I am aware of is predicting movie ratings for users, say in the Netflix Challenge or MovieLens data sets. Now, this question isn't really centered around any single matrix factorization approach, but within the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.
So in MovieLens 100k, for example, we have 943 unique users and 1682 unique movies. If we were to use train_test_split,
even with a high train_size
ratio (say 0.9), the training set would not contain all of the unique users and movies. This presents a problem, as the group of methods I mentioned would not be able to predict anything but 0 for movies or users they had not been trained on. Here is an example of what I mean.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
ml = pd.read_csv('ml-100k/u.data', sep='\t', names=['User_id', 'Item_id', 'Rating', 'ts'])
ml.head()
User_id Item_id Rating ts
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
ml.User_id.unique().size
943
ml.Item_id.unique().size
1682
utrain, utest, itrain, itest, rtrain, rtest = train_test_split(ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)
np.unique(utrain).size
943
np.unique(itrain).size
1644
Try this as many times as you like and you just won't end up with all 1682 unique movies in the train set. This is a result of a number of movies having only a single rating in the dataset. Luckily the same isn't true for users (the lowest number of ratings by a user is 20), so it isn't a problem there. But in order to have a functioning training set, we need all of the unique movies to appear in the training set at least once. Furthermore, I cannot utilize the stratify=
kwarg for train_test_split,
as not every user or movie has more than one entry.
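To make the stratify limitation concrete, here is a minimal sketch (with a hypothetical toy frame, not the MovieLens data) showing sklearn refusing to stratify when a class has only one member:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy ratings frame where Item_id 3 appears only once,
# mimicking the long tail of rarely rated movies.
toy = pd.DataFrame({
    'User_id': [1, 1, 2, 2, 3],
    'Item_id': [1, 2, 1, 2, 3],
    'Rating':  [4, 3, 5, 2, 1],
})

try:
    train_test_split(toy, train_size=0.8, stratify=toy.Item_id)
except ValueError as e:
    # sklearn rejects stratification when the least populated
    # class in the stratify array has only 1 member.
    print('stratify failed:', e)
```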
My question is this.
Is there a way in sklearn to split a dataset to ensure that the set of unique values from a specific column(s) are retained in the training set?
My rudimentary solution to the problem is as follows.
- Separate out the items/users that have a low number of total ratings.
- Create a train_test_split on the data excluding these rarely rated items/users (ensuring that the split size + the excluded size equals your desired split size).
- Combine the two to get a final, representative training set.
Example:
item_counts = ml.groupby(['Item_id']).size()
user_counts = ml.groupby(['User_id']).size()
rare_items = item_counts.loc[item_counts <= 5].index.values
rare_users = user_counts.loc[user_counts <= 5].index.values
rare_items.size
384
rare_users.size
0
# We can ignore users in this example
rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
rare_ratings.shape[0]
968
ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]
items = ml_less_rare.Item_id.values
users = ml_less_rare.User_id.values
ratings = ml_less_rare.Rating.values
# Establish number of items desired from train_test_split
desired_ratio = 0.9
train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
train_ratio = train_size / ml_less_rare.shape[0]
itrain, itest, utrain, utest, rtrain, rtest = train_test_split(items, users, ratings, train_size=train_ratio)
itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
np.unique(itrain).size
1682
utrain = np.concatenate((utrain, rare_ratings.User_id.values))
np.unique(utrain).size
943
rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))
This approach works, but I can't help feeling there is a way to accomplish the same thing with train_test_split
or another splitting method from sklearn.
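For reference, the whole workaround can be sketched as a single helper. This is my own compact restatement of the steps above, not an sklearn built-in; the function name split_keep_all_items and the use of pandas GroupBy.sample are my assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keep_all_items(df, train_size=0.9, seed=0):
    """Split df so every Item_id appears at least once in train.

    Hypothetical helper: seed the train set with one randomly
    chosen rating per item, then split the remaining rows so the
    overall train fraction still comes out to train_size.
    """
    # One row per Item_id, guaranteed into the training set.
    seed_rows = df.groupby('Item_id').sample(n=1, random_state=seed)
    rest = df.drop(seed_rows.index)

    # How many more training rows we need to hit the target size.
    n_train = int(train_size * len(df)) - len(seed_rows)
    train_rest, test = train_test_split(rest, train_size=n_train,
                                        random_state=seed)

    train = pd.concat([seed_rows, train_rest])
    return train, test
```

This keeps the same guarantee as the manual approach (every movie is trained on at least once) while leaving the test set free of forced rows; it assumes the DataFrame index is unique so drop() removes exactly the seeded rows.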
Caveat - Data Contains Single Entries for Users and Movies
The approach that @serv-inc proposes would work for data where every class is represented more than once. That is not the case with this data, nor with most recommendation/ranking data sets.