
Is there a way to use sklearn.model_selection.train_test_split to retain all of the unique values from a specific column (or columns) in the training set?

Let me set up an example. The most common matrix factorization problem I am aware of is predicting movie ratings for users, as in the Netflix Challenge or MovieLens data sets. Now this question isn't really centered around any single matrix factorization approach, but within the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.

So in MovieLens 100k, for example, we have 943 unique users and 1682 unique movies. If we were to use train_test_split, even with a high train_size ratio (say 0.9), the training set would not contain every unique user and movie. This presents a problem, as the group of methods I mentioned would not be able to predict anything but 0 for movies or users they had not been trained on. Here is an example of what I mean.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t', names=['User_id', 'Item_id', 'Rating', 'ts'])
ml.head()   
   User_id  Item_id Rating         ts
0      196      242      3  881250949
1      186      302      3  891717742
2       22      377      1  878887116
3      244       51      2  880606923
4      166      346      1  886397596
ml.User_id.unique().size
943
ml.Item_id.unique().size
1682
utrain, utest, itrain, itest, rtrain, rtest = train_test_split(ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)
np.unique(utrain).size
943
np.unique(itrain).size
1644

Try this as many times as you like and you just won't end up with 1682 unique movies in the train set. This is a result of a number of movies having only a single rating in the dataset. Luckily the same isn't true for users (the lowest number of ratings by a user is 20), so it isn't a problem there. But in order to have a functioning training set we need all of the unique movies to be in the training set at least once. Furthermore, I cannot use the stratify= kwarg for train_test_split, since not every user and not every movie has more than one entry.
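
To illustrate, attempting to stratify on Item_id fails immediately (the error message below is the one sklearn raises when a class has only a single member):

# Stratifying on Item_id fails because some movies have a single rating,
# and stratified splitting needs at least two samples per class.
try:
    train_test_split(ml, train_size=0.9, stratify=ml.Item_id)
except ValueError as e:
    print(e)
# The least populated class in y has only 1 member, which is too few.
# The minimum number of groups for any class cannot be less than 2.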

My question is this:

Is there a way in sklearn to split a dataset while ensuring that all of the unique values from a specific column (or columns) are retained in the training set?

My rudimentary solution to the problem is as follows.

  1. Separate out the items/users that have a low number of total ratings.
  2. Create a train_test_split on the data excluding these rarely rated items/users (ensuring that the split size plus the excluded size equals your desired split size).
  3. Combine the two to get a final, representative training set.

Example:

item_counts = ml.groupby(['Item_id']).size()
user_counts = ml.groupby(['User_id']).size()
rare_items = item_counts.loc[item_counts <= 5].index.values
rare_users = user_counts.loc[user_counts <= 5].index.values
rare_items.size
384
rare_users.size
0
# We can ignore users in this example
rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
rare_ratings.shape[0]
968
ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]
items = ml_less_rare.Item_id.values
users = ml_less_rare.User_id.values
ratings = ml_less_rare.Rating.values
# Establish the number of training rows still needed from train_test_split
desired_ratio = 0.9
train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
train_ratio = train_size / ml_less_rare.shape[0]
itrain, itest, utrain, utest, rtrain, rtest = train_test_split(items, users, ratings, train_size=train_ratio)
itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
np.unique(itrain).size
1682
utrain = np.concatenate((utrain, rare_ratings.User_id.values))
np.unique(utrain).size
943
rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))

This approach works, but I can't help feeling there is a way to accomplish the same thing with train_test_split or another splitting method from sklearn.
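
For anyone who needs to repeat this for several columns, the steps above can be wrapped into a small helper. This is only a sketch of my workaround; the function name, the min_count parameter, and the DataFrame return layout are my own choices, not anything provided by sklearn.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keeping_rare_in_train(df, col, train_size=0.9, min_count=6, **kwargs):
    """Force rows whose value in `col` occurs fewer than `min_count` times into
    the training set, and split the rest so the overall train fraction is
    still roughly `train_size`."""
    counts = df.groupby(col).size()
    rare_values = counts.loc[counts < min_count].index
    is_rare = df[col].isin(rare_values)
    rare_rows, common_rows = df.loc[is_rare], df.loc[~is_rare]
    # Shrink the ratio on the common rows to compensate for the pre-assigned rare rows
    ratio = (train_size * len(df) - len(rare_rows)) / len(common_rows)
    train, test = train_test_split(common_rows, train_size=ratio, **kwargs)
    return pd.concat([train, rare_rows]), test

train, test = split_keeping_rare_in_train(ml, 'Item_id', train_size=0.9)

Returning DataFrames rather than separate arrays keeps the columns together, which makes it easier to apply the same treatment to more than one column.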

Caveat - Data Contains Single Entries for Users and Movies

While the approach that @serv-inc proposes would work for data where every class is represented more than once, that is not the case with this data, nor with most recommendation/ranking data sets.

Grr
  • So you want all your *rare items* to be in the training set only? or to be duplicated in both the training and test set? I don't think you'll find a function for this in `sklearn`, both approaches are going to mess with your validation metrics, I guess. The first one sounds better anyways – filippo May 23 '18 at 17:20
  • @filippo My thought was it would be best to keep them in training. That is what I was doing with the approach I use. – Grr May 23 '18 at 17:33
  • My current situation is even a little more complex, since I have a dozen columns where the unique values need to stay in the training set. – herrherr May 24 '18 at 08:13
  • @Grr what do you hope to achieve from having these very rare items in the training set? – P.Tillmann May 24 '18 at 08:48
  • @P.Tillmann In the movie rating example, depending on the prediction methodology you may not be able to predict for a given user if that user does not exist in the training data. Matrix decomposition would be one such case where an m x n matrix must be decomposed to an m x r and r x n matrix to predict for all m users and n movies. In some cases these rare items actually aren't even rare. I have one data set where > 50% of the data are from users that made a single rating. – Grr May 24 '18 at 13:24
  • I think the best thing to do for a dataset like this would be to use something like a [stratified K-fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) based approach and average the model's performance on these K-folds, rather than a straightforward train-test split. – cs95 May 27 '18 at 08:05
  • @coldspeed: Nice idea. How would you stratify if sklearn complains that the classes are too small? (btw: how's life at Google?) – serv-inc May 27 '18 at 14:16
  • @serv-inc hmm, that might mean there are too many k-folds, I suppose a balance can be found with the right value of k. (And yes, it's everything I had imagined but better ;-).) – cs95 May 27 '18 at 18:59
  • @coldspeed didn't test `StratifiedKFold` but I think anything *stratified* is going to complain about single element classes as they cannot be split while maintaining stratification in each fold. Maybe something based on sampling with replacement could work, no idea if there's anything ready in `sklearn` – filippo May 30 '18 at 06:08
  • Well you could try converting the dataset into a Python set which will remove all duplicate values. – Vipul Rustagi Aug 08 '18 at 08:05

2 Answers


What you are looking for is called stratification. Luckily, sklearn has just that. Just change the line to

itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
     items, users, ratings, train_size=train_ratio, stratify=users)

If stratify is not set, data is shuffled randomly. See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

If [stratify is] not None, data is split in a stratified fashion, using this as the class labels.
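
As a quick illustration (a toy example of my own, not drawn from the question's data), stratification keeps every class represented in both splits as long as each class has at least two samples:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy example: two classes, four samples each.
X = np.arange(8).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.5, stratify=y, random_state=0)
print(np.bincount(y_tr), np.bincount(y_te))   # [2 2] [2 2]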


Update to the updated question: it seems that putting unique instances into the training set is not built into scikit-learn. You could abuse PredefinedSplit, or extend StratifiedShuffleSplit, but this might be more complicated than simply rolling your own.
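
If you do want to stay inside sklearn's splitter API, one way to "abuse" PredefinedSplit is to mark the rows that must remain in the training set with a test_fold value of -1, since such rows are never placed in a test set. A rough sketch, assuming ml and rare_items from the question are available:

import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 means "never put this row in a test set"; 0 means "test set of fold 0".
test_fold = np.full(len(ml), -1)
eligible = np.where(~ml.Item_id.isin(rare_items))[0]   # rows allowed in the test set
rng = np.random.default_rng(0)
test_fold[rng.choice(eligible, size=int(0.1 * len(ml)), replace=False)] = 0

ps = PredefinedSplit(test_fold)
train_index, test_index = next(ps.split())
train, test = ml.iloc[train_index], ml.iloc[test_index]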

serv-inc
  • Doesn't `stratify` need at least two samples per class? `train_test_split(ml, train_size=0.9, stratify=ml.Item_id)` gives `ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.`. i.e. how can it split data in two groups keeping stratification if there is only one element to split? – filippo May 23 '18 at 15:00
  • As @filippo pointed out, this approach does not work when there are classes with a single data point, as is the case with this and most other recommendation/ranking datasets. – Grr May 23 '18 at 17:09
  • @filippo: conceptually, and/or from experience, do you think it's possible to learn if you have just one sample per class? Or even less than, say, 30? – serv-inc May 24 '18 at 06:19
  • @serv-inc With [Inductive Matrix Completion](http://bigdata.ices.utexas.edu/software/inductive-matrix-completion/) you can incorporate side information into recommendation models. In this way a single rating becomes much more valuable as the model learns how side information from users and items interact to result in the given outcome. – Grr May 29 '18 at 14:10
  • @serv-inc you can "learn", but not much that's useful. However sophisticated the method used, with a single example you can only guess that the sample is an average of the class it is grouped in. It's better than knowing nothing, (and the best you can do with the information available) but you would expect this 'guess' to rarely be close to the information you'd get from a larger sample. – Mark_Anderson Nov 16 '18 at 14:59
  • @Mark_Anderson: Sure, if you lack data, you have a huge problem. this answer just remains here for informational purposes. After the question was updated, there is arguably little that it adds, except as a reference for the question update. – serv-inc Nov 17 '18 at 10:01
  • Don't sell yourself short. Even if it doesn't help the original asker much, your answer is great for someone who is trying to fix their ML problem via google and finds out that "stratification" is the magic technical word they needed AND you give them a guideline for implementing it. – Mark_Anderson Nov 19 '18 at 16:49

Maybe you can group your input data by movie, take a sample from each group, and then combine all the samples into one large data set.

# initialize lists
utrain_all = []
utest_all = []
itrain_all = []
itest_all = []
rtrain_all = []
rtest_all = []

grp_ml = ml.groupby('Item_id')
for name, group in grp_ml:
    utrain, utest, itrain, itest, rtrain, rtest = train_test_split(
        group.User_id, group.Item_id, group.Rating, train_size=0.9)
    utrain_all.append(utrain)
    utest_all.append(utest)
    itrain_all.append(itrain)
    itest_all.append(itest)
    rtrain_all.append(rtrain)
    rtest_all.append(rtest)

# combine the per-movie samples into one large train/test set
utrain, utest = np.concatenate(utrain_all), np.concatenate(utest_all)
itrain, itest = np.concatenate(itrain_all), np.concatenate(itest_all)
rtrain, rtest = np.concatenate(rtrain_all), np.concatenate(rtest_all)
Mikhail Venkov