
I want to split the following pivot table into training and testing sets (to evaluate a recommendation system), and was thinking of extracting two tables with non-overlapping indices (userID) and column values (ISBN). How can I split it properly? Thank you.

[Image: pivot table with userID as the index and ISBN as the columns]

Helen Grey
  • 2
    If you have the `scikit-learn` library, its `train_test_split` function makes splitting a dataframe simple. – moys Nov 28 '19 at 01:48
  • 2
    Just FYI regarding your table: recommendation data usually comes in the form `user, product, rating`. The problem with your matrix is that many cells will hold null or zero ratings (if zero represents a non-rating). The result is a giant, mostly empty table with significant memory overhead, and its size grows with the product of the number of users and the number of products. – Scratch'N'Purr Nov 28 '19 at 02:15
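To illustrate the long `user, product, rating` layout mentioned in the comment above, here is a minimal sketch. The toy pivot table, its ratings, and the `isbn_*` labels are made up for illustration, and it assumes that NaN marks a missing rating.

```python
import numpy as np
import pandas as pd

# Hypothetical toy pivot table: rows are userID, columns are ISBN,
# NaN marks a user/book pair without a rating (an assumption).
pivot = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 4.0, np.nan]],
    index=pd.Index([1, 2], name="userID"),
    columns=pd.Index(["isbn_a", "isbn_b", "isbn_c"], name="ISBN"),
)

# melt() turns each (userID, ISBN) cell into its own row;
# dropping NaN keeps only the ratings that actually exist.
long_df = (
    pivot.reset_index()
    .melt(id_vars="userID", var_name="ISBN", value_name="rating")
    .dropna(subset=["rating"])
    .reset_index(drop=True)
)
print(long_df)
```

In this layout each row is one observed rating, so the table grows with the number of ratings rather than with users times products.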

1 Answer


As suggested by @moys, you can use `train_test_split` from scikit-learn. To get non-overlapping column names, split the dataframe's columns first and then split the rows of each part.

Example:

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split

Generate data:

    df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

Split the columns of `df` in some way, e.g. in half:

    cols = len(df.columns) // 2   # number of columns in the first half
    df_A = df.iloc[:, :cols]      # first half of the columns
    df_B = df.iloc[:, cols:]      # remaining columns

Use train_test_split:

    train_A, test_A = train_test_split(df_A, test_size=0.33)
    train_B, test_B = train_test_split(df_B, test_size=0.33)
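Note that the two calls above shuffle rows independently, so a row index in `train_A` can also appear in `test_B`. If the goal is one training table and one testing table that share neither rows nor columns, one possible variant (a sketch, not the only way) is to split the row labels once and the column labels once, then select each block with `.loc`. The seeds and the 50/50 column split are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same kind of random demo data as above, seeded for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 4)), columns=list("ABCD"))

# Split the row labels once and the column labels once
train_rows, test_rows = train_test_split(
    df.index.to_numpy(), test_size=0.33, random_state=0
)
train_cols, test_cols = train_test_split(
    df.columns.to_numpy(), test_size=0.5, random_state=0
)

# Select the two blocks; they share neither rows nor columns
train = df.loc[train_rows, train_cols]
test = df.loc[test_rows, test_cols]

print(train.shape, test.shape)
```

This discards the two "off-diagonal" blocks (training rows with testing columns and vice versa), so whether it suits a recommender evaluation depends on how much data you can afford to leave unused.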

jpalm