I have a dataset of around 500 paragraphs. For each paragraph, I am trying to determine whether it links to any of the other paragraphs, so I have created paragraph pairs. I previously approached this as a binary classification problem (0 or 1: link or no link), but I now want to try ranking (assigning a probability to each paragraph pair).
My issue is: how do I split my data into train and test sets randomly while keeping all pairs for a given paragraph in the same set? For example, for paragraph 1, I want all of its associated pairs (1-2, 1-3, 1-4, ..., 1-500) in either the train set or the test set. The ranking will not work if, say, half of a paragraph's pairs end up in the training set, because the ranking for that paragraph in the test set would then be missing pairs.
Format
Paragraph A | Paragraph B | Label | Features...
Paragraph 1 | Paragraph 4 | 1 | ...
Paragraph 2 | Paragraph 6 | 1 | ...
Paragraph 6 | Paragraph 8 | 0 | ...
Paragraph 10 | Paragraph 2 | 1 | ...
I am currently using sklearn's train_test_split:
import pandas as pd
from sklearn.model_selection import train_test_split

feature_headers = ['tfidf_cosine', 'count_vec_cosine', 'lda_50topics_cosine', 'lda_200topics_cosine']
target_header = ['label']

train_x, test_x, train_y, test_y = train_test_split(
    result[feature_headers], result[target_header], train_size=0.7
)
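For reference, one way to get the grouped split described above is sklearn's GroupShuffleSplit, passing the "query" paragraph column as the groups. This is a sketch, not my current code: the column names and the small result DataFrame below are made up to match the table format above.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical pair data in the same shape as the table above
# (paragraph_a is the "query" paragraph whose pairs must stay together).
result = pd.DataFrame({
    'paragraph_a':  [1, 1, 1, 2, 2, 6, 6, 10],
    'paragraph_b':  [2, 3, 4, 6, 7, 8, 9, 2],
    'label':        [0, 0, 1, 1, 1, 0, 0, 1],
    'tfidf_cosine': [0.2, 0.1, 0.8, 0.6, 0.5, 0.1, 0.3, 0.7],
})

# One random 70/30 split where whole groups (all rows sharing a
# paragraph_a value) go to either train or test, never both.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=42)
train_idx, test_idx = next(splitter.split(result, groups=result['paragraph_a']))

train, test = result.iloc[train_idx], result.iloc[test_idx]

# No paragraph_a value appears in both sets.
assert set(train['paragraph_a']).isdisjoint(set(test['paragraph_a']))
```

Note that the split ratio then applies to the number of paragraphs (groups), not the number of pairs, so the actual row counts can deviate from 70/30 when groups have different sizes.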