I have a dataset of around 500 different paragraphs. For each paragraph, I am trying to see whether there is a link to any of the other paragraphs, and I have created paragraph pairs on that basis. I previously approached this as a binary classification problem (0 or 1: is there a link or not), but I now want to try ranking (assigning a probability to each paragraph pair).

My issue is: how do I split my data into train and test sets randomly while keeping all paragraph pairs for a given paragraph in the same set? For example, for paragraph 1, I want all associated pairs (1-2, 1-3, 1-4, 1-5, ..., 1-500) to be in either the test set or the train set. My ranking will not work if half the pairs are in the training set for example, since then the ranking for the test set will be missing some pairs...

Format

Paragraph A  | Paragraph B | Label | Features...
Paragraph 1  | Paragraph 4 | 1     | ...
Paragraph 2  | Paragraph 6 | 1     | ...
Paragraph 6  | Paragraph 8 | 0     | ...
Paragraph 10 | Paragraph 2 | 1     | ...

I am using the sklearn train_test_split:

import pandas as pd
from sklearn.model_selection import train_test_split

# `result` is the DataFrame of paragraph pairs in the format shown above
feature_headers = ['tfidf_cosine', 'count_vec_cosine', 'lda_50topics_cosine', 'lda_200topics_cosine']
target_header = ['label']

# Plain row-wise random split: pairs of the same paragraph can land in both sets
train_x, test_x, train_y, test_y = train_test_split(result[feature_headers], result[target_header],
                                                    train_size=0.7)
Mia
  • You need to assign `groups` according to the `paragraphA` and then use either [GroupKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html) or [GroupShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupShuffleSplit.html) – Vivek Kumar May 25 '18 at 12:24
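
The grouping suggested in the comment above could look roughly like the sketch below. It is only an illustration under assumptions: the `paragraph_a` and `paragraph_b` column names and the randomly generated `result` frame are made up here to stand in for the real data, and only one of the feature columns is included. The idea is that every row sharing the same Paragraph A gets the same group id, so `GroupShuffleSplit` keeps all of that paragraph's pairs on one side of the split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical stand-in for the real `result` DataFrame:
# 10 source paragraphs, each paired with every other paragraph.
n_paragraphs = 10
rng = np.random.default_rng(0)
pairs = [(a, b)
         for a in range(1, n_paragraphs + 1)
         for b in range(1, n_paragraphs + 1)
         if a != b]
result = pd.DataFrame(pairs, columns=["paragraph_a", "paragraph_b"])
result["tfidf_cosine"] = rng.random(len(result))   # made-up feature values
result["label"] = rng.integers(0, 2, len(result))  # made-up labels

feature_headers = ["tfidf_cosine"]
target_header = ["label"]

# Group rows by their source paragraph so a paragraph's pairs are never split up.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.7, random_state=0)
train_idx, test_idx = next(
    splitter.split(result[feature_headers], result[target_header],
                   groups=result["paragraph_a"])
)

train_x = result.iloc[train_idx][feature_headers]
test_x = result.iloc[test_idx][feature_headers]
train_y = result.iloc[train_idx][target_header]
test_y = result.iloc[test_idx][target_header]

# Source paragraphs that ended up on each side of the split.
train_groups = set(result.iloc[train_idx]["paragraph_a"])
test_groups = set(result.iloc[test_idx]["paragraph_a"])
```

Note that `train_size=0.7` here applies to the number of groups (source paragraphs), not to the number of rows. A paragraph can still appear as Paragraph B in the other set; only the Paragraph A side is kept together.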

1 Answer

You are asking us how to make it so you can overfit your model...

"My ranking will not work if half the pairs are in the training set for example, since then the ranking for the test set will be missing some pairs..."

Your ranking must work if some (most!) of your pairs are not in the test set; otherwise, what is the point of generating the network?

In any case, what you are asking is mathematically impossible. The only way you could separate out the paragraphs the way you ask is if you have two completely unrelated sets with no overlap at all. If you imagine your paragraphs as nodes in a graph and the connections as edges, your best-case scenario is that you end up with two islands with only a single connection between them. If that connection is between paragraphs 1 and 2, then it's clear that both of those must have at least one pairing in each set.

Turksarama
  • No, not in this case, because there are actually references within each paragraph. For example, paragraph 1 may contain a reference to paragraph 4, but paragraph 4 may not reference paragraph 1. In this case I am counting paragraph 1-4 as a link, and paragraph 4-1 as no link. :) Sorry, I should have explained that better. I am basically looking at "outgoing/one-directional" links. – Mia May 25 '18 at 12:27
  • I'm not looking to make a network. I'm looking to rank each paragraph: Paragraph 1: top 5 most likely links, Paragraph 2: top 10 most likely links, and so on – Mia May 25 '18 at 12:28
  • You still have the same problem. You still have a graph with connections; however, they are only in one direction. In order to separate your sets you need to build the connections first, but the connections are exactly what you're creating the network to find. – Turksarama May 25 '18 at 12:45
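
For reference, the per-paragraph ranking Mia describes in the comments (top k most likely links for each paragraph) might be sketched as below. This is an assumption-laden illustration, not anything from the thread: the `score` column stands in for whatever link probability a model would predict, and the column names, values, and the `top_k_links` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical scored pairs; `score` stands in for a predicted link probability.
scored = pd.DataFrame({
    "paragraph_a": [1, 1, 1, 2, 2, 2],
    "paragraph_b": [2, 3, 4, 1, 3, 4],
    "score": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
})

def top_k_links(df, k):
    """Return the k highest-scoring candidate links for each source paragraph."""
    ranked = df.sort_values(["paragraph_a", "score"], ascending=[True, False])
    # After sorting, head(k) keeps the first k rows of every paragraph_a group.
    return ranked.groupby("paragraph_a").head(k)

top2 = top_k_links(scored, k=2)
```

A ranking like this is only complete for a paragraph if all of its outgoing pairs were scored together, which is the reason the question asks for them to stay in the same set.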