
I want to split the following numpy arrays for training and testing: X, y and qid

  • X is a set of featurized documents - shape: (140, 105)
  • qid is a set of query identifiers for each document - shape: (140,)
  • y is a set of labels for each (X, qid) pair - shape: (140,)

At the moment, what I do for splitting is:

# Split documents, labels, and query ids into training (70%) and testing (30%)
X_tr, X_tst, y_tr, y_tst, qid_tr, qid_tst = train_test_split(
    X, y, qid, test_size=0.3, random_state=1, shuffle=True, stratify=qid
)

The problem is that after splitting, I need the returning numpy arrays to be sorted by qid. That is, all the documents with the same qid need to be together (one after another) as a block (both in training and testing).

Example

Correct split:

X              qid           y       
------------------------------
document 1     0             0
document 5     0             1
document 4     1             1
document 6     1             0
document 9     2             1

Incorrect split:

X              qid           y       
------------------------------
document 1     0             0
document 4     1             1
document 9     2             1
document 5     0             1
document 6     1             0

Is there any way to make this possible?
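(For illustration, the ordering described above can be achieved with a stable `np.argsort` on the `qid` array after splitting. The toy arrays below are stand-ins for `train_test_split` output, not the question's real data:)

```python
import numpy as np

# Toy stand-ins for the arrays returned by train_test_split
# (rows are documents, as in the question)
X_tr = np.array([[1.0], [9.0], [5.0], [4.0]])
y_tr = np.array([0, 1, 1, 1])
qid_tr = np.array([0, 2, 0, 1])

# A stable sort keeps the original relative order of ties, so all
# documents sharing a qid end up as one contiguous block
order = np.argsort(qid_tr, kind="stable")
X_tr, y_tr, qid_tr = X_tr[order], y_tr[order], qid_tr[order]

print(qid_tr)  # qids are now grouped: [0 0 1 2]
```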

desertnaut
krakken
  • What do you mean "*another way of splitting the data*"? You want your data split *this* way, no? – desertnaut Apr 06 '22 at 15:11
  • I mean, it can be split in this way ofc, but with all the documents with the same qid together – krakken Apr 06 '22 at 15:13
  • You want [`np.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html). I don't know exactly what `qid` is, but be careful with splitting data like this: if the records with a given `qid` are related, you probably do not want to split them across train and val/test (read about 'leakage'). – Matt Hall Apr 06 '22 at 15:14
  • You ask 1) if is there any way to make this possible (i.e. as you describe, with all the documents with the same qid together) or 2) another way of splitting the data. And I ask - what do you mean by this "other way"? And what is "ofc"? – desertnaut Apr 06 '22 at 15:15
  • @kwinkunks The actual split maintains the relationship between the columns, the problem is that I want the records to be ordered by qid – krakken Apr 06 '22 at 15:17
  • @desertnaut 1) or 2) (any of the two options) would be okay for me while the result is the expected one. And ofc == of course – krakken Apr 06 '22 at 15:20
  • I am trying to say that #2 does not make any sense; you want your data split *this* way. I am editing out the 2nd part... – desertnaut Apr 06 '22 at 15:22
  • _This way_ is my initial approach, but if there is a better way, I could try another one (that was the purpose of #2) – krakken Apr 06 '22 at 15:26
  • @krakken I understand, and it sounds like you're sorted now... Just be careful splitting records with the same `qid` across train and test. Depending on exactly what that is, it might be an issue, because models can 'memorize' queries and thereby 'cheat' on validation. – Matt Hall Apr 06 '22 at 15:38

1 Answer


There is a very simple way to split data into training and test sets. While splitting, you want to ensure two things:

  1. Your data is shuffled properly. Datasets often come in some order, and shuffling usually gives better results.
  2. You get the same set of rows in the train and test split each time (fix a random seed for reproducibility).

For that, you can simply create one DataFrame by joining your X, qid, and y arrays, then use pandas to shuffle it and split it into train and test sets.

import pandas as pd

# Combine features, query ids, and labels into one DataFrame
df = pd.DataFrame(X)
df["qid"] = qid
df["y"] = y

# Shuffle your dataset (fix random_state for reproducibility)
shuffle_df = df.sample(frac=1, random_state=1)

# Define a size for your train set
train_size = int(0.7 * len(df))

# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]

Now you can sort the training set by the qid column and split it back into separate arrays to obtain X_train, y_train, and qid_train. Do the same for the test set.
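That last step can be sketched as follows (the small DataFrame below is a hypothetical stand-in for the `train_set` built above, with one feature column `f0`):

```python
import pandas as pd

# Hypothetical combined frame: a feature column plus 'qid' and 'y'
train_set = pd.DataFrame({
    "f0": [0.1, 0.2, 0.3, 0.4],
    "qid": [2, 0, 1, 0],
    "y": [1, 0, 1, 1],
})

# Sort so documents with the same qid form contiguous blocks
# (mergesort is stable, preserving the order within each qid)
train_set = train_set.sort_values("qid", kind="mergesort")

# Split back into the three numpy arrays
qid_train = train_set["qid"].to_numpy()
y_train = train_set["y"].to_numpy()
X_train = train_set.drop(columns=["qid", "y"]).to_numpy()
```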

Devesh