I want to split the following numpy arrays for training and testing: X, y and qid.
X is a set of featurized documents - shape: (140, 105)
qid is a set of query identifiers for each document - shape: (140,)
y is a set of labels for each (X, qid) pair - shape: (140,)
At the moment, what I do for splitting is:
from sklearn.model_selection import train_test_split
# Split documents, labels, and query_ids into training (70%) and testing (30%)
X_tr, X_tst, y_tr, y_tst, qid_tr, qid_tst = train_test_split(X, y, qid, test_size=0.3, random_state=1, shuffle=True, stratify=qid)
The problem is that after splitting, I need the returned numpy arrays to be sorted by qid. That is, all the documents with the same qid need to be together (one after another) as a block, in both the training and the testing set.
Example
Correct split:
X            qid   y
------------------------------
document 1   0     0
document 5   0     1
document 4   1     1
document 6   1     0
document 9   2     1
Incorrect split:
X            qid   y
------------------------------
document 1   0     0
document 4   1     1
document 9   2     1
document 5   0     1
document 6   1     0
Is there any way to make this possible?
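For reference, one direction that might work is to reorder each split after the fact with a stable argsort on qid. The sketch below is only an illustration: `sort_by_qid` is a hypothetical helper name, not part of scikit-learn, and the small arrays are the rows from the "Incorrect split" example above, with the document numbers standing in for feature rows.

```python
import numpy as np

def sort_by_qid(X_part, y_part, qid_part):
    """Reorder one split so that rows sharing a qid are contiguous.

    Hypothetical helper (not a scikit-learn function). A stable sort keeps
    the relative order of documents within the same qid.
    """
    order = np.argsort(qid_part, kind="stable")
    return X_part[order], y_part[order], qid_part[order]

# Rows of the "Incorrect split" example; document numbers stand in for X rows.
docs = np.array([1, 4, 9, 5, 6])
qids = np.array([0, 1, 2, 0, 1])
labels = np.array([0, 1, 1, 1, 0])

docs_s, labels_s, qids_s = sort_by_qid(docs, labels, qids)
print(docs_s)    # [1 5 4 6 9]
print(qids_s)    # [0 0 1 1 2]
print(labels_s)  # [0 1 1 0 1]
```

The same call should apply unchanged to the real outputs, e.g. `sort_by_qid(X_tr, y_tr, qid_tr)` and `sort_by_qid(X_tst, y_tst, qid_tst)`, since indexing a 2-D array with `order` reorders its rows.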