
I have multiple sets of different lengths and I wish to randomly sort these sets into two supersets such that:

  1. Each set appears in exactly one superset, and
  2. The sum of the lengths of all sets in a superset is as close as possible to a defined proportion of the sum of the lengths of all sets.

Example:

Given the following sets:

         Set1  Set2  Set3  Set4  Set5  Set6
Length      1     2     3     4     5     6

These are some possible supersets:

Target 50% - 50%: Superset1 = (set1, set2, set3, set4), total length 1+2+3+4 = 10; Superset2 = (set5, set6), total length 5+6 = 11
Target 50% - 50%: Superset1 = (set4, set6), total length 4+6 = 10; Superset2 = (set1, set2, set3, set5), total length 1+2+3+5 = 11
Target 60% - 40%: Superset1 = (set2, set5, set6), total length 2+5+6 = 13; Superset2 = (set1, set3, set4), total length 1+3+4 = 8
Target 90% - 10%: Superset1 = (set1, set3, set4, set5, set6), total length 1+3+4+5+6 = 19; Superset2 = (set2), total length 2

In reality my sets have lengths in the thousands but I have used small values for simplicity of illustration.
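To illustrate the kind of split I am after, a greedy largest-first assignment gets close to the target in the toy example above. This is only a rough sketch of an idea, not something I have validated; `split_sets` is a name I made up:

```python
# Greedy heuristic sketch: sort sets by length descending, then assign
# each set to whichever superset is currently furthest below its target.
def split_sets(lengths, proportion=0.5):
    """Partition set indices into two supersets whose total lengths
    approximate `proportion` and `1 - proportion` of the grand total."""
    total = sum(lengths)
    targets = [proportion * total, (1 - proportion) * total]
    sums = [0, 0]
    supersets = ([], [])
    # Largest-first assignment keeps the final imbalance small.
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        # Pick the superset with the larger remaining deficit.
        j = 0 if targets[0] - sums[0] >= targets[1] - sums[1] else 1
        supersets[j].append(i)
        sums[j] += lengths[i]
    return supersets, sums

supersets, sums = split_sets([1, 2, 3, 4, 5, 6], proportion=0.5)
# For the 50%-50% target this yields totals of 11 and 10.
```

This is essentially the (NP-hard) partition problem, so a greedy pass only approximates the target, but with thousands of sets the relative error should be small.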

The purpose of this task is to split a dataset into training and test sets for machine learning in Python with scikit-learn. Normally I would use scikit-learn's train_test_split function, but to my knowledge it is inadequate here: rather than a random split of all rows, no row in the training data may share a set with any row in the test data (the 'set' of any given row is given by one of the columns in the dataset).

So far I have simply used the scikit-learn train-test split function on the sets themselves rather than on the actual data rows, but depending on the lengths of the sets, the resulting training and test sets can obviously be far from the desired proportion.
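To make this set-level (rather than row-level) splitting concrete, here is a minimal sketch using scikit-learn's GroupShuffleSplit; the toy data and group labels below are made up for illustration. Note that its train_size is a proportion of *groups*, not of rows, which is exactly why the row proportions can drift when group sizes vary:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))                           # 21 toy rows
groups = np.repeat(np.arange(6), [1, 2, 3, 4, 5, 6])   # set id per row

# train_size=0.5 means half of the *groups*, not half of the rows.
gss = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No set straddles the split, but the row proportion on each side is
# whatever the chosen groups happen to add up to.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```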

As an analogy, say I have a list of house prices alongside square footage, garden size, and distance to nearest school, and I have the same list for various different countries. Eventually the task is to predict house prices for countries where we do not have any house price data, but we do have all the other data. So in order to evaluate the performance of our prediction algorithm, the training set must contain entirely different countries from the testing set.

I'm drawing a bit of a blank on how to actually achieve this; presumably there is a name for this general problem, but I am unsure what to search for.

Any help or pointers greatly appreciated.

1 Answer


I was able to get this behaviour using https://github.com/Yoyodyne-Data-Science/GroupStratifiedShuffleSplit. The author describes it as follows:

generates stratified and grouped cross validation folds

This creates the desired split proportion while also ensuring that the groups I need to keep separate are still separate in the split.