I am using TrainTestSplit in ML.NET to repeatedly split my data set into a training set and a test set. In sklearn, for example, the corresponding function takes a seed as input, so it is possible to obtain different splits. In ML.NET, however, repeated calls to TrainTestSplit seem to return the same split. Is it possible to change the random seed used by TrainTestSplit?

Petter T
  • Briefly looking at [the source](https://github.com/dotnet/machinelearning/blob/dd4320d86614baa85d9e205a8b604ab9874b0589/src/Microsoft.ML.Data/Training/TrainingStaticExtensions.cs#L36), it doesn't seem like there is a seed parameter to pass in. That may change in the future, though. – Jon Nov 15 '18 at 14:07
  • `train_test_split` also has a parameter `shuffle`, which is `True` by default. If you set it to `False`, changing `random_state` will have no effect. You should investigate whether ML.NET has a shuffling utility that accepts a seed. You could then use it to randomly shuffle the data before passing it to `TrainTestSplit`. – Vivek Kumar Nov 16 '18 at 07:04

2 Answers


Right now TrainTestSplit doesn't take a random seed. There is an open issue in ML.NET to fix this: https://github.com/dotnet/machinelearning/issues/1635

As a short-term workaround, I recommend manually adding a random column to the data view, and using it as a stratificationColumn in TrainTestSplit:

```csharp
// Add a column of pseudo-random numbers to the data view.
data = new GenerateNumberTransform(mlContext, new GenerateNumberTransform.Arguments
{
    Column = new[] { new GenerateNumberTransform.Column { Name = "random" } },
    Seed = 42 // change seed to get a different split
}, data);

// Split on the random column so that the seed above controls the split.
(var train, var test) = mlContext.Regression.TrainTestSplit(data, stratificationColumn: "random");
```

This code works with ML.NET 0.7, and we will fix the seed issue in 0.8.

Zruty
  • Thanks @Zruty. About your workaround: I am already using stratification in my case, will your workaround work then? – Petter T Nov 18 '18 at 09:46
  • No. But you can hash your stratification column (using different hash seeds) and then use the result as a stratification column. – Zruty Nov 19 '18 at 05:28
  • In the current ML.NET (I'm using v1.4), this method has been moved from `mlContext.Regression.TrainTestSplit(data)` to `mlContext.Data.TrainTestSplit(data)`. – FlixMa Dec 05 '19 at 09:49

As of today (ML.NET v1.0), this has been solved. TrainTestSplit takes a seed as input, and it also supports stratification by setting samplingKeyColumnName:

```csharp
TrainTestSplit(IDataView data, double testFraction = 0.1, string samplingKeyColumnName = null, Nullable<int> seed = null);
```

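
To illustrate repeated splitting with different seeds, here is a minimal sketch. It assumes ML.NET 1.x (the `Microsoft.ML` NuGet package) and calls the method through `mlContext.Data`, as noted in the comment on the other answer. The `Row` type, the `SplitDemo` class, the `TestFeatures` helper, and the toy data are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type, used only for this illustration.
public class Row
{
    public float Feature { get; set; }
    public float Label { get; set; }
}

public static class SplitDemo
{
    // Splits the data with the given seed and returns the Feature values
    // of the rows that ended up in the test set.
    public static List<float> TestFeatures(MLContext mlContext, IDataView data, int seed)
    {
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: seed);
        return mlContext.Data
            .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false)
            .Select(r => r.Feature)
            .ToList();
    }

    public static void Main()
    {
        var mlContext = new MLContext();

        // Toy data set: 100 rows with Feature = i and Label = 2 * i.
        var rows = Enumerable.Range(0, 100)
            .Select(i => new Row { Feature = i, Label = 2f * i })
            .ToList();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // A different seed for each split; reusing a seed should reproduce the split.
        foreach (var seed in new[] { 1, 2, 3 })
        {
            var test = TestFeatures(mlContext, data, seed);
            Console.WriteLine(
                $"seed {seed}: {test.Count} test rows, first few: {string.Join(", ", test.Take(5))}");
        }
    }
}
```

Note that `testFraction` is approximate, so the test set size may vary slightly from one seed to another. To keep related rows on the same side of the split, pass a column name as `samplingKeyColumnName` in the same call.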
Petter T