I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks @jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here's the rough steps:
- Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
- For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
- For each set type (i.e. train, test, validation):
- Create a new recipe from one of the sets
- Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!