
I have a text classification dataset that I saved as Parquet with Dask to save disk space, but I now run into a problem when I try to split it into train and test sets with dask_ml.model_selection.train_test_split.

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

ddf = dd.read_parquet('/storage/data/cleaned')
y = ddf['category'].values
X = ddf.drop('category', axis=1).values
train, test = train_test_split(X, y, test_size=0.2)  # raises the TypeError below

This results in `TypeError: Cannot operate on Dask array with unknown chunk sizes`.

Thanks for the help.

osterburg
  • What happens if you drop all the `.values`? – Inon Peled Mar 31 '19 at 16:33
  • And just out of curiosity, have you created the parquet file from CSV? If so, I would be happy to hear how you did it. – Inon Peled Mar 31 '19 at 16:36
  • @InonPeled you can convert a csv file to parquet with dask like `dd.read_csv('file.csv').repartition(npartitions=10).to_parquet('your_parquet_directory')`. The number of partitions is up to you, but it is recommended to keep the resulting file sizes around 100MB (see the fuller sketch below). – osterburg Mar 31 '19 at 17:29
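For reference, a minimal sketch of the conversion described in that comment; `file.csv` and `your_parquet_directory` are hypothetical names, and the partition count is only an example:

import dask.dataframe as dd

ddf = dd.read_csv('file.csv')             # hypothetical input CSV
ddf = ddf.repartition(npartitions=10)     # choose npartitions so each partition ends up around 100MB
ddf.to_parquet('your_parquet_directory')  # hypothetical output directory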

1 Answer


Here is what I ended up doing for the time being:

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

ddf = dd.read_parquet('/storage/data/cleaned')
ddf = ddf.to_dask_array(lengths=True)  # lengths=True computes the chunk sizes
train, test = train_test_split(ddf, test_size=0.2)

This creates a dask.array with known chunk sizes, e.g. `dask.array<array, shape=(3937987, 2), dtype=object, chunksize=(49701, 2)>`.
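As Inon Peled's comment hints, another option may be to skip the conversion to arrays entirely and pass the Dask DataFrame/Series straight to train_test_split. This is only a minimal sketch, assuming your dask-ml version accepts Dask collections directly (the `category` column name is taken from the question):

import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

ddf = dd.read_parquet('/storage/data/cleaned')
X = ddf.drop('category', axis=1)  # keep as a Dask DataFrame, no .values
y = ddf['category']               # keep as a Dask Series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

This avoids the array conversion and the up-front computation of chunk lengths that lengths=True triggers.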

osterburg