
I want to load a large dataset, apply some transformations to some fields, sample a small subset of the results, and store it as files so that later I can just load from there.

Basically something like this:

ds = datasets.load_dataset("XYZ", name="ABC", split="train", streaming=True)
ds = ds.map(_transform_record)
ds.shuffle()[:N].save_to_disk(...)

IterableDataset doesn't have a save_to_disk() method. That makes sense since it's backed by an iterator, but then I'd expect some way to convert an iterable dataset into a regular one (by iterating over it all and storing it in memory/on disk, nothing too fancy).

I tried to use Dataset.from_generator() with the IterableDataset as the generator (iter(ds)), but it doesn't work because it tries to serialize the generator object.

Is there an easy way to do this, like to_iterable_dataset() but in the other direction?

Zach Moshe

1 Answer


You must cache an IterableDataset to disk to load it as a Dataset. One way to do this is with Dataset.from_generator:

from functools import partial
from datasets import Dataset

def gen_from_iterable_dataset(iterable_ds):
    # from_generator needs a callable, so wrap the IterableDataset
    # in a generator function and bind it with partial below.
    yield from iterable_ds

ds = Dataset.from_generator(
    partial(gen_from_iterable_dataset, iterable_ds),
    features=iterable_ds.features,
)
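
To mirror the question's workflow end to end, something like the following sketch should work (it reuses gen_from_iterable_dataset from above; the dataset names, _transform_record, N, and the output path are placeholders, and buffer_size is an arbitrary choice). Note that slicing isn't implemented on an IterableDataset, so take(N) limits the stream instead:

from functools import partial
from datasets import Dataset, load_dataset

N = 1000  # hypothetical sample size

# Stream, transform, buffered-shuffle, and keep only the first N records.
ds = load_dataset("XYZ", name="ABC", split="train", streaming=True)
ds = ds.map(_transform_record)  # _transform_record as in the question
ds = ds.shuffle(seed=42, buffer_size=10_000).take(N)

# Materialize the stream into a regular Dataset, which caches it to disk.
small_ds = Dataset.from_generator(
    partial(gen_from_iterable_dataset, ds),
    features=ds.features,
)
small_ds.save_to_disk("sampled_dataset")  # hypothetical output path

Later you can reload it with datasets.load_from_disk("sampled_dataset").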
HappyFace
  • That works, thanks! The only change from my pseudo-code is that the slice operator "[..]" isn't implemented on an IterableDataset, but `take()` can be used instead. – Zach Moshe Jul 15 '23 at 10:06