I want to lazily create a Dask dataframe from a generator, which looks something like:
[parser.read(local_file_name) for local_file_name in repo.download_files()]
Both parser.read and repo.download_files return generators (using yield). parser.read yields one dictionary of key-value pairs per row; if I were just using plain pandas, I would collect each dictionary into a list and then use:
df = pd.DataFrame(parsed_rows)
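For concreteness, here is a minimal sketch of that plain-pandas version. The generators below are stand-ins for my actual repo.download_files and parser.read (those names and the sample rows are just illustrative):

```python
import pandas as pd

def download_files():
    # stand-in: my real version yields local file paths as downloads finish
    yield "file_a"
    yield "file_b"

def read(local_file_name):
    # stand-in: my real version yields one dict of key-value pairs per parsed row
    yield {"file": local_file_name, "value": 1}
    yield {"file": local_file_name, "value": 2}

# Eagerly drain every generator into one list, then build a single DataFrame.
# This is exactly what I want to avoid: it holds all rows in memory at once.
parsed_rows = [row for f in download_files() for row in read(f)]
df = pd.DataFrame(parsed_rows)
```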
What's the best way to create a Dask dataframe from this? My constraints are that a) I don't necessarily know the number of results returned, and b) I don't know how much memory the machine it will be deployed on has.
Alternatively, what should I be doing differently (e.g. maybe create a bunch of pandas dataframes and then put those into Dask instead)?
Thanks.