I want to lazily create a Dask dataframe from a generator, which looks something like:
[parser.read(local_file_name) for local_file_name in repo.download_files()]
Both parser.read and repo.download_files return generators (using yield). parser.read yields one dictionary of key-value pairs per row; if I were just using plain pandas, I would collect each dictionary into a list and then use:
df = pd.DataFrame(parsed_rows)
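For concreteness, here is a minimal sketch of that plain-pandas version. The generators below are stand-ins for my actual repo.download_files and parser.read (those names and the sample rows are just illustrative):

```python
import pandas as pd

def download_files():
    # stand-in: my real version yields local file paths as downloads finish
    yield "file_a"
    yield "file_b"

def read(local_file_name):
    # stand-in: my real version yields one dict of key-value pairs per parsed row
    yield {"file": local_file_name, "value": 1}
    yield {"file": local_file_name, "value": 2}

# Eagerly drain every generator into one list, then build a single DataFrame.
# This is exactly what I want to avoid: it holds all rows in memory at once.
parsed_rows = [row for f in download_files() for row in read(f)]
df = pd.DataFrame(parsed_rows)
```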
What's the best way to create a Dask dataframe from this? My constraints are that a) I don't necessarily know the number of results returned, and b) I don't know how much memory the machine it will be deployed on has.
Alternatively, what should I be doing differently (e.g. maybe create a bunch of pandas dataframes and then put those into Dask instead)?
Thanks.