
I would like to append data to a published Dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations.

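Roughly what I have in mind for the reading side (just a sketch; the scheduler address and dataset name are placeholders):

    import time
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")   # connect to the cluster

    while True:
        df = client.get_dataset("latest")     # published dask dataframe
        # ... do some further operations on df ...
        time.sleep(1.0)                       # fetch once per second
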
  1. Would that be possible?
  2. Which append interface should be used? Should I load the data into a pd.DataFrame first, or is some text importer better?
  3. What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows per second?
  4. Are there other good suggestions for exchanging large, rapidly updating datasets within a Dask cluster?

Thanks for any tips and advice.

gies0r

1 Answer


You have a few options here.

What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows per second?

Dask is just tracking remote data. The speed of your application has much more to do with how you choose to represent that data (e.g. Python lists vs. pandas dataframes) than with Dask itself. Dask can handle thousands of tasks per second, and each of those tasks could carry a single row or millions of rows; it's up to how you build it.
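For illustration, here is a minimal sketch of one option: drain the queue into a pandas batch, append it to the published collection as a new partition, and republish under the same name. It assumes a running distributed scheduler and a Redis list of JSON records; the queue key, dataset name, and poll interval are placeholders.

    import json
    import time

    import pandas as pd
    import redis
    import dask.dataframe as dd
    from dask.distributed import Client

    REDIS_KEY = "rows"        # placeholder queue key
    DATASET_NAME = "latest"   # placeholder dataset name

    client = Client("tcp://scheduler:8786")
    r = redis.Redis()

    while True:
        # Atomically read and clear whatever is currently queued.
        pipe = r.pipeline()
        pipe.lrange(REDIS_KEY, 0, -1)
        pipe.delete(REDIS_KEY)
        raw, _ = pipe.execute()

        if raw:
            batch = pd.DataFrame([json.loads(item) for item in raw])
            new_part = dd.from_pandas(batch, npartitions=1)

            # Append the batch as an extra partition and republish.
            if DATASET_NAME in client.list_datasets():
                combined = dd.concat([client.get_dataset(DATASET_NAME), new_part])
                client.unpublish_dataset(DATASET_NAME)
            else:
                combined = new_part

            combined = client.persist(combined)
            client.publish_dataset(**{DATASET_NAME: combined})

        time.sleep(1.0)

Other clients then pick up the newest version simply by calling client.get_dataset("latest") again; how large each appended partition is (one row or millions) is entirely up to how you batch the queue.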

MRocklin