
I would like to append data to a published Dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations.

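Roughly what I have in mind for the reading side (just a sketch; the scheduler address and dataset name are placeholders):

    import time
    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")   # connect to the cluster

    while True:
        df = client.get_dataset("latest")     # published dask dataframe
        # ... do some further operations on df ...
        time.sleep(1.0)                       # fetch once per second
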
  1. Would that be possible?
  2. Which append interface should be used? Should I load the data into a pd.DataFrame first, or is some text importer better?
  3. What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows per second?
  4. Are there other good suggestions for exchanging large, rapidly updating datasets within a Dask cluster?

Thanks for any tips and advice.

gies0r

1 Answer


You have a few options here.

What append speeds can be assumed? Is it possible to append, let's say, 1k/10k rows per second?

Dask is just tracking remote data. The speed of your application has much more to do with how you choose to represent that data (e.g. Python lists vs. pandas dataframes) than with Dask itself. Dask can handle thousands of tasks per second, and each of those tasks could carry a single row or millions of rows; it's up to how you build it.
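For illustration, here is a minimal sketch of one option: drain the queue into a pandas batch, append it to the published collection as a new partition, and republish under the same name. It assumes a running distributed scheduler and a Redis list of JSON records; the queue key, dataset name, and poll interval are placeholders.

    import json
    import time

    import pandas as pd
    import redis
    import dask.dataframe as dd
    from dask.distributed import Client

    REDIS_KEY = "rows"        # placeholder queue key
    DATASET_NAME = "latest"   # placeholder dataset name

    client = Client("tcp://scheduler:8786")
    r = redis.Redis()

    while True:
        # Atomically read and clear whatever is currently queued.
        pipe = r.pipeline()
        pipe.lrange(REDIS_KEY, 0, -1)
        pipe.delete(REDIS_KEY)
        raw, _ = pipe.execute()

        if raw:
            batch = pd.DataFrame([json.loads(item) for item in raw])
            new_part = dd.from_pandas(batch, npartitions=1)

            # Append the batch as an extra partition and republish.
            if DATASET_NAME in client.list_datasets():
                combined = dd.concat([client.get_dataset(DATASET_NAME), new_part])
                client.unpublish_dataset(DATASET_NAME)
            else:
                combined = new_part

            combined = client.persist(combined)
            client.publish_dataset(**{DATASET_NAME: combined})

        time.sleep(1.0)

Other clients then pick up the newest version simply by calling client.get_dataset("latest") again; how large each appended partition is (one row or millions) is entirely up to how you batch the queue.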

MRocklin