
Having 500 continuously growing DataFrames, I would like to submit operations on the data (independent for each DataFrame) to Dask. My main question is: can Dask hold the continuously submitted data, so that I can submit a function over all of the submitted data, not just the newly submitted chunk?

Let me explain it with an example.

Creating a dask_server.py:

from dask.distributed import Client, LocalCluster
HOST = '127.0.0.1'
SCHEDULER_PORT = 8711
DASHBOARD_PORT = ':8710'

def run_cluster():
    cluster = LocalCluster(dashboard_address=DASHBOARD_PORT, scheduler_port=SCHEDULER_PORT, n_workers=8)
    print("DASK Cluster Dashboard = http://%s%s/status" % (HOST, DASHBOARD_PORT))
    client = Client(cluster)
    print(client)
    print("Press Enter to quit ...")
    input()

if __name__ == '__main__':
    run_cluster()

Now I can connect from my_stream.py and start to submit and gather data:

import threading
import time

import pandas as pd
from dask.distributed import Client

DASK_CLIENT_IP = '127.0.0.1'
DASK_CLIENT_PORT = 8711  # the scheduler port from dask_server.py
dask_con_string = 'tcp://%s:%s' % (DASK_CLIENT_IP, DASK_CLIENT_PORT)
dask_client = Client(dask_con_string)

def my_dask_function(lines):
    return lines['a'].mean() + lines['b'].mean()

def async_stream_redis_to_d(max_chunk_size=1000):
    while True:

        # This is a Redis queue, but it can be any queueing/file-stream/syslog source
        lines = queue_IN.get(block=True, max_chunk_size=max_chunk_size)

        futures = []
        df = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])
        futures.append(dask_client.submit(my_dask_function, df))

        result = dask_client.gather(futures)
        print(result)

        time.sleep(0.1)

if __name__ == '__main__':
    max_chunk_size = 1000
    thread_stream_data_from_redis = threading.Thread(target=async_stream_redis_to_d, args=[max_chunk_size])
    # thread_stream_data_from_redis.setDaemon(True)
    thread_stream_data_from_redis.start()
    # Let's go

This works as expected and it is really quick!!!

But next, I would like to append the lines first before the computation takes place, and I wonder whether this is possible. So in this example, I would like to calculate the mean over all lines that have been submitted so far, not only over the last submitted chunk.

Questions / Approaches:

  1. Is this cumulative calculation possible?
  2. Bad Alternative 1: I cache all lines locally and submit all of the data to the cluster every time a new chunk arrives (see the sketch right after this list). The amount of data re-sent grows with every iteration, so the overhead is roughly quadratic. I tried it; it works, but it is slow!
  3. Golden Option: Python program 1 pushes the data. Then it would be possible to connect with another client (from another Python program) to that accumulated data and move the analysis logic away from the inserting logic. I think Published Datasets are the way to go, but are they applicable for such high-speed appends?
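
For clarity, this is roughly what Bad Alternative 1 looks like (a sketch using the names from my_stream.py above; not a recommendation):

# Sketch of "Bad Alternative 1": cache everything locally and re-submit the
# full history on every new chunk.
all_lines = pd.DataFrame(columns=['a', 'b', 'c'])

while True:
    lines = queue_IN.get(block=True, max_chunk_size=max_chunk_size)
    chunk = pd.DataFrame(data=lines, columns=['a', 'b', 'c'])

    # the locally cached history keeps growing ...
    all_lines = pd.concat([all_lines, chunk], ignore_index=True)

    # ... and the whole history is serialized and shipped to the cluster again
    future = dask_client.submit(my_dask_function, all_lines)
    print(future.result())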

Maybe related: Distributed Variables, Actors

gies0r
  • I haven't read through the example fully, but my first thought is that perhaps https://streamz.readthedocs.io/ might be a good tool for your use case. streamz is a tool for handling streams with Pandas and Dask – quasiben May 14 '20 at 13:25
  • Already saw it and tried it a bit. But the documentation is **not really** clear on how to manage persistent storage inside the stream, and there are only very few real-world examples documented online (some good talks on YouTube). I am not sure if I need to involve it if I can persistently store information using `distributed dask`... So just saying that `streamz` is still a newer project and I did not find many practical examples that come close to this application. But it could be a good way, yes. – gies0r May 14 '20 at 13:37
  • Nobody wants to give it a try? Even with a bounty? Is there a better place to ask questions about `dask.distributed`? – gies0r May 18 '20 at 22:54
  • You could indeed manage futures yourself and chain them to accumulate results... but maybe you want to look at actors https://distributed.dask.org/en/latest/actors.html ? Also, streamz really *is* built for this kind of thing, whereas dask/distributed is not normally stateful (e.g., workers can come and go) – mdurant May 19 '20 at 13:52

1 Answer


Assigning a list of futures to a published dataset seems ideal to me. This is relatively cheap (everything is metadata) and you'll be up to date within a few milliseconds.

from dask.distributed import get_client

client.datasets["x"] = list_of_futures

def worker_function(...):
    # runs on a worker: fetch the published futures and gather their data
    futures = get_client().datasets["x"]
    data = get_client().gather(futures)
    ... work with data
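
For the producing side, a minimal sketch of how that list could be kept up to date from the streaming loop (using the `dask_client` and `df` names from the question; `publish_chunk` is just an illustrative name, and depending on your distributed version, re-publishing an existing name may require unpublishing it first):

list_of_futures = []

def publish_chunk(df):
    # ship the new chunk to the cluster once; only the future (metadata) stays here
    future = dask_client.scatter(df)
    list_of_futures.append(future)
    try:
        dask_client.unpublish_dataset("x")   # remove the previous entry, if any
    except KeyError:
        pass                                 # first publish, nothing to remove yet
    dask_client.datasets["x"] = list_of_futures

A second, independent client (your analysis program) can then connect to the same scheduler and read `client.datasets["x"]` to see all chunks published so far.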

As you mention, there are other systems like PubSub or Actors. From what you say, though, I suspect that futures + published datasets are the simpler and more pragmatic option.

MRocklin
  • Thanks for your answer and for Dask in general! I am not really 100% sure that I understand you correctly: so what is `list_of_futures` holding? From the code I understand that you push these `futures` using one `client` connection and then (within the actual worker) you get the `dataset` back using an inner-worker connection (due to the scheduling, the code will most likely be executed on the same worker which holds the `dataset`, so it will be very fast). But what is pushed into `list_of_futures`? – gies0r May 23 '20 at 19:25
  • Ok, so I did some load testing with `get_dataset(..)` and I pretty soon run into `"Detected race condition where multiple asynchronous clients tried entering the as_current() context manager at the same time. Please upgrade to Python 3.7+."` on Ubuntu 18, which means setting up some stuff. So just to make sure: are the datasets meant to be queried in parallel from within the workers? It looks like a race condition can also occur when multiple workers ask for different `datasets`? Or is each `dataset` completely independent from the other `datasets`? How many read operations would you assume? – gies0r May 24 '20 at 01:21
  • From reading the error message, it looks like there is a known issue with older versions of Python. If it is easy for you to upgrade, I recommend trying that. – MRocklin May 25 '20 at 15:30