Questions tagged [dask-distributed]

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate-sized clusters.

1090 questions
0
votes
1 answer

Exception raised when using client.scatter(df) in Dask.distributed

I'm working with Dask on Kubernetes using the Helm Chart in the stable/dask repository. When using the distributed Client and calling client.scatter(ddf), I'm getting an Exception as follows: Exception: No module named…
GHayes
0
votes
1 answer

How to programmatically get the Dask-YARN UI url

I am using Dask YARN to create an application like this: spec = skein.ApplicationSpec( ... ) cluster = YarnCluster.from_specification(spec) client = Client(cluster). Ordinarily I'd then run yarn application -list from the command line and get the…
gallamine
0
votes
1 answer

How can I combine sequential as well as parallel execution of delayed function calls?

I am stuck in a strange place. I have a bunch of delayed function calls that I want to execute in a certain order. While executing in parallel is trivial: res = client.compute([myfuncs]) res = client.gather(res) I can't seem to find a way to…
suvayu
0
votes
1 answer

Submit dask arrays to distributed client while using results at the same time

I have dask arrays that represent frames of a video and want to create multiple video files. I'm using the imageio library, which allows me to "append" the frames to an ffmpeg subprocess. So I may have something like this: my_frames = [[arr1f1,…
djhoese
0
votes
1 answer

Dask workers on Kubernetes cannot find csv file

I have set up Dask and JupyterHub on a Kubernetes cluster using Helm with the help of the Dask documentation: http://docs.dask.org/en/latest/setup/kubernetes.html. Everything deployed fine and I can access the JupyterLab. Then I've created a notebook…
Stanko
0
votes
1 answer

Dask dataframe reshuffling on many parquet files

I have a dask cluster spread across many worker nodes. I also have an S3 bucket with many parquet files (right now 500k files, possibly three times as many in the future). The data in the parquet is mostly text: [username, first_name, last_name,…
t_z
0
votes
1 answer

How to create a dask dataframe from a data string separated by tabs and new line characters

I have my data in the form of a string separated by the \ character (for columns) and by the new line \n character (for rows). ID\Product\quantity\n1\xx\2 It looks like Dask.array.from_array() supports only an array as input. Although I can convert the above text to…
0
votes
1 answer

dask can not read the file that pandas can

I have a csv file that can be read using pandas but fails with a dask dataframe. I am using the exact same parameters and still getting an error with dask. Pandas use case: import pandas as pd mycols = ['id', 'tran_id', 'client_id', 'm_text', 'retry',…
shantanuo
0
votes
1 answer

disable errors while reading csv file

Does dask dataframe pass the error_bad_lines parameter to pandas? In other words, this does not seem to work, because I get an error when I try to run a groupby query. df = dd.read_csv('s3://todel162xx/some.csv', error_bad_lines=False,…
shantanuo
0
votes
1 answer

can not load large files using aws-fargate ecs

I tried to follow the instructions on this page: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4 and got 2 errors. One is related to the IAM role and the other is…
shantanuo
0
votes
1 answer

Dask.distributed cluster administration

I'm setting up a Dask Python cluster at work (30 machines, 8 cores each on average). People use only a portion of their CPU power, so dask-workers will be running in the background at low priority. All workers are listening to dask-scheduler on my master…
stkubr
0
votes
1 answer

Send SIGTERM to the running task, dask distributed

When I submit a small TensorFlow training job as a single task, it launches additional threads. When I press Ctrl+C and raise KeyboardInterrupt, my task is closed but the underlying threads are not cleaned up and training continues. Initially, I was thinking…
0
votes
1 answer

Dask dashboard not starting when starting scheduler with api

I've set up a distributed system using dask. When I start the scheduler using the Python API, the dask scheduler doesn't mention starting the dashboard. As expected, I cannot reach it at the address where I would expect it to be. Since bokeh is…
mathivh
0
votes
2 answers

scrapy getting stuck after some time

I have a master-worker network on AWS EC2 using the dask distributed library. For now I have one master machine and one worker machine. The master has a REST API (Flask) for scheduling scrapy jobs on the worker machine. I am using docker for both master and…
0
votes
1 answer

dask distributed, fails to start worker

There are cases where it seems the dask cluster hangs upon restart. To simulate this I have written this stupid code: import contextlib2 from distributed import Client, LocalCluster for i in xrange(100): print i with…
sami