Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a minimal sketch of both follows this list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
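
A minimal sketch of both components, assuming only dask itself is installed: dask.delayed exercises the task scheduler directly, while dask.array is one of the parallel collections built on top of it.

    import dask
    import dask.array as da

    # Component 1: dynamic task scheduling. delayed builds a task
    # graph lazily; compute() hands the graph to the scheduler.
    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(x, y):
        return x + y

    total = add(inc(1), inc(2))
    print(total.compute())  # 5

    # Component 2: a "big data" collection that mimics the NumPy
    # interface but runs chunk-by-chunk on the same scheduler.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())
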

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 1 answer

Pickle error when submitting task using dask

I am trying to execute a simple task (an instance method) using the dask (async) framework, but it fails with a serialization error. Can someone point me in the right direction? Here is the code that I am running: from dask.distributed import Client,…

Santosh Kumar (761)
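
For context, a common cause of this error is submitting a bound method whose instance holds unpicklable state (open connections, locks, the Client itself). A minimal sketch of the usual workaround, with a hypothetical Adder class: submit a module-level function that rebuilds the object on the worker.

    from dask.distributed import Client

    class Adder:
        """Only plain attributes, so instances pickle cleanly."""
        def __init__(self, offset):
            self.offset = offset

        def run(self, x):
            return x + self.offset

    def run_task(offset, x):
        # A module-level function is always safe to submit; it
        # rebuilds the object on the worker instead of shipping it.
        return Adder(offset).run(x)

    if __name__ == "__main__":
        client = Client(processes=False)  # small in-process cluster for the demo
        future = client.submit(run_task, 10, 32)
        print(future.result())  # 42
        client.close()
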
4 votes, 1 answer

How to efficiently send a large numpy array to the cluster with Dask.array

I have a large NumPy array on my local machine that I want to parallelize with Dask.array on a cluster: import numpy as np; x = np.random.random((1000, 1000, 1000)). However, when I use dask.array, I find that my scheduler starts taking up a lot of RAM…

MRocklin (55,641)
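
One common remedy, sketched under the assumption that a distributed Client is connected: scatter the array to the workers once rather than embedding it in every task graph, then wrap the resulting future as a dask array.

    import numpy as np
    import dask.array as da
    from dask.distributed import Client

    client = Client(processes=False)  # stand-in for a real cluster

    x = np.random.random((200, 200, 200))  # smaller than the question's array

    # Ship the array to the cluster once; the scheduler then passes
    # around a lightweight future instead of the data itself.
    future = client.scatter(x)
    dx = da.from_delayed(future, shape=x.shape, dtype=x.dtype)
    dx = dx.rechunk((50, 50, 50))  # split into real chunks on the cluster

    print(dx.mean().compute())
    client.close()
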
4 votes, 1 answer

Dask: DataFrame taking forever to compute

I created a Dask dataframe from a Pandas dataframe that is ~50K rows and 5 columns: ddf = dd.from_pandas(df, npartitions=32). I then add a bunch of columns (~30) to the dataframe and try to turn it back into a Pandas dataframe: DATA =…

anon_swe (8,791)
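
A hedged sketch of what usually fixes this: with only ~50K rows, 32 partitions means per-task scheduling overhead dominates the actual work, and assigning ~30 columns one at a time inflates the graph. Fewer partitions plus a single map_partitions keep both small.

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame(np.random.random((50_000, 5)), columns=list("abcde"))

    # Few, larger partitions: per-task overhead is fixed, so tiny
    # partitions make small data slower, not faster.
    ddf = dd.from_pandas(df, npartitions=4)

    def add_columns(part):
        # All ~30 new columns in one pass over each partition.
        part = part.copy()
        for i in range(30):
            part[f"col{i}"] = part["a"] * i
        return part

    result = ddf.map_partitions(add_columns).compute()
    print(result.shape)  # (50000, 35)
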
4 votes, 1 answer

Dask Dataframe split column of list into multiple columns

The same task in Pandas can easily be done with import pandas as pd; df = pd.DataFrame({"lists": [[i, i+1] for i in range(10)]}); df[['left','right']] = pd.DataFrame([x for x in df.lists]). But I can't figure out how to do something similar with a…

rpanai (12,515)
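
One way to do the same in Dask, hedged as a sketch: apply the pandas trick per partition with map_partitions and declare the new columns via the meta argument.

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({"lists": [[i, i + 1] for i in range(10)]})
    ddf = dd.from_pandas(df, npartitions=2)

    def split_lists(part):
        part = part.copy()
        # The pandas trick, applied independently to each partition.
        part[["left", "right"]] = pd.DataFrame(part["lists"].tolist(),
                                               index=part.index)
        return part

    # meta tells dask the output schema without running the function.
    meta = {"lists": object, "left": "int64", "right": "int64"}
    print(ddf.map_partitions(split_lists, meta=meta).compute())
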
4 votes, 1 answer

"Too many memory regions" error with Dask

When using Dask with Dask array I suddenly get the following error, and my kernel dies/restarts. The console says: BLAS : Program is Terminated. Because you tried to allocate too many memory regions. I'm using Anaconda on a Mac with OpenBLAS…

MRocklin (55,641)
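
For reference, the usual explanation is thread oversubscription: each of Dask's worker threads calls into OpenBLAS, which spawns its own thread pool per call. A minimal sketch of the common fix is pinning BLAS to one thread before NumPy is imported:

    import os

    # Must happen before NumPy (and thus OpenBLAS) is first imported.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

    import dask.array as da

    # Dask already parallelizes across chunks with its own threads;
    # single-threaded BLAS per task avoids nested thread pools.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    print((x @ x.T).mean().compute())
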
4 votes, 1 answer

Add pandas series to dask dataframe

What is the idiomatic way to add a pandas series to a dask dataframe? Pandas is far more flexible for working with data, so I often bring parts of dask dataframes into memory, manipulate columns, and create new ones. I would then like to add these new…

Zelazny7 (39,946)
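
One hedged sketch of a way to do this: compute the new column in pandas, wrap it back as a dask series with the same partitioning, and assign. The column name is hypothetical.

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=4)

    # Work in pandas, as the question describes...
    s = (ddf["a"].compute() * 2).rename("b")

    # ...then wrap the series with the SAME npartitions/index so that
    # assign can align partition-for-partition without a shuffle.
    ds = dd.from_pandas(s, npartitions=4)
    print(ddf.assign(b=ds).head())
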
4 votes, 2 answers

Multiprocessing fuzzy wuzzy string search - python

I am trying to do string matching and bring over the match ID using fuzzywuzzy in Python. My dataset is huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records. What I have tried so far: first I tried to use the recordlinkage package in Python,…

ds_user (2,139)
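
As a point of reference, a minimal multiprocessing sketch with fuzzywuzzy (the names and data below are hypothetical); note that at millions of records, blocking/indexing the candidate set matters far more than raw parallelism.

    from multiprocessing import Pool
    from fuzzywuzzy import process  # pip install fuzzywuzzy

    # Hypothetical stand-ins for dataset2's name column.
    CHOICES = ["new york", "los angeles", "chicago", "houston"]

    def best_match(query):
        # extractOne returns the closest choice and its score.
        match, score = process.extractOne(query, CHOICES)
        return query, match, score

    if __name__ == "__main__":
        queries = ["new yrok", "chcago", "houstan"]  # dataset1 names
        with Pool() as pool:
            for query, match, score in pool.map(best_match, queries):
                print(f"{query!r} -> {match!r} ({score})")
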
4 votes, 1 answer

Dask graph execution and memory usage

I am constructing a very large DAG in dask to submit to the distributed scheduler, where nodes operate on dataframes which themselves can be quite large. One pattern is that I have about 50-60 functions that load data and construct pandas dataframes…

Adam Klein (476)
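
For background, hedged: the distributed scheduler releases an intermediate result as soon as all of its dependents have finished, so structuring the graph so loader outputs are reduced quickly, rather than all held for one final step, is what keeps memory bounded. A toy sketch:

    import dask
    import numpy as np
    import pandas as pd

    @dask.delayed
    def load(i):
        # Stand-in for one of the ~50-60 loader functions.
        return pd.DataFrame(np.random.random((1_000, 10)))

    @dask.delayed
    def summarize(df):
        # Reduces a large frame to a small result, so the scheduler
        # can free the frame as soon as this task finishes.
        return df.mean()

    # Each frame has exactly one dependent, so only a few frames are
    # alive at any moment instead of all eight.
    totals = [summarize(load(i)) for i in range(8)]
    results = dask.compute(*totals)
    print(len(results))  # 8 small Series, not 8 large frames
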
4 votes, 1 answer

How to manage GPU resources on a single worker in dask distributed?

I have a question about dask distributed. Assume I would like to run a set of tasks that each run on a different number of GPUs, e.g., one task runs on 2 GPUs (type A), whereas several others run on 1 GPU (type B). My understanding is that it is…

Celvin (43)
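
One mechanism that seems to fit, as a hedged sketch: Dask's abstract worker resources. Workers advertise a GPU count at startup and each task declares how many it needs; the scheduler only places a task where enough of the resource is free. The addresses and function names below are hypothetical.

    from dask.distributed import Client

    # Workers are started advertising GPUs as an abstract resource, e.g.:
    #   dask-worker scheduler:8786 --resources "GPU=2"
    client = Client("scheduler:8786")  # hypothetical scheduler address

    def train_two_gpus(name):  # a type-A task
        return f"trained {name} on 2 GPUs"

    def train_one_gpu(name):   # a type-B task
        return f"trained {name} on 1 GPU"

    # resources= is how much of "GPU" each task holds while running;
    # a type-A task and two type-B tasks cannot share a 2-GPU worker.
    a = client.submit(train_two_gpus, "model-a", resources={"GPU": 2})
    bs = [client.submit(train_one_gpu, n, resources={"GPU": 1})
          for n in ("model-b1", "model-b2")]
    print(a.result(), [b.result() for b in bs])
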
4 votes, 1 answer

Slicing for n individual elements in a dask array

Say I have a 3D dask array representing a time series of temperature for the whole U.S., [Time, Lat, Lon]. I want to get tabular time series for 100 different locations. With numpy fancy indexing this would look something like [:, [lat1, lat2...],…
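
For what it's worth, dask arrays support this pointwise fancy indexing through .vindex (a hedged sketch; note that dask moves the pointwise axis to the front of the result):

    import numpy as np
    import dask.array as da

    # [Time, Lat, Lon] temperature cube
    temps = da.random.random((365, 50, 100), chunks=(30, 50, 100))

    lats = np.array([3, 17, 42])
    lons = np.array([10, 55, 99])

    # vindex pairs lats[i] with lons[i]: one time series per location;
    # dask places the pointwise axis first, giving shape (3, 365).
    series = temps.vindex[:, lats, lons]
    print(series.shape)
    print(series.compute()[:, :5])
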
4 votes, 2 answers

call and return cv2.cvtColor in dask.array.map_blocks [OpenCV, Dask]

I am trying to perform color conversion from 3 channels to 1 channel in parallel using dask, in the hope that I can perform out-of-memory computation in the future. I use da.map_blocks. from dask.array.image import imread import…

DrSensor (487)
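
A sketch of how this typically looks, with hypothetical frame dimensions: since cvtColor removes the channel axis, map_blocks needs drop_axis so the output chunks line up.

    import cv2
    import dask.array as da

    # Hypothetical stack of RGB frames: (frame, height, width, channel),
    # one frame per chunk, as imread would produce.
    frames = da.random.randint(0, 255, size=(10, 480, 640, 3),
                               chunks=(1, 480, 640, 3)).astype("uint8")

    def to_gray(block):
        # block carries a leading frame axis of length 1; cvtColor
        # wants a plain (H, W, 3) image, so peel it off and restore it.
        return cv2.cvtColor(block[0], cv2.COLOR_RGB2GRAY)[None, ...]

    # The channel axis disappears from the output: declare it dropped.
    gray = frames.map_blocks(to_gray, drop_axis=3, dtype="uint8")
    print(gray.shape)  # (10, 480, 640)
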
4 votes, 2 answers

Does dask.dataframe's to_parquet support server-side encryption?

Our company has a requirement to encrypt all data that is at rest in S3. Usually when we upload an S3 object, we do something like: aws s3 cp a.txt s3://b/test --sse. I am playing with dask.dataframe and want to export one of my datasets into parquet…

DigitalPig (83)
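
For reference, hedged: dask itself doesn't know about SSE, but to_parquet forwards storage_options to the s3fs filesystem, and s3fs forwards s3_additional_kwargs to boto3, which is where server-side encryption is requested. The bucket and data below are hypothetical.

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)

    # storage_options -> s3fs.S3FileSystem(...); s3_additional_kwargs
    # -> the boto3 put/upload calls, which is where --sse lives.
    ddf.to_parquet(
        "s3://b/test/",  # hypothetical bucket
        storage_options={
            "s3_additional_kwargs": {"ServerSideEncryption": "AES256"},
        },
    )
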
4 votes, 0 answers

GroupBy on dask arrays

To explore dask, I am currently implementing a K-Means algorithm. To update the means, I want to use a groupby, but I have to transform my dask.array into a dask.dataframe, then convert back to a dask.array: def update(X, Label): '''Update the…

Maxime Maillot (397)
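
For the specific K-Means mean update there is an array-only alternative, sketched here: one-hot encode the labels and replace the groupby with matrix algebra.

    import dask.array as da

    def update_means(X, labels, k):
        # onehot[i, j] == 1 iff sample i is assigned to cluster j.
        onehot = (labels[:, None] == da.arange(k)[None, :]).astype(X.dtype)
        sums = onehot.T @ X                   # per-cluster feature sums
        counts = onehot.sum(axis=0)[:, None]  # per-cluster sizes
        return sums / counts                  # new means, shape (k, n_features)

    X = da.random.random((10_000, 4), chunks=(2_500, 4))
    labels = da.random.randint(0, 3, size=(10_000,), chunks=(2_500,))
    print(update_means(X, labels, 3).compute())
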
4 votes, 1 answer

How to create a Dask DataFrame from a list of URLs?

I have a list of URLs, and I'd like to read them all into a dask dataframe at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that? Here is an example: link = 'http://web.mta.info/developers/' data = [ …

Philipp_Kats (3,872)
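
It may help to know that read_csv accepts an explicit list of paths or URLs even where globbing isn't possible; the file names below are hypothetical.

    import dask.dataframe as dd

    link = "http://web.mta.info/developers/"
    files = [  # hypothetical file names under that site
        "data/nyct/turnstile/turnstile_170610.txt",
        "data/nyct/turnstile/turnstile_170603.txt",
    ]

    # No asterisk needed: a list of URLs is read lazily over http;
    # blocksize=None gives one partition per file, which avoids
    # needing byte-range support from the server.
    ddf = dd.read_csv([link + f for f in files], blocksize=None)
    print(ddf.head())
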
4 votes, 2 answers

Merge csv files using dask

I am new to Python. I am using dask to read 5 large (>1 GB) csv files and merge them (SQL-style) into a dask dataframe. Now, I am trying to write the merged result into a single csv. I used compute() on the dask dataframe to collect the data into a single df…

SRB (41)
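
A hedged sketch of the usual shape of this job (file names and join key hypothetical): keep the merge lazy, and let to_csv write the result without first collecting everything into one pandas frame.

    import dask.dataframe as dd

    left = dd.read_csv("left-*.csv")    # hypothetical inputs
    right = dd.read_csv("right-*.csv")

    merged = left.merge(right, on="key", how="inner")  # SQL-style join

    # One output file per partition avoids materializing the whole
    # result in memory with .compute():
    merged.to_csv("merged-*.csv", index=False)

    # Recent dask versions can also stream into a single file:
    # merged.to_csv("merged.csv", single_file=True, index=False)
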