Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a minimal sketch of both follows this list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
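
A minimal sketch of both components, assuming only dask itself is installed: dask.delayed exercises the task scheduler directly, while dask.array is one of the parallel collections built on top of it.

    import dask
    import dask.array as da

    # Component 1: dynamic task scheduling. delayed builds a task
    # graph lazily; compute() hands the graph to the scheduler.
    @dask.delayed
    def inc(x):
        return x + 1

    @dask.delayed
    def add(x, y):
        return x + y

    total = add(inc(1), inc(2))
    print(total.compute())  # 5

    # Component 2: a "big data" collection that mimics the NumPy
    # interface but runs chunk-by-chunk on the same scheduler.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    print(x.mean().compute())
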

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 1 answer

Pickle error when submitting task using dask

I am trying to execute a simple task (an instance method) using the dask (async) framework, but it fails with a serialization error. Can someone point me in the right direction? Here is the code that I am running: from dask.distributed import Client,…

Santosh Kumar (761)
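
For context, a common cause of this error is submitting a bound method whose instance holds unpicklable state (open connections, locks, the Client itself). A minimal sketch of the usual workaround, with a hypothetical Adder class: submit a module-level function that rebuilds the object on the worker.

    from dask.distributed import Client

    class Adder:
        """Only plain attributes, so instances pickle cleanly."""
        def __init__(self, offset):
            self.offset = offset

        def run(self, x):
            return x + self.offset

    def run_task(offset, x):
        # A module-level function is always safe to submit; it
        # rebuilds the object on the worker instead of shipping it.
        return Adder(offset).run(x)

    if __name__ == "__main__":
        client = Client(processes=False)  # small in-process cluster for the demo
        future = client.submit(run_task, 10, 32)
        print(future.result())  # 42
        client.close()
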
4 votes, 1 answer

How to efficiently send a large numpy array to the cluster with Dask.array

I have a large NumPy array on my local machine that I want to parallelize with Dask.array on a cluster: import numpy as np; x = np.random.random((1000, 1000, 1000)). However, when I use dask.array, I find that my scheduler starts taking up a lot of RAM…

MRocklin (55,641)
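
One common remedy, sketched under the assumption that a distributed Client is connected: scatter the array to the workers once rather than embedding it in every task graph, then wrap the resulting future as a dask array.

    import numpy as np
    import dask.array as da
    from dask.distributed import Client

    client = Client(processes=False)  # stand-in for a real cluster

    x = np.random.random((200, 200, 200))  # smaller than the question's array

    # Ship the array to the cluster once; the scheduler then passes
    # around a lightweight future instead of the data itself.
    future = client.scatter(x)
    dx = da.from_delayed(future, shape=x.shape, dtype=x.dtype)
    dx = dx.rechunk((50, 50, 50))  # split into real chunks on the cluster

    print(dx.mean().compute())
    client.close()
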
4 votes, 1 answer

Dask: DataFrame taking forever to compute

I created a Dask dataframe from a Pandas dataframe that is ~50K rows and 5 columns: ddf = dd.from_pandas(df, npartitions=32). I then add a bunch of columns (~30) to the dataframe and try to turn it back into a Pandas dataframe: DATA =…

anon_swe (8,791)
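
A hedged sketch of what usually fixes this: with only ~50K rows, 32 partitions means per-task scheduling overhead dominates the actual work, and assigning ~30 columns one at a time inflates the graph. Fewer partitions plus a single map_partitions keep both small.

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame(np.random.random((50_000, 5)), columns=list("abcde"))

    # Few, larger partitions: per-task overhead is fixed, so tiny
    # partitions make small data slower, not faster.
    ddf = dd.from_pandas(df, npartitions=4)

    def add_columns(part):
        # All ~30 new columns in one pass over each partition.
        part = part.copy()
        for i in range(30):
            part[f"col{i}"] = part["a"] * i
        return part

    result = ddf.map_partitions(add_columns).compute()
    print(result.shape)  # (50000, 35)
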
4 votes, 1 answer

Dask Dataframe split column of list into multiple columns

The same task in Pandas can easily be done with import pandas as pd; df = pd.DataFrame({"lists": [[i, i+1] for i in range(10)]}); df[['left','right']] = pd.DataFrame([x for x in df.lists]). But I can't figure out how to do something similar with a…

rpanai (12,515)
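
One way to do the same in Dask, hedged as a sketch: apply the pandas trick per partition with map_partitions and declare the new columns via the meta argument.

    import pandas as pd
    import dask.dataframe as dd

    df = pd.DataFrame({"lists": [[i, i + 1] for i in range(10)]})
    ddf = dd.from_pandas(df, npartitions=2)

    def split_lists(part):
        part = part.copy()
        # The pandas trick, applied independently to each partition.
        part[["left", "right"]] = pd.DataFrame(part["lists"].tolist(),
                                               index=part.index)
        return part

    # meta tells dask the output schema without running the function.
    meta = {"lists": object, "left": "int64", "right": "int64"}
    print(ddf.map_partitions(split_lists, meta=meta).compute())
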
4 votes, 1 answer

"Too many memory regions" error with Dask

When using Dask with Dask array I suddenly get the following error, and my kernel dies/restarts. The console says: BLAS : Program is Terminated. Because you tried to allocate too many memory regions. I'm using Anaconda on a Mac with OpenBLAS…

MRocklin (55,641)
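
For reference, the usual explanation is thread oversubscription: each of Dask's worker threads calls into OpenBLAS, which spawns its own thread pool per call. A minimal sketch of the common fix is pinning BLAS to one thread before NumPy is imported:

    import os

    # Must happen before NumPy (and thus OpenBLAS) is first imported.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["OPENBLAS_NUM_THREADS"] = "1"

    import dask.array as da

    # Dask already parallelizes across chunks with its own threads;
    # single-threaded BLAS per task avoids nested thread pools.
    x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))
    print((x @ x.T).mean().compute())
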
4 votes, 1 answer

Add pandas series to dask dataframe

What is the idiomatic way to add a pandas series to a dask dataframe? Pandas is far more flexible for working with data, so I often bring parts of dask dataframes into memory, manipulate columns, and create new ones. I would then like to add these new…

Zelazny7 (39,946)
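
One hedged sketch of a way to do this: compute the new column in pandas, wrap it back as a dask series with the same partitioning, and assign. The column name is hypothetical.

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=4)

    # Work in pandas, as the question describes...
    s = (ddf["a"].compute() * 2).rename("b")

    # ...then wrap the series with the SAME npartitions/index so that
    # assign can align partition-for-partition without a shuffle.
    ds = dd.from_pandas(s, npartitions=4)
    print(ddf.assign(b=ds).head())
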
4 votes, 2 answers

Multiprocessing fuzzy wuzzy string search - python

I am trying to do string matching and bring over the match ID using fuzzywuzzy in Python. My dataset is huge: dataset1 = 1.8 million records, dataset2 = 1.6 million records. What I have tried so far: first I tried to use the recordlinkage package in Python,…

ds_user (2,139)
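
As a point of reference, a minimal multiprocessing sketch with fuzzywuzzy (the names and data below are hypothetical); note that at millions of records, blocking/indexing the candidate set matters far more than raw parallelism.

    from multiprocessing import Pool
    from fuzzywuzzy import process  # pip install fuzzywuzzy

    # Hypothetical stand-ins for dataset2's name column.
    CHOICES = ["new york", "los angeles", "chicago", "houston"]

    def best_match(query):
        # extractOne returns the closest choice and its score.
        match, score = process.extractOne(query, CHOICES)
        return query, match, score

    if __name__ == "__main__":
        queries = ["new yrok", "chcago", "houstan"]  # dataset1 names
        with Pool() as pool:
            for query, match, score in pool.map(best_match, queries):
                print(f"{query!r} -> {match!r} ({score})")
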
4 votes, 1 answer

Dask graph execution and memory usage

I am constructing a very large DAG in dask to submit to the distributed scheduler, where nodes operate on dataframes which themselves can be quite large. One pattern is that I have about 50-60 functions that load data and construct pandas dataframes…

Adam Klein (476)
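
For background, hedged: the distributed scheduler releases an intermediate result as soon as all of its dependents have finished, so structuring the graph so loader outputs are reduced quickly, rather than all held for one final step, is what keeps memory bounded. A toy sketch:

    import dask
    import numpy as np
    import pandas as pd

    @dask.delayed
    def load(i):
        # Stand-in for one of the ~50-60 loader functions.
        return pd.DataFrame(np.random.random((1_000, 10)))

    @dask.delayed
    def summarize(df):
        # Reduces a large frame to a small result, so the scheduler
        # can free the frame as soon as this task finishes.
        return df.mean()

    # Each frame has exactly one dependent, so only a few frames are
    # alive at any moment instead of all eight.
    totals = [summarize(load(i)) for i in range(8)]
    results = dask.compute(*totals)
    print(len(results))  # 8 small Series, not 8 large frames
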
4 votes, 1 answer

How to manage GPU resources on a single worker in dask distributed?

I have a question about dask distributed. Assume I would like to run a set of tasks that each run on a different number of GPUs, e.g., one task runs on 2 GPUs (type A), whereas several others run on 1 GPU (type B). My understanding is that it is…

Celvin (43)
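
One mechanism that seems to fit, as a hedged sketch: Dask's abstract worker resources. Workers advertise a GPU count at startup and each task declares how many it needs; the scheduler only places a task where enough of the resource is free. The addresses and function names below are hypothetical.

    from dask.distributed import Client

    # Workers are started advertising GPUs as an abstract resource, e.g.:
    #   dask-worker scheduler:8786 --resources "GPU=2"
    client = Client("scheduler:8786")  # hypothetical scheduler address

    def train_two_gpus(name):  # a type-A task
        return f"trained {name} on 2 GPUs"

    def train_one_gpu(name):   # a type-B task
        return f"trained {name} on 1 GPU"

    # resources= is how much of "GPU" each task holds while running;
    # a type-A task and two type-B tasks cannot share a 2-GPU worker.
    a = client.submit(train_two_gpus, "model-a", resources={"GPU": 2})
    bs = [client.submit(train_one_gpu, n, resources={"GPU": 1})
          for n in ("model-b1", "model-b2")]
    print(a.result(), [b.result() for b in bs])
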
4 votes, 1 answer

Slicing for n individual elements in a dask array

Say I have a 3D dask array representing a time series of temperature for the whole U.S., [Time, Lat, Lon]. I want to get tabular time series for 100 different locations. With numpy fancy indexing this would look something like [:, [lat1, lat2...],…
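
For what it's worth, dask arrays support this pointwise fancy indexing through .vindex (a hedged sketch; note that dask moves the pointwise axis to the front of the result):

    import numpy as np
    import dask.array as da

    # [Time, Lat, Lon] temperature cube
    temps = da.random.random((365, 50, 100), chunks=(30, 50, 100))

    lats = np.array([3, 17, 42])
    lons = np.array([10, 55, 99])

    # vindex pairs lats[i] with lons[i]: one time series per location;
    # dask places the pointwise axis first, giving shape (3, 365).
    series = temps.vindex[:, lats, lons]
    print(series.shape)
    print(series.compute()[:, :5])
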
4 votes, 2 answers

call and return cv2.cvtColor in dask.array.map_blocks [OpenCV, Dask]

I am trying to perform color conversion from 3 channels to 1 channel in parallel using dask, in the hope that I can perform out-of-memory computation in the future. I use da.map_blocks. from dask.array.image import imread import…

DrSensor (487)
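
A sketch of how this typically looks, with hypothetical frame dimensions: since cvtColor removes the channel axis, map_blocks needs drop_axis so the output chunks line up.

    import cv2
    import dask.array as da

    # Hypothetical stack of RGB frames: (frame, height, width, channel),
    # one frame per chunk, as imread would produce.
    frames = da.random.randint(0, 255, size=(10, 480, 640, 3),
                               chunks=(1, 480, 640, 3)).astype("uint8")

    def to_gray(block):
        # block carries a leading frame axis of length 1; cvtColor
        # wants a plain (H, W, 3) image, so peel it off and restore it.
        return cv2.cvtColor(block[0], cv2.COLOR_RGB2GRAY)[None, ...]

    # The channel axis disappears from the output: declare it dropped.
    gray = frames.map_blocks(to_gray, drop_axis=3, dtype="uint8")
    print(gray.shape)  # (10, 480, 640)
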
4 votes, 2 answers

Does dask.dataframe's to_parquet support server-side encryption?

Our company has a requirement to encrypt all data that is at rest in S3. Usually when we upload an S3 object, we do something like: aws s3 cp a.txt s3://b/test --sse. I am playing with dask.dataframe and want to export one of my datasets into parquet…

DigitalPig (83)
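
For reference, hedged: dask itself doesn't know about SSE, but to_parquet forwards storage_options to the s3fs filesystem, and s3fs forwards s3_additional_kwargs to boto3, which is where server-side encryption is requested. The bucket and data below are hypothetical.

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"a": [1, 2, 3]}), npartitions=1)

    # storage_options -> s3fs.S3FileSystem(...); s3_additional_kwargs
    # -> the boto3 put/upload calls, which is where --sse lives.
    ddf.to_parquet(
        "s3://b/test/",  # hypothetical bucket
        storage_options={
            "s3_additional_kwargs": {"ServerSideEncryption": "AES256"},
        },
    )
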
4 votes, 0 answers

GroupBy on dask arrays

To explore dask, I am currently implementing a K-Means algorithm. To update the means, I want to use a groupby, but I have to transform my dask.array into a dask.dataframe, then convert back to a dask.array: def update(X, Label): '''Update the…

Maxime Maillot (397)
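
For the specific K-Means mean update there is an array-only alternative, sketched here: one-hot encode the labels and replace the groupby with matrix algebra.

    import dask.array as da

    def update_means(X, labels, k):
        # onehot[i, j] == 1 iff sample i is assigned to cluster j.
        onehot = (labels[:, None] == da.arange(k)[None, :]).astype(X.dtype)
        sums = onehot.T @ X                   # per-cluster feature sums
        counts = onehot.sum(axis=0)[:, None]  # per-cluster sizes
        return sums / counts                  # new means, shape (k, n_features)

    X = da.random.random((10_000, 4), chunks=(2_500, 4))
    labels = da.random.randint(0, 3, size=(10_000,), chunks=(2_500,))
    print(update_means(X, labels, 3).compute())
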
4 votes, 1 answer

How to create a Dask DataFrame from a list of URLs?

I have a list of URLs, and I'd like to read them all into a dask dataframe at once, but it looks like read_csv can't use an asterisk for http. Is there any way to achieve that? Here is an example: link = 'http://web.mta.info/developers/' data = [ …

Philipp_Kats (3,872)
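
It may help to know that read_csv accepts an explicit list of paths or URLs even where globbing isn't possible; the file names below are hypothetical.

    import dask.dataframe as dd

    link = "http://web.mta.info/developers/"
    files = [  # hypothetical file names under that site
        "data/nyct/turnstile/turnstile_170610.txt",
        "data/nyct/turnstile/turnstile_170603.txt",
    ]

    # No asterisk needed: a list of URLs is read lazily over http;
    # blocksize=None gives one partition per file, which avoids
    # needing byte-range support from the server.
    ddf = dd.read_csv([link + f for f in files], blocksize=None)
    print(ddf.head())
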
4 votes, 2 answers

Merge csv files using dask

I am new to Python. I am using dask to read 5 large (>1 GB) csv files and merge them (SQL-style) into a dask dataframe. Now, I am trying to write the merged result into a single csv. I used compute() on the dask dataframe to collect the data into a single df…

SRB (41)
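
A hedged sketch of the usual shape of this job (file names and join key hypothetical): keep the merge lazy, and let to_csv write the result without first collecting everything into one pandas frame.

    import dask.dataframe as dd

    left = dd.read_csv("left-*.csv")    # hypothetical inputs
    right = dd.read_csv("right-*.csv")

    merged = left.merge(right, on="key", how="inner")  # SQL-style join

    # One output file per partition avoids materializing the whole
    # result in memory with .compute():
    merged.to_csv("merged-*.csv", index=False)

    # Recent dask versions can also stream into a single file:
    # merged.to_csv("merged.csv", single_file=True, index=False)
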