Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
1 answer

Dask scatter broadcast a list

What is the appropriate way to scatter-broadcast a list using Dask distributed? Case 1, wrapping the list: [future_list] = client.scatter([my_list], broadcast=True) Case 2, not wrapping the list: future_list = client.scatter(my_list,…
Thomas Moerman
  • 882
  • 8
  • 16
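The difference between the two cases can be demonstrated on an in-process cluster (a sketch; client.scatter treats a bare list as a collection of items to scatter individually, so broadcasting the list as a single object requires wrapping it):

```python
from dask.distributed import Client

client = Client(processes=False)  # in-process cluster, for illustration
my_list = [1, 2, 3]

# Case 1: wrap the list -> ONE future holding the whole list,
# replicated to every worker because of broadcast=True.
[future_list] = client.scatter([my_list], broadcast=True)
print(future_list.result())  # [1, 2, 3]

# Case 2: no wrapping -> scatter treats my_list as a collection
# and returns one future PER ELEMENT.
futures = client.scatter(my_list)
print(len(futures))  # 3

client.close()
```

So case 1 is the one that broadcasts the list as a unit; case 2 scatters its elements separately.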
3
votes
1 answer

Handling large, compressed csv files with Dask

The setup is that I have eight large CSV files (32 GB each) which are compressed with zip to 8 GB files each. I cannot work with the uncompressed data as I want to save disk space and do not have 32*8 GB of space left. I cannot load one file with e.g.…
tobiasraabe
  • 427
  • 1
  • 6
  • 12
3
votes
1 answer

Parallel learning using Dask

Scikit-Learn already provides parallel computing on a single machine with Joblib. But I want to use Dask; how can I achieve this? from dask.distributed import Client client = Client() How do I proceed from this?
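One way to proceed from a bare Client (a sketch; assumes joblib and distributed are installed) is joblib's dask backend: any joblib-backed work, including scikit-learn estimators with n_jobs, dispatched inside the context manager is shipped to the cluster behind the client.

```python
import joblib
from dask.distributed import Client

client = Client(processes=False)  # or point at a real scheduler

def square(x):
    return x * x  # stand-in for a model-fitting task

# scikit-learn parallelism goes through joblib, so wrapping e.g.
# grid_search.fit(...) in this context manager sends it to Dask.
with joblib.parallel_backend("dask"):
    results = joblib.Parallel()(joblib.delayed(square)(i) for i in range(5))

print(results)  # [0, 1, 4, 9, 16]
client.close()
```

The same pattern applies verbatim to GridSearchCV or RandomizedSearchCV: call .fit() inside the with-block.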
3
votes
1 answer

How to map a dask Series with a large dict

I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size MB detected in task graph and suggests using client.scatter and client.submit…
gsakkis
  • 1,569
  • 1
  • 15
  • 24
3
votes
1 answer

dask future not updating according to progress

Problem: I am submitting functions to the Dask client and recording the futures' keys. I am using these keys to instantiate the futures in a different function. These futures are stuck in "pending" mode. Backbone of the program I am trying: from…
Dror Hilman
  • 6,837
  • 9
  • 39
  • 56
3
votes
2 answers

Distributed file systems supported by Python/Dask

Which distributed file systems are supported by Dask? Specifically, from which file systems can one read dask.dataframes? From the Dask documentation I can see that HDFS is certainly supported. Are any other distributed file systems supported,…
S.V
  • 2,149
  • 2
  • 18
  • 41
3
votes
1 answer

How to sort index in Dask following pivot_table

Trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this: # make dataframe, first in pandas and then in dask df = pd.DataFrame({'A':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c',…
benten
  • 1,995
  • 2
  • 23
  • 38
3
votes
1 answer

Can we create a Dask cluster having both multiple CPU machines and multiple GPU machines?

Can we create a dask-cluster with some CPU and some GPU machines together? If yes, how do we ensure that a certain task runs only on a CPU machine, or some other type of task only on a GPU machine, and, if not specified, that it picks whichever…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
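Yes; the usual mechanism is abstract worker resources: GPU workers are started with a resource tag, and tasks that need a GPU request that tag at submit time, while untagged tasks run anywhere. A sketch (the LocalCluster stands in for real machines; in production the GPU machines would run `dask-worker tcp://scheduler:8786 --resources "GPU=1"` and the CPU machines plain `dask-worker`):

```python
from dask.distributed import Client, LocalCluster

# Stand-in for a mixed cluster: one worker that advertises a GPU resource.
cluster = LocalCluster(n_workers=1, processes=False, resources={"GPU": 1})
client = Client(cluster)

def train(x):
    return x * 2  # stand-in for a GPU-bound task

# resources={"GPU": 1} restricts this task to workers advertising a GPU;
# tasks submitted without a resource constraint can run on any worker.
fut = client.submit(train, 21, resources={"GPU": 1})
print(fut.result())  # 42

client.close()
cluster.close()
```

The resource names are arbitrary labels; the scheduler only matches a task's requested counts against what each worker advertises.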
3
votes
1 answer

Why is multiprocessing slower than a simple computation in Pandas?

This is related to how to parallelize many (fuzzy) string comparisons using apply in Pandas? Consider this simple (but funny) example again: import dask.dataframe as dd import dask.multiprocessing import dask.threaded from fuzzywuzzy import…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
3
votes
2 answers

pandas groupby performance

I'm running on a workstation with lots of RAM (190 GB). We need to group by on datasets with millions of records [normally with 2 ID columns, 1 type ID column, 1 date column and 3-5 categorical columns] (between 10-30 M), while generating a list of…
skibee
  • 1,279
  • 1
  • 17
  • 37
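The "list of values per group" part of this can be expressed directly in plain pandas with .agg(list), which is worth benchmarking before reaching for anything parallel (a sketch with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2],
    "type": ["a", "a", "b", "b", "b"],
    "cat":  ["x", "y", "x", "z", "z"],
})

# one list of categorical values per (id, type) group
out = df.groupby(["id", "type"])["cat"].agg(list)
print(out)
```

Building Python lists is inherently single-threaded and memory-hungry, which is why this particular aggregation often stays slow even with plenty of RAM.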
3
votes
1 answer

Pandas Apply - Return Multiple Rows

I have two data frames and I need to compare the full combinations of rows and return those combinations that meet a criteria. This turns out to be too intensive for our small cluster with Spark (using a cross join) so I am experimenting with this…
B_Miner
  • 1,840
  • 4
  • 31
  • 66
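In pandas 1.2+ the full combination of rows can be written as a cross merge, with the criteria applied as an ordinary filter afterwards (a sketch with made-up frames; in dask the same effect is usually obtained by assigning a constant key column and merging on it):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [10, 20, 30]})

# full combination of rows (2 x 3 = 6 combinations)
combos = left.merge(right, how="cross")

# then keep only the combinations meeting a criterion
hits = combos[combos["a"] * 10 == combos["b"]]
print(hits)
```

Filtering after the merge keeps the logic simple, at the cost of materializing the full cross product; for large inputs the constant-key trick on partitioned dask frames bounds that blow-up per partition.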
3
votes
1 answer

How to put a dataset on a gcloud kubernetes cluster?

I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster. I originally tried to just run Dask…
3
votes
1 answer

How to do row processing and item assignment in Dask

Similar unanswered question: Row by row processing of a Dask DataFrame. I'm working with dataframes that are millions of rows long, and so now I'm trying to have all dataframe operations performed in parallel. One such operation I need converted to…
shellcat_zero
  • 1,027
  • 13
  • 20
3
votes
0 answers

Dask doesn't appear to take advantage of Cores/CPU

I'm considering dask for a project. I have a large 3GB csv file with 50,000 variables and 60,000 records. I won't know what fields I need until runtime. I need to apply a function a million times to the dataset. I've used dask delayed to apply…
user204548
  • 25
  • 1
  • 5
3
votes
1 answer

Left merging dask dataframes results to empty dataframe

I have the following code: raw_data = pd.DataFrame({'username':list('ab')*10, 'user_agent': list('cdef')*5, 'method':['POST'] * 20, 'dst_port':[80]*20, 'dst':['1.1.1.1']*20}) past = pd.DataFrame({'user_agent':list('cde'), 'percent':[0.3, 0.3,…
Apostolos
  • 7,763
  • 17
  • 80
  • 150