Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
1 answer

Dask scatter broadcast a list

What is the appropriate way to scatter-broadcast a list using Dask distributed? Case 1, wrapping the list: [future_list] = client.scatter([my_list], broadcast=True) Case 2, not wrapping the list: future_list = client.scatter(my_list,…
Thomas Moerman
  • 882
  • 8
  • 16
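The difference between the two cases can be demonstrated on an in-process cluster (a sketch; client.scatter treats a bare list as a collection of items to scatter individually, so broadcasting the list as a single object requires wrapping it):

```python
from dask.distributed import Client

client = Client(processes=False)  # in-process cluster, for illustration
my_list = [1, 2, 3]

# Case 1: wrap the list -> ONE future holding the whole list,
# replicated to every worker because of broadcast=True.
[future_list] = client.scatter([my_list], broadcast=True)
print(future_list.result())  # [1, 2, 3]

# Case 2: no wrapping -> scatter treats my_list as a collection
# and returns one future PER ELEMENT.
futures = client.scatter(my_list)
print(len(futures))  # 3

client.close()
```

So case 1 is the one that broadcasts the list as a unit; case 2 scatters its elements separately.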
3
votes
1 answer

Handling large, compressed csv files with Dask

The setup is that I have eight large CSV files (32 GB each) which are compressed with zip to 8 GB files each. I cannot work with the uncompressed data as I want to save disk space and do not have 32*8 GB of space left. I cannot load one file with e.g.…
tobiasraabe
  • 427
  • 1
  • 6
  • 12
3
votes
1 answer

Parallel learning using Dask

Scikit-Learn already provides parallel computing on a single machine with Joblib. But I want to use Dask; how can I achieve this? from dask.distributed import Client client = Client() How do I proceed from this?
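One way to proceed from a bare Client (a sketch; assumes joblib and distributed are installed) is joblib's dask backend: any joblib-backed work, including scikit-learn estimators with n_jobs, dispatched inside the context manager is shipped to the cluster behind the client.

```python
import joblib
from dask.distributed import Client

client = Client(processes=False)  # or point at a real scheduler

def square(x):
    return x * x  # stand-in for a model-fitting task

# scikit-learn parallelism goes through joblib, so wrapping e.g.
# grid_search.fit(...) in this context manager sends it to Dask.
with joblib.parallel_backend("dask"):
    results = joblib.Parallel()(joblib.delayed(square)(i) for i in range(5))

print(results)  # [0, 1, 4, 9, 16]
client.close()
```

The same pattern applies verbatim to GridSearchCV or RandomizedSearchCV: call .fit() inside the with-block.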
3
votes
1 answer

How to map a dask Series with a large dict

I'm trying to figure out the best way to map a dask Series with a large mapping. The straightforward series.map(large_mapping) issues UserWarning: Large object of size MB detected in task graph and suggests using client.scatter and client.submit…
gsakkis
  • 1,569
  • 1
  • 15
  • 24
3
votes
1 answer

dask future not updating according to progress

Problem: I am submitting functions to the Dask client and recording the futures' keys. I am using these keys to instantiate the futures in a different function. These futures are stuck in "pending" mode. Backbone of the program I am trying: from…
Dror Hilman
  • 6,837
  • 9
  • 39
  • 56
3
votes
2 answers

Distributed file systems supported by Python/Dask

Which distributed file systems are supported by Dask? Specifically, from which file systems can one read dask.dataframes? From the Dask documentation I can see that HDFS is certainly supported. Are any other distributed file systems supported,…
S.V
  • 2,149
  • 2
  • 18
  • 41
3
votes
1 answer

How to sort index in Dask following pivot_table

Trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this: # make dataframe, first in pandas and then in dask df = pd.DataFrame({'A':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c',…
benten
  • 1,995
  • 2
  • 23
  • 38
3
votes
1 answer

Can we create a Dask cluster having both multiple CPU machines and multiple GPU machines?

Can we create a dask-cluster with some CPU and some GPU machines together? If yes, how do we ensure that a certain task runs only on a CPU machine, or some other type of task only on a GPU machine, and, if not specified, that it picks whichever…
TheCodeCache
  • 820
  • 1
  • 7
  • 27
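Yes; the usual mechanism is abstract worker resources: GPU workers are started with a resource tag, and tasks that need a GPU request that tag at submit time, while untagged tasks run anywhere. A sketch (the LocalCluster stands in for real machines; in production the GPU machines would run `dask-worker tcp://scheduler:8786 --resources "GPU=1"` and the CPU machines plain `dask-worker`):

```python
from dask.distributed import Client, LocalCluster

# Stand-in for a mixed cluster: one worker that advertises a GPU resource.
cluster = LocalCluster(n_workers=1, processes=False, resources={"GPU": 1})
client = Client(cluster)

def train(x):
    return x * 2  # stand-in for a GPU-bound task

# resources={"GPU": 1} restricts this task to workers advertising a GPU;
# tasks submitted without a resource constraint can run on any worker.
fut = client.submit(train, 21, resources={"GPU": 1})
print(fut.result())  # 42

client.close()
cluster.close()
```

The resource names are arbitrary labels; the scheduler only matches a task's requested counts against what each worker advertises.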
3
votes
1 answer

Why is multiprocessing slower than a simple computation in Pandas?

This is related to how to parallelize many (fuzzy) string comparisons using apply in Pandas? Consider this simple (but funny) example again: import dask.dataframe as dd import dask.multiprocessing import dask.threaded from fuzzywuzzy import…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
3
votes
2 answers

pandas groupby performance

I'm running on a workstation with lots of RAM (190 GB). We need to group by on datasets with millions of records [normally with 2 ID columns, 1 type ID column, 1 date column and 3-5 categorical columns] (between 10-30 M), while generating a list of…
skibee
  • 1,279
  • 1
  • 17
  • 37
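The "list of values per group" part of this can be expressed directly in plain pandas with .agg(list), which is worth benchmarking before reaching for anything parallel (a sketch with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 2, 2, 2],
    "type": ["a", "a", "b", "b", "b"],
    "cat":  ["x", "y", "x", "z", "z"],
})

# one list of categorical values per (id, type) group
out = df.groupby(["id", "type"])["cat"].agg(list)
print(out)
```

Building Python lists is inherently single-threaded and memory-hungry, which is why this particular aggregation often stays slow even with plenty of RAM.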
3
votes
1 answer

Pandas Apply - Return Multiple Rows

I have two data frames and I need to compare the full combinations of rows and return those combinations that meet a criteria. This turns out to be too intensive for our small cluster with Spark (using a cross join) so I am experimenting with this…
B_Miner
  • 1,840
  • 4
  • 31
  • 66
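In pandas 1.2+ the full combination of rows can be written as a cross merge, with the criteria applied as an ordinary filter afterwards (a sketch with made-up frames; in dask the same effect is usually obtained by assigning a constant key column and merging on it):

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [10, 20, 30]})

# full combination of rows (2 x 3 = 6 combinations)
combos = left.merge(right, how="cross")

# then keep only the combinations meeting a criterion
hits = combos[combos["a"] * 10 == combos["b"]]
print(hits)
```

Filtering after the merge keeps the logic simple, at the cost of materializing the full cross product; for large inputs the constant-key trick on partitioned dask frames bounds that blow-up per partition.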
3
votes
1 answer

How to put a dataset on a gcloud kubernetes cluster?

I have a gcloud Kubernetes cluster initialized, and I'm using a Dask Client on my local machine to connect to the cluster, but I can't seem to find any documentation on how to upload my dataset to the cluster. I originally tried to just run Dask…
3
votes
1 answer

How to do row processing and item assignment in Dask

Similar unanswered question: Row by row processing of a Dask DataFrame. I'm working with dataframes that are millions of rows long, and so now I'm trying to have all dataframe operations performed in parallel. One such operation I need converted to…
shellcat_zero
  • 1,027
  • 13
  • 20
3
votes
0 answers

Dask doesn't appear to take advantage of Cores/CPU

I'm considering dask for a project. I have a large 3GB csv file with 50,000 variables and 60,000 records. I won't know what fields I need until runtime. I need to apply a function a million times to the dataset. I've used dask delayed to apply…
user204548
  • 25
  • 1
  • 5
3
votes
1 answer

Left merging dask dataframes results to empty dataframe

I have the following code: raw_data = pd.DataFrame({'username':list('ab')*10, 'user_agent': list('cdef')*5, 'method':['POST'] * 20, 'dst_port':[80]*20, 'dst':['1.1.1.1']*20}) past = pd.DataFrame({'user_agent':list('cde'), 'percent':[0.3, 0.3,…
Apostolos
  • 7,763
  • 17
  • 80
  • 150