Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
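The two components above can be sketched in a few lines — a minimal, illustrative example assuming a local `dask` install (values and shapes are made up for the demo):

```python
import dask
import dask.array as da

# 1. Dynamic task scheduling: dask.delayed builds a task graph lazily.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))   # nothing has executed yet
print(total.compute())        # → 5

# 2. "Big Data" collections: dask.array mirrors the NumPy interface
# but splits work into chunks that the scheduler runs in parallel.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.sum().compute())      # → 1000000.0
```

Both `compute()` calls hand the graph to the same dynamic scheduler, which is what lets the collections scale past memory.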

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2 votes · 1 answer

Efficient n-body simulation with dask

An N-body simulation is used to simulate the dynamics of a physical system involving particle interactions, or of a problem reduced to some kind of particles with physical meaning. A particle could be a gas molecule or a star in a galaxy. Dask.bag…
2 votes · 1 answer

flatMap in dask

Many functional languages define a flatMap function, which works like map but can also flatten the returned values. Spark/PySpark has it (http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.flatMap). What would be the best way to have it in…
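One straightforward way to emulate Spark's flatMap in Dask is `Bag.map` followed by `Bag.flatten`, both of which are part of the `dask.bag` API — a small sketch with made-up input data:

```python
import dask.bag as db

b = db.from_sequence(["a b", "c d e"])

# flatMap ≡ map, then flatten one level of nesting.
words = b.map(str.split).flatten()
print(words.compute())  # → ['a', 'b', 'c', 'd', 'e']
```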
Yu Chem
2 votes · 1 answer

Dask/hdf5: Read by group?

I must read in and operate independently over many chunks of a large dataframe/numpy array. However, these chunks are chosen in a specific, non-uniform manner and are broken naturally into groups within an HDF5 file. Each group is small enough to fit…
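One possible pattern for this: submit one `dask.delayed` task per HDF5 group, with each task opening the file itself so tasks stay independent. This is a hedged sketch, not the canonical answer; the group/dataset names (`grp0`, `data`) and the reduction (`sum`) are made up for illustration, and it assumes `h5py` is installed:

```python
import dask
import h5py
import numpy as np

def process_group(path, group):
    # Each task opens the file independently, so tasks can run in parallel.
    with h5py.File(path, "r") as f:
        arr = f[group]["data"][:]
    return arr.sum()

# Build a tiny example file so the sketch is self-contained.
path = "groups.h5"
with h5py.File(path, "w") as f:
    for i in range(3):
        f.create_group(f"grp{i}").create_dataset("data", data=np.arange(i + 1))

tasks = [dask.delayed(process_group)(path, g) for g in ["grp0", "grp1", "grp2"]]
results = dask.compute(*tasks)
print(results)  # per-group sums: (0, 1, 3)
```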
Eric Kaschalk
2 votes · 1 answer

How to find why a task fails in dask distributed?

I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the Executor.map function sometimes fail, while others, seemingly identical, run successfully. Does the framework provide any means to diagnose…
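For context, `dask.distributed` futures do carry their failure information: `Future.status`, `Future.exception()`, and `Future.traceback()` are the standard hooks. A minimal sketch using an in-process cluster (the failing function here is invented for the demo):

```python
from dask.distributed import Client

def bad(x):
    return x / 0  # deliberately fails

client = Client(processes=False)  # in-process cluster, just for the demo
future = client.submit(bad, 1)

exc = future.exception()          # blocks until the task finishes, returns the remote exception
print(future.status)              # → 'error'
print(type(exc).__name__)         # → ZeroDivisionError
client.close()
```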
wl2776
2 votes · 0 answers

Error in Dask connecting to HDFS

I was trying to connect to HDFS using dask, following the blog; then I installed hdfs3 from the docs using conda. When I import hdfs3 it gives me an error: ImportError: Can not find the shared library: libhdfs3.so See installation…
rey
2 votes · 1 answer

How to run dask in multiple machines?

I found Dask recently. I have very basic questions about the Dask DataFrame and other data structures. Is the Dask DataFrame an immutable data type? Are Dask Array and DataFrame lazy data structures? I don't know whether to use dask or spark or pandas for…
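On the multiple-machines part of the question, the classic `dask.distributed` deployment is one scheduler process plus one worker process per machine — sketched below with a made-up host name:

```shell
# On the scheduler machine:
dask-scheduler                          # listens on tcp://scheduler-host:8786

# On each worker machine:
dask-worker tcp://scheduler-host:8786

# From your Python client:
#   from dask.distributed import Client
#   client = Client("tcp://scheduler-host:8786")
```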
Hariprasad
2 votes · 1 answer

Can dask be used to groupby and recode out of core?

I have 8GB csv files and 8GB of RAM. Each file has two strings per row in this form: a,c c,a f,g a,c c,a b,f c,a For smaller files I remove duplicates, counting how many copies of each row there were in the first two columns, and then recode the…
Simd
2 votes · 2 answers

Automatic start of dask distributed scheduler and workers on Ubuntu 16.04

I am considering different methods to automatically start and control dask distributed scheduler and workers on Ubuntu 16.04. Currently I think that the most relevant option is to use systemd daemon. This requires creation and installation of unit…
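A unit file for the scheduler might look like the sketch below; the user name and paths are placeholders, and a matching `dask-worker` unit would differ only in `ExecStart`:

```ini
[Unit]
Description=Dask distributed scheduler
After=network.target

[Service]
User=dask
ExecStart=/opt/conda/bin/dask-scheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target
```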
wl2776
2 votes · 1 answer

Dask-distributed. How to get task key ID in the function being calculated?

My computations with dask.distributed include creation of intermediate files whose names include UUID4, that identify that chunk of work. pairs = '{}\n{}\n{}\n{}'.format(list1, list2, list3, ...) file_path = os.path.join(job_output_root,…
wl2776
2 votes · 1 answer

Is it possible to get an estimation on execution time of Dask operations

I know it's very specific to the environment running the code, but given that dask calculates its execution plan in advance as a DAG, is there a way to estimate how long that execution should take? The progress bar is a great help once execution is…
mobcdi
2 votes · 2 answers

How to visualize dask graphs?

I am following the official docs; however, I am getting an error during import. F:\>python Python 2.7.11 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:58:36)[MSC v.1500 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license" for…
wl2776
2 votes · 1 answer

Possible to Dask-ify incremental PCA, Stochastic Gradient Descent, or other scikit-learn partial fit algorithms

Based on "Incremental PCA on big data" and the incremental PCA docs, it's suggested to use a memmap array, but would it be possible to accomplish the same thing using dask? Update: expanded the question to include other partial fit algorithms, as the git…
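In principle, yes: any estimator with `partial_fit` can be fed one dask chunk at a time instead of a memmap. A hedged sketch using `SGDRegressor` as a stand-in for whichever partial-fit estimator is wanted (data and shapes are invented):

```python
import numpy as np
import dask.array as da
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = da.from_array(rng.rand(400, 5), chunks=(100, 5))
y = da.from_array(rng.rand(400), chunks=(100,))

model = SGDRegressor(random_state=0)
for xb, yb in zip(X.to_delayed().ravel(), y.to_delayed().ravel()):
    # Each block is materialized only when its turn comes, so only one
    # chunk needs to fit in memory at a time.
    model.partial_fit(xb.compute(), yb.compute())

print(model.coef_.shape)  # → (5,)
```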
mobcdi
2 votes · 1 answer

Properly handling Dask multiprocessing in SQLAlchemy

The setting in which I am working can be described as follows. Database, and what I want to extract from it: the data required to run the analysis is stored in a single de-normalized (more than 100 columns) Oracle table. Financial reporting data is…
sim
2 votes · 5 answers

How to get n longest entries of DataFrame?

I'm trying to get the n longest entries of a dask DataFrame. I tried calling nlargest on a dask DataFrame with two columns like this: import dask.dataframe as dd df = dd.read_csv("opendns-random-domains.txt", header=None,…
vollkorn
2 votes · 1 answer

Is there a way to register Jupyter Notebook Progress Bar Widget instead of Text Progress Bar in Dask/Distributed?

I know that there is a way to globally register dask.diagnostics.ProgressBar, and while it is quite nice, it breaks my cell outputs. I have also seen a nice distributed.diagnostics.progress function, which can present the execution progress with…
Vlad Frolov