Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
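
A tiny sketch showing both pieces side by side; the numbers and toy data are purely illustrative:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Dynamic task scheduling: build a small graph of delayed calls and
    # run it with one of dask's schedulers.
    inc = dask.delayed(lambda x: x + 1)
    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())

    # "Big Data" collection: a dask dataframe mirrors the pandas API but
    # splits the data into partitions that are processed in parallel.
    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
    print(ddf.x.sum().compute())
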

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5
votes
1 answer

Safe & performant way to modify dask dataframe

As part of a data workflow, I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in 2 cases: mapping columns and mapping partitions. What is the recommended safe &…
evilkonrex
  • 255
  • 2
  • 10
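
A minimal sketch of the two approaches the question names, mapping a column and mapping whole partitions; the column names and transformations are illustrative, not taken from the question:

    import pandas as pd
    import dask.dataframe as dd

    # Toy data; a real workflow would start from an existing dask dataframe.
    pdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Option 1: map a single column element-wise. `meta` tells dask the
    # output name and dtype so it does not have to guess.
    ddf["x"] = ddf["x"].map(lambda v: v * 2, meta=("x", "int64"))

    # Option 2: map whole partitions; each partition is a pandas DataFrame,
    # so ordinary pandas logic runs inside the function.
    def transform(part):
        part = part.copy()          # avoid mutating dask's internal data
        part["y"] = part["y"] + 1
        return part

    ddf = ddf.map_partitions(transform)
    print(ddf.compute())
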
5
votes
1 answer

Complex filtering in dask DataFrame

I'm used to doing "complex" filtering on pandas DataFrame objects: import numpy as np import pandas as pd data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"]) data2 = data[np.sqrt((data.x - 200)**2 + (data.y - 200)**2) <…
David Hoffman
  • 2,205
  • 1
  • 16
  • 30
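
A sketch of the same boolean-mask filtering on a dask dataframe, recreating the question's toy data; `** 0.5` stands in for np.sqrt just to keep the expression in plain column arithmetic:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"])
    ddata = dd.from_pandas(data, npartitions=4)

    # The mask is itself a lazy dask Series of booleans; indexing with it
    # filters each partition without loading everything into memory.
    mask = ((ddata.x - 200) ** 2 + (ddata.y - 200) ** 2) ** 0.5 < 100
    filtered = ddata[mask]

    print(len(filtered.compute()))
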
5
votes
0 answers

Pyspark, dask, or any other Python tool: how to pivot a large table without crashing my laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark. However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1000 columns and about 1.5 million rows. I suspect that on the way…
user798719
  • 9,619
  • 25
  • 84
  • 123
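
One hedged sketch with dask: dd.pivot_table can build the wide table lazily, provided the `columns` column is categorical with known categories; the column names here are made up:

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical long-format data: one row per (id, feature, value) triple.
    pdf = pd.DataFrame({"id": [1, 1, 2, 2],
                        "feature": ["a", "b", "a", "b"],
                        "value": [0.1, 0.2, 0.3, 0.4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # pivot_table needs known categories so it can determine the output
    # columns without scanning all the data first.
    ddf["feature"] = ddf["feature"].astype("category").cat.as_known()

    wide = dd.pivot_table(ddf, index="id", columns="feature",
                          values="value", aggfunc="mean")
    print(wide.compute())
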
5
votes
1 answer

Dask dataframes known_divisions and performance

I have several files with a column called idx that I would like to use as the index. The resulting dataframe has about 13M rows. I know that I can read and assign the index this way (which is slow, ~40 s): df = dd.read_parquet("file-*.parq") df =…
rpanai
  • 12,515
  • 2
  • 42
  • 64
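
A short sketch of two ways to get known divisions cheaply, assuming the idx column is already sorted across the files (as the question implies):

    import dask.dataframe as dd

    # If idx is globally sorted, sorted=True skips the expensive shuffle
    # that makes a plain set_index slow.
    df = dd.read_parquet("file-*.parq")
    df = df.set_index("idx", sorted=True)

    # Alternatively, if idx was written as the parquet index, read_parquet
    # can restore it (and its divisions) directly:
    # df = dd.read_parquet("file-*.parq", index="idx")

    print(df.known_divisions)
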
5
votes
1 answer

Groupby and apply pandas vs dask

There is something I don't quite understand about dask.dataframe behavior. Let's say I want to replicate this from pandas: import pandas as pd import dask.dataframe as dd import random s = "abcd" lst = 10*[0]+list(range(1,6)) n = 100 df =…
rpanai
  • 12,515
  • 2
  • 42
  • 64
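
A minimal groupby-apply sketch in dask; the toy frame and the demeaning function are illustrative, and `meta` describes the output so dask does not have to guess it:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": list("abcd") * 25, "v": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # Unlike pandas, dask's groupby().apply() triggers a shuffle and wants
    # `meta` (the output schema) up front.
    def demean(group):
        group = group.copy()
        group["v"] = group["v"] - group["v"].mean()
        return group

    result = ddf.groupby("g").apply(demean, meta={"g": "object", "v": "f8"})
    print(result.compute().head())
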
5
votes
1 answer

Dask: Getting the Row which has the max value in groups using groupby

The same problem can be solved in pandas using transform, as explained here. With dask, the only working solution I found uses merge, and I was wondering whether there are other ways to achieve it.
rpanai
  • 12,515
  • 2
  • 42
  • 64
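
One alternative to the merge-based solution, sketched on toy data: a groupby-apply that runs plain pandas idxmax inside each group. It still shuffles by the group key, so it is another way to express the operation rather than a guaranteed speed-up:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": list("aabbcc"),
                        "v": [1, 3, 2, 5, 4, 0],
                        "other": list("uvwxyz")})
    ddf = dd.from_pandas(pdf, npartitions=2)

    rows = ddf.groupby("g").apply(
        lambda d: d.loc[[d["v"].idxmax()]],   # one-row frame per group
        meta=pdf.iloc[:0],                    # empty frame as output schema
    )
    print(rows.compute())
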
5
votes
2 answers

Parallel excel sheet read from dask

Hello, all the examples I have come across for using dask thus far have been multiple CSV files in a folder being read with dask's read_csv call. If I am provided an xlsx file with multiple tabs, can I use anything in dask to read them…
schuler
  • 175
  • 2
  • 4
  • 12
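
dask has no read_excel, but each tab can be wrapped in a delayed pandas call and stitched into one dask dataframe; the file name is hypothetical and the sketch assumes every sheet shares the same columns:

    import pandas as pd
    import dask
    import dask.dataframe as dd

    path = "workbook.xlsx"                       # hypothetical file
    sheets = pd.ExcelFile(path).sheet_names      # list the tabs up front

    # One lazy pandas.read_excel call per sheet, one partition per sheet.
    parts = [dask.delayed(pd.read_excel)(path, sheet_name=s) for s in sheets]
    ddf = dd.from_delayed(parts)

    print(ddf.head())
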
5
votes
1 answer

Distributing rows amongst partitions in a Dask DataFrame

Expectation: I would expect that, when I partition a given dataframe, the rows will be roughly evenly distributed across the partitions. I would then expect that, when I write the dataframe to CSV, the resulting n CSVs (in this case, 10) would…
kuanb
  • 1,618
  • 2
  • 20
  • 42
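
A small sketch of the expected behaviour: from_pandas splits rows by position into roughly equal partitions, and to_csv then writes one file per partition (file names here are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": range(1000)})
    ddf = dd.from_pandas(pdf, npartitions=10)

    # Rows per partition; with 1000 rows and 10 partitions this should be
    # roughly 100 each.
    print(ddf.map_partitions(len).compute())

    # One CSV per partition: out-0.csv ... out-9.csv
    ddf.to_csv("out-*.csv", index=False)
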
5
votes
1 answer

Using dask for task scheduling to run machine learning models in parallel

So basically what I want is to run ML pipelines in parallel. I have been using scikit-learn, and I have decided to use DaskGridSearchCV. I have a list of gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator) objects, and I run each…
Larissa Leite
  • 1,358
  • 3
  • 21
  • 36
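
A hedged sketch of one way to run several independent searches concurrently, using dask.delayed around plain scikit-learn GridSearchCV rather than DaskGridSearchCV; the pipeline, grids, and dataset are stand-ins:

    import dask
    from distributed import Client
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    client = Client()                       # local cluster
    X, y = load_iris(return_X_y=True)

    grids = [{"logisticregression__C": [0.1, 1.0]},
             {"logisticregression__C": [10.0, 100.0]}]

    @dask.delayed
    def run_search(grid):
        pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
        search = GridSearchCV(pipe, grid, scoring="accuracy")
        search.fit(X, y)
        return search.best_score_, search.best_params_

    # The delayed searches are independent, so the scheduler runs them
    # concurrently across the workers.
    results = dask.compute(*[run_search(g) for g in grids])
    print(results)
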
5
votes
1 answer

Dask Distributed Diagnostic Webpage not working

I've gotten dask up and running on my cluster, but I can't seem to access the diagnostic webpage. The landing page is visible (see below), but all the links just hang and never load the page. The scheduler started fine with this…
David Hoffman
  • 2,205
  • 1
  • 16
  • 30
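
A quick check worth running in this situation; the diagnostic pages are served through bokeh in the scheduler's environment, so a missing or mismatched bokeh is a common reason the links hang (the scheduler address below is illustrative):

    from distributed import Client

    client = Client()                 # or Client("tcp://scheduler:8786")
    print(client.dashboard_link)      # the address the diagnostics serve on
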
5
votes
1 answer

Getting year and week from a datetime series in a dask dataframe?

If I have a Pandas dataframe, and a column that is a datetime type, I can get the year as follows: df['year'] = df['date'].dt.year With a dask dataframe, that does not work. If I compute first, like this: df['year'] = df['date'].compute().dt.year I…
user1566200
  • 1,826
  • 4
  • 27
  • 47
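
A small sketch: the .dt accessor is lazy on dask datetime columns, so no compute() is needed for the year, and the ISO week can be taken per partition with plain pandas (the column name and data are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=10, freq="D")})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # If the column arrived as strings, convert first with dd.to_datetime.
    ddf["year"] = ddf["date"].dt.year
    ddf["week"] = ddf["date"].map_partitions(
        lambda s: s.dt.isocalendar().week.astype("int64"),
        meta=("week", "int64"),
    )

    print(ddf.compute())
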
5
votes
1 answer

Using all cores in Dask

I am working on a google cloud computing instance with 24 vCPUs. The code I am running is the following: import dask.dataframe as dd from distributed import Client client = Client() #read data logd = (dd.read_csv('vol/800000test', sep='\t',…
JuanPabloMF
  • 397
  • 1
  • 3
  • 14
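
A hedged sketch of sizing the local cluster explicitly; Client() already sizes itself to the machine, but for GIL-heavy pandas work it can help to trade threads for processes (the worker counts are illustrative, the file path is the question's own):

    from distributed import Client
    import dask.dataframe as dd

    # 24 single-threaded workers, one per vCPU.
    client = Client(n_workers=24, threads_per_worker=1)
    print(client)

    logd = dd.read_csv("vol/800000test", sep="\t")
    print(logd.map_partitions(len).compute().sum())
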
5
votes
3 answers

Dask not installing graphviz dependency

I'm getting an error when I try to import dask.dot, saying that it can't find the graphviz install. However, both graphviz and pygraphviz are installed. balter@exalab3:~$ conda install dask Fetching package metadata ........... Solving package…
abalter
  • 9,663
  • 17
  • 90
  • 145
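
For context, dask.dot renders graphs through the graphviz Python bindings (the conda package python-graphviz), not pygraphviz, so those two packages alone are not enough; a quick environment check:

    # Both imports must succeed for dask.dot / visualize() to work.
    import graphviz
    import dask

    x = dask.delayed(sum)([1, 2, 3])
    x.visualize(filename="graph.png")    # renders via graphviz if available
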
5
votes
1 answer

Repeated task execution using the distributed Dask scheduler

I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute(). When the number of tasks is, say, 20 (a number much greater than the number of workers) and each task takes, say, at least 15…
Daniel
  • 53
  • 4
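
A minimal sketch of submitting the batch once through the client, so each keyed task is sent to the scheduler a single time and normally only re-runs if a worker is lost; the task body is a stand-in for the long-running work:

    import time
    import dask
    from distributed import Client

    client = Client()                          # scheduler + local workers

    @dask.delayed
    def slow(i):
        time.sleep(1)                          # stand-in for a 15s+ task
        return i * i

    tasks = [slow(i) for i in range(20)]

    futures = client.compute(tasks)            # one future per task
    results = client.gather(futures)
    print(results)
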
5
votes
0 answers

Pathos, Dask, or futures: which one to use for a parallel cluster application?

I am confused here. I have an application that is CPU bound, so I set out to parallelise it with multiple processes to overcome GIL issues. I first tried multiprocessing and futures, but I faced a pickling issue, so I went to pathos…
tupui
  • 5,738
  • 3
  • 31
  • 52
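
For comparison, a minimal dask.distributed version of a CPU-bound map; workers are separate processes, so the GIL is not a bottleneck, and cloudpickle handles closures that trip up the standard library pickle (the function and inputs are stand-ins):

    from distributed import Client

    def expensive(x):
        # stand-in for the CPU-bound work
        return sum(i * i for i in range(x))

    if __name__ == "__main__":
        client = Client(n_workers=4, threads_per_worker=1)
        futures = client.map(expensive, [100_000] * 8)
        print(client.gather(futures))
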