Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
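
A tiny sketch showing both pieces side by side; the numbers and toy data are purely illustrative:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # Dynamic task scheduling: build a small graph of delayed calls and
    # run it with one of dask's schedulers.
    inc = dask.delayed(lambda x: x + 1)
    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())

    # "Big Data" collection: a dask dataframe mirrors the pandas API but
    # splits the data into partitions that are processed in parallel.
    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
    print(ddf.x.sum().compute())
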

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
5
votes
1 answer

Safe & performant way to modify dask dataframe

As part of a data workflow, I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in 2 cases: mapping columns and mapping partitions. What is the recommended safe &…
evilkonrex
  • 255
  • 2
  • 10
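
A minimal sketch of the two approaches the question names, mapping a column and mapping whole partitions; the column names and transformations are illustrative, not taken from the question:

    import pandas as pd
    import dask.dataframe as dd

    # Toy data; a real workflow would start from an existing dask dataframe.
    pdf = pd.DataFrame({"x": range(10), "y": range(10, 20)})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Option 1: map a single column element-wise. `meta` tells dask the
    # output name and dtype so it does not have to guess.
    ddf["x"] = ddf["x"].map(lambda v: v * 2, meta=("x", "int64"))

    # Option 2: map whole partitions; each partition is a pandas DataFrame,
    # so ordinary pandas logic runs inside the function.
    def transform(part):
        part = part.copy()          # avoid mutating dask's internal data
        part["y"] = part["y"] + 1
        return part

    ddf = ddf.map_partitions(transform)
    print(ddf.compute())
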
5
votes
1 answer

Complex filtering in dask DataFrame

I'm used to doing "complex" filtering on pandas DataFrame objects: import numpy as np import pandas as pd data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"]) data2 = data[np.sqrt((data.x - 200)**2 + (data.y - 200)**2) <…
David Hoffman
  • 2,205
  • 1
  • 16
  • 30
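
A sketch of the same boolean-mask filtering on a dask dataframe, recreating the question's toy data; `** 0.5` stands in for np.sqrt just to keep the expression in plain column arithmetic:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    data = pd.DataFrame(np.random.random((10000, 2)) * 512, columns=["x", "y"])
    ddata = dd.from_pandas(data, npartitions=4)

    # The mask is itself a lazy dask Series of booleans; indexing with it
    # filters each partition without loading everything into memory.
    mask = ((ddata.x - 200) ** 2 + (ddata.y - 200) ** 2) ** 0.5 < 100
    filtered = ddata[mask]

    print(len(filtered.compute()))
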
5
votes
0 answers

Pyspark, dask, or any other Python tool: how to pivot a large table without crashing my laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark. However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1000 columns and about 1.5 million rows. I suspect that on the way…
user798719
  • 9,619
  • 25
  • 84
  • 123
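
One hedged sketch with dask: dd.pivot_table can build the wide table lazily, provided the `columns` column is categorical with known categories; the column names here are made up:

    import pandas as pd
    import dask.dataframe as dd

    # Hypothetical long-format data: one row per (id, feature, value) triple.
    pdf = pd.DataFrame({"id": [1, 1, 2, 2],
                        "feature": ["a", "b", "a", "b"],
                        "value": [0.1, 0.2, 0.3, 0.4]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # pivot_table needs known categories so it can determine the output
    # columns without scanning all the data first.
    ddf["feature"] = ddf["feature"].astype("category").cat.as_known()

    wide = dd.pivot_table(ddf, index="id", columns="feature",
                          values="value", aggfunc="mean")
    print(wide.compute())
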
5
votes
1 answer

Dask dataframes known_divisions and performance

I have several files with a column called idx that I would like to use as the index. The resulting dataframe has about 13M rows. I know that I can read and assign the index this way (which is slow, ~40 s): df = dd.read_parquet("file-*.parq") df =…
rpanai
  • 12,515
  • 2
  • 42
  • 64
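
A short sketch of two ways to get known divisions cheaply, assuming the idx column is already sorted across the files (as the question implies):

    import dask.dataframe as dd

    # If idx is globally sorted, sorted=True skips the expensive shuffle
    # that makes a plain set_index slow.
    df = dd.read_parquet("file-*.parq")
    df = df.set_index("idx", sorted=True)

    # Alternatively, if idx was written as the parquet index, read_parquet
    # can restore it (and its divisions) directly:
    # df = dd.read_parquet("file-*.parq", index="idx")

    print(df.known_divisions)
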
5
votes
1 answer

Groupby and apply pandas vs dask

There is something I don't quite understand about dask.dataframe behavior. Let's say I want to replicate this from pandas: import pandas as pd import dask.dataframe as dd import random s = "abcd" lst = 10*[0]+list(range(1,6)) n = 100 df =…
rpanai
  • 12,515
  • 2
  • 42
  • 64
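
A minimal groupby-apply sketch in dask; the toy frame and the demeaning function are illustrative, and `meta` describes the output so dask does not have to guess it:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": list("abcd") * 25, "v": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)

    # Unlike pandas, dask's groupby().apply() triggers a shuffle and wants
    # `meta` (the output schema) up front.
    def demean(group):
        group = group.copy()
        group["v"] = group["v"] - group["v"].mean()
        return group

    result = ddf.groupby("g").apply(demean, meta={"g": "object", "v": "f8"})
    print(result.compute().head())
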
5
votes
1 answer

Dask: Getting the Row which has the max value in groups using groupby

The same problem can be solved in pandas using transform, as explained here. With dask, the only working solution I found uses merge, and I was wondering whether there are other ways to achieve it.
rpanai
  • 12,515
  • 2
  • 42
  • 64
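
One alternative to the merge-based solution, sketched on toy data: a groupby-apply that runs plain pandas idxmax inside each group. It still shuffles by the group key, so it is another way to express the operation rather than a guaranteed speed-up:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": list("aabbcc"),
                        "v": [1, 3, 2, 5, 4, 0],
                        "other": list("uvwxyz")})
    ddf = dd.from_pandas(pdf, npartitions=2)

    rows = ddf.groupby("g").apply(
        lambda d: d.loc[[d["v"].idxmax()]],   # one-row frame per group
        meta=pdf.iloc[:0],                    # empty frame as output schema
    )
    print(rows.compute())
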
5
votes
2 answers

Parallel excel sheet read from dask

Hello, all the examples I have come across for using dask thus far have been multiple CSV files in a folder being read with dask's read_csv call. If I am provided an xlsx file with multiple tabs, can I use anything in dask to read them…
schuler
  • 175
  • 2
  • 4
  • 12
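
dask has no read_excel, but each tab can be wrapped in a delayed pandas call and stitched into one dask dataframe; the file name is hypothetical and the sketch assumes every sheet shares the same columns:

    import pandas as pd
    import dask
    import dask.dataframe as dd

    path = "workbook.xlsx"                       # hypothetical file
    sheets = pd.ExcelFile(path).sheet_names      # list the tabs up front

    # One lazy pandas.read_excel call per sheet, one partition per sheet.
    parts = [dask.delayed(pd.read_excel)(path, sheet_name=s) for s in sheets]
    ddf = dd.from_delayed(parts)

    print(ddf.head())
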
5
votes
1 answer

Distributing rows amongst partitions in a Dask DataFrame

Expectation: I would expect that, when I partition a given dataframe, the rows will be roughly evenly distributed across the partitions. I would then expect that, when I write the dataframe to CSV, the resulting n CSVs (in this case, 10) would…
kuanb
  • 1,618
  • 2
  • 20
  • 42
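
A small sketch of the expected behaviour: from_pandas splits rows by position into roughly equal partitions, and to_csv then writes one file per partition (file names here are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": range(1000)})
    ddf = dd.from_pandas(pdf, npartitions=10)

    # Rows per partition; with 1000 rows and 10 partitions this should be
    # roughly 100 each.
    print(ddf.map_partitions(len).compute())

    # One CSV per partition: out-0.csv ... out-9.csv
    ddf.to_csv("out-*.csv", index=False)
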
5
votes
1 answer

Using dask for task scheduling to run machine learning models in parallel

So basically what I want is to run ML pipelines in parallel. I have been using scikit-learn, and I have decided to use DaskGridSearchCV. I have a list of gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator) objects, and I run each…
Larissa Leite
  • 1,358
  • 3
  • 21
  • 36
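
A hedged sketch of one way to run several independent searches concurrently, using dask.delayed around plain scikit-learn GridSearchCV rather than DaskGridSearchCV; the pipeline, grids, and dataset are stand-ins:

    import dask
    from distributed import Client
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    client = Client()                       # local cluster
    X, y = load_iris(return_X_y=True)

    grids = [{"logisticregression__C": [0.1, 1.0]},
             {"logisticregression__C": [10.0, 100.0]}]

    @dask.delayed
    def run_search(grid):
        pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
        search = GridSearchCV(pipe, grid, scoring="accuracy")
        search.fit(X, y)
        return search.best_score_, search.best_params_

    # The delayed searches are independent, so the scheduler runs them
    # concurrently across the workers.
    results = dask.compute(*[run_search(g) for g in grids])
    print(results)
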
5
votes
1 answer

Dask Distributed Diagnostic Webpage not working

I've gotten dask up and running on my cluster, but I can't seem to access the diagnostic webpage. The landing page is visible (see below), but all the links just hang and never load the page. The scheduler started fine with this…
David Hoffman
  • 2,205
  • 1
  • 16
  • 30
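
A quick check worth running in this situation; the diagnostic pages are served through bokeh in the scheduler's environment, so a missing or mismatched bokeh is a common reason the links hang (the scheduler address below is illustrative):

    from distributed import Client

    client = Client()                 # or Client("tcp://scheduler:8786")
    print(client.dashboard_link)      # the address the diagnostics serve on
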
5
votes
1 answer

Getting year and week from a datetime series in a dask dataframe?

If I have a Pandas dataframe, and a column that is a datetime type, I can get the year as follows: df['year'] = df['date'].dt.year With a dask dataframe, that does not work. If I compute first, like this: df['year'] = df['date'].compute().dt.year I…
user1566200
  • 1,826
  • 4
  • 27
  • 47
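
A small sketch: the .dt accessor is lazy on dask datetime columns, so no compute() is needed for the year, and the ISO week can be taken per partition with plain pandas (the column name and data are illustrative):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"date": pd.date_range("2017-01-01", periods=10, freq="D")})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # If the column arrived as strings, convert first with dd.to_datetime.
    ddf["year"] = ddf["date"].dt.year
    ddf["week"] = ddf["date"].map_partitions(
        lambda s: s.dt.isocalendar().week.astype("int64"),
        meta=("week", "int64"),
    )

    print(ddf.compute())
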
5
votes
1 answer

Using all cores in Dask

I am working on a google cloud computing instance with 24 vCPUs. The code I am running is the following: import dask.dataframe as dd from distributed import Client client = Client() #read data logd = (dd.read_csv('vol/800000test', sep='\t',…
JuanPabloMF
  • 397
  • 1
  • 3
  • 14
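
A hedged sketch of sizing the local cluster explicitly; Client() already sizes itself to the machine, but for GIL-heavy pandas work it can help to trade threads for processes (the worker counts are illustrative, the file path is the question's own):

    from distributed import Client
    import dask.dataframe as dd

    # 24 single-threaded workers, one per vCPU.
    client = Client(n_workers=24, threads_per_worker=1)
    print(client)

    logd = dd.read_csv("vol/800000test", sep="\t")
    print(logd.map_partitions(len).compute().sum())
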
5
votes
3 answers

Dask not installing graphviz dependency

I'm getting an error when I try to import dask.dot, saying that it can't find the graphviz install. However, both graphviz and pygraphviz are installed. balter@exalab3:~$ conda install dask Fetching package metadata ........... Solving package…
abalter
  • 9,663
  • 17
  • 90
  • 145
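
For context, dask.dot renders graphs through the graphviz Python bindings (the conda package python-graphviz), not pygraphviz, so those two packages alone are not enough; a quick environment check:

    # Both imports must succeed for dask.dot / visualize() to work.
    import graphviz
    import dask

    x = dask.delayed(sum)([1, 2, 3])
    x.visualize(filename="graph.png")    # renders via graphviz if available
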
5
votes
1 answer

Repeated task execution using the distributed Dask scheduler

I'm using the Dask distributed scheduler, running a scheduler and 5 workers locally. I submit a list of delayed() tasks to compute(). When the number of tasks is, say, 20 (a number much greater than the number of workers) and each task takes, say, at least 15…
Daniel
  • 53
  • 4
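
A minimal sketch of submitting the batch once through the client, so each keyed task is sent to the scheduler a single time and normally only re-runs if a worker is lost; the task body is a stand-in for the long-running work:

    import time
    import dask
    from distributed import Client

    client = Client()                          # scheduler + local workers

    @dask.delayed
    def slow(i):
        time.sleep(1)                          # stand-in for a 15s+ task
        return i * i

    tasks = [slow(i) for i in range(20)]

    futures = client.compute(tasks)            # one future per task
    results = client.gather(futures)
    print(results)
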
5
votes
0 answers

Pathos, Dask, or futures: which one to use for a parallel cluster application?

I am confused here. I have an application that is CPU bound, so I set out to parallelise it with multiple processes to overcome GIL issues. I first tried multiprocessing and futures, but I faced a pickling issue, so I went to pathos…
tupui
  • 5,738
  • 3
  • 31
  • 52
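
For comparison, a minimal dask.distributed version of a CPU-bound map; workers are separate processes, so the GIL is not a bottleneck, and cloudpickle handles closures that trip up the standard library pickle (the function and inputs are stand-ins):

    from distributed import Client

    def expensive(x):
        # stand-in for the CPU-bound work
        return sum(i * i for i in range(x))

    if __name__ == "__main__":
        client = Client(n_workers=4, threads_per_worker=1)
        futures = client.map(expensive, [100_000] * 8)
        print(client.gather(futures))
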