Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
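
A minimal sketch of the two pieces working together (the file pattern and column name below are assumptions, not from the docs):

    import dask
    import dask.dataframe as dd

    # Collection API: a pandas-like dataframe split into many partitions
    df = dd.read_csv('data-*.csv')          # hypothetical file pattern
    total = df['amount'].sum()              # lazy: builds a task graph, no work yet

    # Task-scheduling API: wrap arbitrary Python calls as graph nodes
    lazy_sum = dask.delayed(sum)([1, 2, 3])

    # both graphs run on the same dynamic scheduler
    print(total.compute(), lazy_sum.compute())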

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 1 answer

What is the Dask equivalent of the Pandas .filter() attribute?

I am trying to make sub-DataFrames from a larger DataFrame in Dask. I realize that a lot of the tools found in Pandas used to manipulate DataFrames are present in Dask; however, the devs are very transparent about what is not. One such tool is the…
Dave L • 77 • 11
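
If .filter() is one of the missing pieces in your Dask version, a sketch of a common workaround is to match the column labels yourself and select with brackets (the file and column names are hypothetical):

    import dask.dataframe as dd

    df = dd.read_csv('data.csv')                          # hypothetical input
    # emulate pandas df.filter(like='price') by matching labels ourselves
    price_cols = [c for c in df.columns if 'price' in c]
    sub = df[price_cols]                                  # still a lazy dask dataframe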
4 votes, 2 answers

Selecting a few rows by index from a dask dataframe?

df = dd.read_csv('csv', usecols=fields, skip_blank_lines=True); len(df.iloc[0:5]) — the above code raises AttributeError: 'DataFrame' object has no attribute 'iloc'. I tried ix and loc but am unable to select rows based on the index.
madnavs • 137 • 2 • 8
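
Dask dataframes are partitioned, so purely positional row access is generally unsupported; a sketch of the usual substitutes, with a hypothetical column list:

    import dask.dataframe as dd

    fields = ['id', 'name']                               # hypothetical columns
    df = dd.read_csv('data.csv', usecols=fields, skip_blank_lines=True)
    first5 = df.head(5)    # returns a plain pandas DataFrame of the first rows
    # label-based selection works once an index is set: df.set_index('id').loc[...]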
4 votes, 1 answer

Running shell commands in parallel using dask distributed

I have a folder with a lot of .sh scripts. How can I use an already set up dask distributed cluster to run them in parallel? Currently, I am doing the following: import dask, distributed, os # list with shell commands that I want to run commands =…
Arco Bast • 3,595 • 2 • 26 • 53
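
A sketch of one way to fan the scripts out over an existing cluster with client.map plus subprocess (the scheduler address and script names are assumptions):

    import subprocess
    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed address
    commands = ['bash a.sh', 'bash b.sh']              # hypothetical scripts

    def run(cmd):
        # run one shell command on whichever worker gets the task
        return subprocess.run(cmd, shell=True).returncode

    futures = client.map(run, commands, pure=False)    # pure=False: rerun duplicates
    exit_codes = client.gather(futures)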
4 votes, 0 answers

Optimal approach to create a dask dataframe from parquet files (HDFS) in different directories

I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time. Approach 1: call the read_parquet API with a glob path.…
Santosh Kumar • 761 • 5 • 28
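
One sketch under the assumption that the directories are known up front: read each lazily and concatenate, so nothing is loaded until compute (the paths are hypothetical):

    import dask.dataframe as dd

    paths = ['hdfs:///data/2019/*.parquet',
             'hdfs:///data/2020/*.parquet']            # hypothetical directories
    parts = [dd.read_parquet(p) for p in paths]        # each one is lazy
    df = dd.concat(parts)                              # one logical dataframe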
4 votes, 1 answer

Read tail by partition from CSV file with dask.dataframe

With Dask we can easily read CSV files and take the first lines with head, even over multiple partitions. import dask.dataframe as dd df = dd.read_csv('data.csv').head(n=100, npartitions=2) But I would like to read the last lines of my CSV file over multiple…
Thomas • 1,164 • 13 • 41
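
df.tail reads only from the last partition; for the last rows of every partition, map_partitions can apply pandas' tail per partition. A sketch with a hypothetical file:

    import dask.dataframe as dd

    df = dd.read_csv('data.csv')                          # hypothetical input
    last100 = df.tail(100)                                # last partition only
    per_part = df.map_partitions(lambda p: p.tail(100))   # tail of each partition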
4 votes, 0 answers

Dask restart worker(s) using client

Is there a way, using the dask client, to restart a worker or a provided list of workers? I need a way to bounce a worker after a task executes, to reset process state that the execution may have changed. Client.restart() restarts the entire…
Ameet Shah • 61 • 1 • 4
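
Recent releases of distributed expose Client.restart_workers for exactly this; whether your version has it is an assumption, and the addresses below are hypothetical:

    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed scheduler address
    # newer distributed versions only; bounces just the listed workers,
    # unlike Client.restart(), which resets the whole cluster
    client.restart_workers(workers=['tcp://10.0.0.5:39000'])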
4 votes, 1 answer

Write numpy array to binary file efficiently

I need an efficient solution for writing a large amount of data to a binary file. Currently I use the numpy method .tofile, which consumes most of the runtime. My MWE: import numpy as np def writeCFloat(f, ndarray): np.asarray(ndarray,…
Magdalena • 45 • 1 • 5
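
A sketch of the usual speedups for .tofile: cast once to the target dtype and make the array contiguous, so the write is a single raw block dump (the dtype and path are assumptions):

    import numpy as np

    def write_cfloat(f, arr):
        # at most one cast + one contiguous copy, then a raw block write
        np.ascontiguousarray(arr, dtype=np.complex64).tofile(f)

    data = np.zeros((1000, 1000), dtype=np.complex64)
    with open('out.bin', 'wb') as f:                   # hypothetical output path
        write_cfloat(f, data)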
4 votes, 1 answer

Get new dataframe with only the latest rows per user

I have a big dataframe looking like this: Id last_item_bought time 'user1' 'bike' 2018-01-01 'user3' 'spoon' 2018-01-01 'user2' 'car' 2018-01-01 'user1' 'spoon' 2018-01-02 'user2' 'bike' 2018-01-02 'user3' 'paper' 2018-01-03 Each user has…
dennis-w • 2,166 • 1 • 13 • 23
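
A sketch that stays within well-supported Dask operations: compute the latest time per Id, then inner-join back to keep only those rows (the data below mirrors the excerpt):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'Id': ['user1', 'user1', 'user2'],
                        'last_item_bought': ['bike', 'spoon', 'car'],
                        'time': ['2018-01-01', '2018-01-02', '2018-01-01']})
    df = dd.from_pandas(pdf, npartitions=2)

    latest_time = df.groupby('Id')['time'].max().reset_index()
    latest = df.merge(latest_time, on=['Id', 'time'], how='inner')
    print(latest.compute())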
4 votes, 1 answer

How to get all groups from a Dask DataFrameGroupBy if I have more than one group-by field?

How can I get all unique groups in Dask from a grouped data frame? Let's say we have the following code: g = df.groupby(['Year', 'Month', 'Day']). I have to iterate through all groups and process the data within them. My idea was to get all…
qwertz1123 • 1,173 • 10 • 27
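
A sketch of two routes, using the column names from the excerpt: keep the per-group processing inside the graph with groupby().apply, or materialize the distinct keys and slice group by group:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'Year': [2020, 2020], 'Month': [1, 2],
                        'Day': [1, 1], 'value': [10, 20]})
    df = dd.from_pandas(pdf, npartitions=2)

    # route 1: process each (Year, Month, Day) group in parallel
    sums = df.groupby(['Year', 'Month', 'Day']).apply(
        lambda g: g['value'].sum(), meta=('value', 'int64'))

    # route 2: pull the distinct keys to the client, then slice per group
    keys = df[['Year', 'Month', 'Day']].drop_duplicates().compute()
    for _, row in keys.iterrows():
        group = df[(df.Year == row.Year) & (df.Month == row.Month)
                   & (df.Day == row.Day)]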
4 votes, 1 answer

How to keep partitions after performing a group-by aggregation in dask

In my application I perform an aggregation on a dask dataframe using groupby, ordered by a certain id. However, I would like the aggregation to maintain the partition divisions, as I intend to perform joins with another dataframe identically…
pygabriel • 9,840 • 4 • 41 • 54
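
One sketch of the usual trick, assuming the dataframe can be indexed by the grouping id and no single id straddles a partition boundary: after set_index, a per-partition pandas groupby is a correct global groupby and keeps the divisions:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1.0, 2.0, 3.0, 4.0]})
    df = dd.from_pandas(pdf, npartitions=2).set_index('id')  # known divisions

    # aggregate inside each partition; divisions survive, so a later
    # join on id against an identically partitioned frame needs no shuffle
    agg = df.map_partitions(lambda p: p.groupby(level=0).sum())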
4 votes, 1 answer

How to perform positional indexing in Python Dask dataframes

I've been working through the Dask Concurrent.futures documentation, and I'm having some trouble with the (outdated) Random Forest example. Specifically, the use of positional indexing to slice the dask dataframe into test/train splits: train =…
shellcat_zero • 1,027 • 13 • 20
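
For the test/train use case specifically, a sketch that sidesteps positional indexing with random_split (the fractions and seed are arbitrary choices):

    import pandas as pd
    import dask.dataframe as dd

    df = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
    train, test = df.random_split([0.8, 0.2], random_state=0)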
4 votes, 1 answer

Using dask delayed to create dictionary values

I'm struggling to figure out how to get dask delayed to work on a particular workflow that involves creating a dictionary. The idea here is that func1, func2, func3 can run independently of each other at the same time, and I want the results of…
blahblahblah • 2,299 • 8 • 45 • 60
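
A sketch: dask.compute traverses built-in containers, so a dict whose values are delayed results can be computed in one call (the three functions are stand-ins):

    import dask
    from dask import delayed

    @delayed
    def func1():
        return 1

    @delayed
    def func2():
        return 2

    @delayed
    def func3():
        return 3

    lazy = {'a': func1(), 'b': func2(), 'c': func3()}  # values are delayed
    (results,) = dask.compute(lazy)                    # all three run in parallel
    # results == {'a': 1, 'b': 2, 'c': 3}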
4 votes, 1 answer

Redistribute dask tasks among the cluster

I am abusing dask as a task scheduler for long-running tasks with map(…, pure=False). So I am not interested in the dask graph; I just use dask as a way to distribute unix commands. Let's say I have 1000 tasks and they run for a week on a cluster of…
MaxBenChrist • 547 • 3 • 9
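
Queued tasks are already redistributed by the scheduler's work stealing; for finished results, client.rebalance spreads the data evenly over the current workers. A sketch (the scheduler address is assumed):

    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed address
    futures = client.map(lambda x: x * 2, range(1000), pure=False)
    client.rebalance(futures)   # move finished results evenly across workers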
4 votes, 2 answers

How do I capture dask-worker console logs in a file?

In the below, I want to capture "dask_client_log_msg" and other client-side logs in one file, and "dask_worker_log_msg" and other worker logs in a separate file. Obviously the client will run in a separate process altogether from the worker. So I need one…
TheCodeCache • 820 • 1 • 7 • 27
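
A sketch of one approach: distributed logs through the standard logging module, so a file handler can be attached inside every worker process via client.run ('distributed.worker' is the library's logger name; the path and address are assumptions):

    import logging
    from dask.distributed import Client

    def add_worker_file_log():
        handler = logging.FileHandler('/tmp/dask-worker.log')  # hypothetical path
        handler.setFormatter(logging.Formatter(
            '%(asctime)s %(name)s %(levelname)s %(message)s'))
        logging.getLogger('distributed.worker').addHandler(handler)

    client = Client('tcp://scheduler:8786')            # assumed address
    client.run(add_worker_file_log)                    # executes on every worker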
4 votes, 1 answer

Compute very slow when processing a large array

I'm trying to read in a 220 GB csv file with dask. Each line of this file has a name, a unique id, and the id of its parent. Each entry has multiple generations of parents; eventually I'd like to be able to reassemble the whole tree, but it's taking…
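
A hedged sketch of the usual first aids for this workload: pass explicit dtypes so read_csv skips type inference, and persist the parsed frame so repeated traversals don't re-read 220 GB (the path and column names are guesses from the excerpt):

    import dask.dataframe as dd

    df = dd.read_csv('tree.csv',                       # hypothetical path
                     dtype={'name': 'object', 'id': 'int64',
                            'parent_id': 'int64'})     # hypothetical columns
    df = df.persist()   # materialize once; later queries reuse the in-memory data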