Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
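
A minimal sketch of the two pieces working together (the file pattern and column name below are assumptions, not from the docs):

    import dask
    import dask.dataframe as dd

    # Collection API: a pandas-like dataframe split into many partitions
    df = dd.read_csv('data-*.csv')          # hypothetical file pattern
    total = df['amount'].sum()              # lazy: builds a task graph, no work yet

    # Task-scheduling API: wrap arbitrary Python calls as graph nodes
    lazy_sum = dask.delayed(sum)([1, 2, 3])

    # both graphs run on the same dynamic scheduler
    print(total.compute(), lazy_sum.compute())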

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 1 answer

What is the Dask equivalent of the Pandas .filter() attribute?

I am trying to make sub-DataFrames from a larger DataFrame in Dask. I realize that a lot of the tools found in Pandas used to manipulate DataFrames are present in Dask; however, the devs are very transparent about what is not. One such tool is the…
Dave L • 77 • 11
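
If .filter() is one of the missing pieces in your Dask version, a sketch of a common workaround is to match the column labels yourself and select with brackets (the file and column names are hypothetical):

    import dask.dataframe as dd

    df = dd.read_csv('data.csv')                          # hypothetical input
    # emulate pandas df.filter(like='price') by matching labels ourselves
    price_cols = [c for c in df.columns if 'price' in c]
    sub = df[price_cols]                                  # still a lazy dask dataframe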
4 votes, 2 answers

Selecting a few rows by index from a dask dataframe?

df = dd.read_csv('csv', usecols=fields, skip_blank_lines=True); len(df.iloc[0:5]) — the above code raises AttributeError: 'DataFrame' object has no attribute 'iloc'. I tried ix and loc but am unable to select rows based on the index.
madnavs • 137 • 2 • 8
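
Dask dataframes are partitioned, so purely positional row access is generally unsupported; a sketch of the usual substitutes, with a hypothetical column list:

    import dask.dataframe as dd

    fields = ['id', 'name']                               # hypothetical columns
    df = dd.read_csv('data.csv', usecols=fields, skip_blank_lines=True)
    first5 = df.head(5)    # returns a plain pandas DataFrame of the first rows
    # label-based selection works once an index is set: df.set_index('id').loc[...]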
4 votes, 1 answer

Running shell commands in parallel using dask distributed

I have a folder with a lot of .sh scripts. How can I use an already set up dask distributed cluster to run them in parallel? Currently, I am doing the following: import dask, distributed, os # list with shell commands that I want to run commands =…
Arco Bast • 3,595 • 2 • 26 • 53
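
A sketch of one way to fan the scripts out over an existing cluster with client.map plus subprocess (the scheduler address and script names are assumptions):

    import subprocess
    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed address
    commands = ['bash a.sh', 'bash b.sh']              # hypothetical scripts

    def run(cmd):
        # run one shell command on whichever worker gets the task
        return subprocess.run(cmd, shell=True).returncode

    futures = client.map(run, commands, pure=False)    # pure=False: rerun duplicates
    exit_codes = client.gather(futures)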
4 votes, 0 answers

Optimal approach to create a dask dataframe from parquet files (HDFS) in different directories

I am trying to create a dask dataframe from a large number of parquet files stored in different HDFS directories. I have tried two approaches, but both of them seem to take a very long time. Approach 1: call the read_parquet API with a glob path.…
Santosh Kumar • 761 • 5 • 28
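
One sketch under the assumption that the directories are known up front: read each lazily and concatenate, so nothing is loaded until compute (the paths are hypothetical):

    import dask.dataframe as dd

    paths = ['hdfs:///data/2019/*.parquet',
             'hdfs:///data/2020/*.parquet']            # hypothetical directories
    parts = [dd.read_parquet(p) for p in paths]        # each one is lazy
    df = dd.concat(parts)                              # one logical dataframe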
4 votes, 1 answer

Read tail by partition from CSV file with dask.dataframe

With Dask we can easily read CSV files and take the first lines with head, even over multiple partitions. import dask.dataframe as dd df = dd.read_csv('data.csv').head(n=100, npartitions=2) But I would like to read the last lines of my CSV file over multiple…
Thomas • 1,164 • 13 • 41
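
df.tail reads only from the last partition; for the last rows of every partition, map_partitions can apply pandas' tail per partition. A sketch with a hypothetical file:

    import dask.dataframe as dd

    df = dd.read_csv('data.csv')                          # hypothetical input
    last100 = df.tail(100)                                # last partition only
    per_part = df.map_partitions(lambda p: p.tail(100))   # tail of each partition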
4 votes, 0 answers

Dask restart worker(s) using client

Is there a way, using the dask client, to restart a worker or a provided list of workers? I need a way to bounce a worker after a task executes, to reset process state that the execution may have changed. Client.restart() restarts the entire…
Ameet Shah • 61 • 1 • 4
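
Recent releases of distributed expose Client.restart_workers for exactly this; whether your version has it is an assumption, and the addresses below are hypothetical:

    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed scheduler address
    # newer distributed versions only; bounces just the listed workers,
    # unlike Client.restart(), which resets the whole cluster
    client.restart_workers(workers=['tcp://10.0.0.5:39000'])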
4 votes, 1 answer

Write numpy array to binary file efficiently

I need an efficient solution for writing a large amount of data to a binary file. Currently I use the numpy method .tofile, which consumes most of the runtime. My MWE: import numpy as np def writeCFloat(f, ndarray): np.asarray(ndarray,…
Magdalena • 45 • 1 • 5
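
A sketch of the usual speedups for .tofile: cast once to the target dtype and make the array contiguous, so the write is a single raw block dump (the dtype and path are assumptions):

    import numpy as np

    def write_cfloat(f, arr):
        # at most one cast + one contiguous copy, then a raw block write
        np.ascontiguousarray(arr, dtype=np.complex64).tofile(f)

    data = np.zeros((1000, 1000), dtype=np.complex64)
    with open('out.bin', 'wb') as f:                   # hypothetical output path
        write_cfloat(f, data)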
4 votes, 1 answer

Get new dataframe with only the latest rows per user

I have a big dataframe looking like this: Id last_item_bought time 'user1' 'bike' 2018-01-01 'user3' 'spoon' 2018-01-01 'user2' 'car' 2018-01-01 'user1' 'spoon' 2018-01-02 'user2' 'bike' 2018-01-02 'user3' 'paper' 2018-01-03 Each user has…
dennis-w • 2,166 • 1 • 13 • 23
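
A sketch that stays within well-supported Dask operations: compute the latest time per Id, then inner-join back to keep only those rows (the data below mirrors the excerpt):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'Id': ['user1', 'user1', 'user2'],
                        'last_item_bought': ['bike', 'spoon', 'car'],
                        'time': ['2018-01-01', '2018-01-02', '2018-01-01']})
    df = dd.from_pandas(pdf, npartitions=2)

    latest_time = df.groupby('Id')['time'].max().reset_index()
    latest = df.merge(latest_time, on=['Id', 'time'], how='inner')
    print(latest.compute())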
4 votes, 1 answer

How to get all groups from a Dask DataFrameGroupBy if I have more than one group-by field?

How can I get all unique groups in Dask from a grouped data frame? Let's say we have the following code: g = df.groupby(['Year', 'Month', 'Day']). I have to iterate through all groups and process the data within them. My idea was to get all…
qwertz1123 • 1,173 • 10 • 27
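
A sketch of two routes, using the column names from the excerpt: keep the per-group processing inside the graph with groupby().apply, or materialize the distinct keys and slice group by group:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'Year': [2020, 2020], 'Month': [1, 2],
                        'Day': [1, 1], 'value': [10, 20]})
    df = dd.from_pandas(pdf, npartitions=2)

    # route 1: process each (Year, Month, Day) group in parallel
    sums = df.groupby(['Year', 'Month', 'Day']).apply(
        lambda g: g['value'].sum(), meta=('value', 'int64'))

    # route 2: pull the distinct keys to the client, then slice per group
    keys = df[['Year', 'Month', 'Day']].drop_duplicates().compute()
    for _, row in keys.iterrows():
        group = df[(df.Year == row.Year) & (df.Month == row.Month)
                   & (df.Day == row.Day)]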
4 votes, 1 answer

How to keep partitions after performing a group-by aggregation in dask

In my application I perform an aggregation on a dask dataframe using groupby, ordered by a certain id. However, I would like the aggregation to maintain the partition divisions, as I intend to perform joins with another dataframe identically…
pygabriel • 9,840 • 4 • 41 • 54
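
One sketch of the usual trick, assuming the dataframe can be indexed by the grouping id and no single id straddles a partition boundary: after set_index, a per-partition pandas groupby is a correct global groupby and keeps the divisions:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({'id': [1, 1, 2, 2], 'x': [1.0, 2.0, 3.0, 4.0]})
    df = dd.from_pandas(pdf, npartitions=2).set_index('id')  # known divisions

    # aggregate inside each partition; divisions survive, so a later
    # join on id against an identically partitioned frame needs no shuffle
    agg = df.map_partitions(lambda p: p.groupby(level=0).sum())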
4 votes, 1 answer

How to perform positional indexing in Python Dask dataframes

I've been working through the Dask Concurrent.futures documentation, and I'm having some trouble with the (outdated) Random Forest example. Specifically, the use of positional indexing to slice the dask dataframe into test/train splits: train =…
shellcat_zero • 1,027 • 13 • 20
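
For the test/train use case specifically, a sketch that sidesteps positional indexing with random_split (the fractions and seed are arbitrary choices):

    import pandas as pd
    import dask.dataframe as dd

    df = dd.from_pandas(pd.DataFrame({'x': range(100)}), npartitions=4)
    train, test = df.random_split([0.8, 0.2], random_state=0)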
4 votes, 1 answer

Using dask delayed to create dictionary values

I'm struggling to figure out how to get dask delayed to work on a particular workflow that involves creating a dictionary. The idea here is that func1, func2, func3 can run independently of each other at the same time, and I want the results of…
blahblahblah • 2,299 • 8 • 45 • 60
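
A sketch: dask.compute traverses built-in containers, so a dict whose values are delayed results can be computed in one call (the three functions are stand-ins):

    import dask
    from dask import delayed

    @delayed
    def func1():
        return 1

    @delayed
    def func2():
        return 2

    @delayed
    def func3():
        return 3

    lazy = {'a': func1(), 'b': func2(), 'c': func3()}  # values are delayed
    (results,) = dask.compute(lazy)                    # all three run in parallel
    # results == {'a': 1, 'b': 2, 'c': 3}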
4 votes, 1 answer

Redistribute dask tasks among the cluster

I am abusing dask as a task scheduler for long-running tasks with map(…, pure=False). So I am not interested in the dask graph; I just use dask as a way to distribute unix commands. Let's say I have 1000 tasks and they run for a week on a cluster of…
MaxBenChrist • 547 • 3 • 9
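
Queued tasks are already redistributed by the scheduler's work stealing; for finished results, client.rebalance spreads the data evenly over the current workers. A sketch (the scheduler address is assumed):

    from dask.distributed import Client

    client = Client('tcp://scheduler:8786')            # assumed address
    futures = client.map(lambda x: x * 2, range(1000), pure=False)
    client.rebalance(futures)   # move finished results evenly across workers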
4 votes, 2 answers

How do I capture dask-worker console logs in a file?

In the below, I want to capture "dask_client_log_msg" and other client-side logs in one file, and "dask_worker_log_msg" and other worker logs in a separate file. Obviously the client will run in a separate process altogether from the worker. So I need one…
TheCodeCache • 820 • 1 • 7 • 27
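
A sketch of one approach: distributed logs through the standard logging module, so a file handler can be attached inside every worker process via client.run ('distributed.worker' is the library's logger name; the path and address are assumptions):

    import logging
    from dask.distributed import Client

    def add_worker_file_log():
        handler = logging.FileHandler('/tmp/dask-worker.log')  # hypothetical path
        handler.setFormatter(logging.Formatter(
            '%(asctime)s %(name)s %(levelname)s %(message)s'))
        logging.getLogger('distributed.worker').addHandler(handler)

    client = Client('tcp://scheduler:8786')            # assumed address
    client.run(add_worker_file_log)                    # executes on every worker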
4 votes, 1 answer

Compute very slow when processing a large array

I'm trying to read in a 220 GB csv file with dask. Each line of this file has a name, a unique id, and the id of its parent. Each entry has multiple generations of parents; eventually I'd like to be able to reassemble the whole tree, but it's taking…
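
A hedged sketch of the usual first aids for this workload: pass explicit dtypes so read_csv skips type inference, and persist the parsed frame so repeated traversals don't re-read 220 GB (the path and column names are guesses from the excerpt):

    import dask.dataframe as dd

    df = dd.read_csv('tree.csv',                       # hypothetical path
                     dtype={'name': 'object', 'id': 'int64',
                            'parent_id': 'int64'})     # hypothetical columns
    df = df.persist()   # materialize once; later queries reuse the in-memory data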