Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 2 answers

Computation on sample of dask dataframe takes much longer than on all the data

I have a dask dataframe, backed by parquet. It's 131 million rows; when I do some basic operations on the whole frame they take a couple of minutes. df = dd.read_parquet('data_*.pqt') unique_locations = df.location.unique() https =…
birdsarah · 1,165
3 votes · 2 answers

Unable to install dask[complete]

I am trying to install dask[complete] via pip on Mac OS X, but I always get no matches found: dask[complete]. What is the best way to install the dask[complete] library on Mac OS X? pip install dask[complete] zsh: no matches found: dask[complete]
thotam · 941
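The `no matches found` error comes from zsh (the default shell on recent macOS), which treats `[...]` as a glob pattern and aborts when no file matches. Quoting or escaping the extras specifier avoids the expansion:

```shell
# zsh expands [...] as a filename glob; with no matching file it aborts
# with "no matches found". Quote (or escape) the extras specifier:
pip install "dask[complete]"

# equivalent escaped form:
pip install dask\[complete\]
```

In bash the unquoted form happens to work, because bash passes an unmatched glob through literally by default.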
3 votes · 1 answer

Vectorized method to format a column of integers into specified-length strings in both pandas dataframe and dask dataframe

I have a pandas Dataframe: date time user_id 0 20160921 5947 13079492369730773513 1 20160921 5948 13079492369730773513 2 20160921 235949 13079492369730773513 3 20160921 235950 13079492369730773513 4 20160921 …
RottenIvy · 63
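A vectorized sketch of the padding step, assuming (from the excerpt) that the `time` column should be zero-filled to six digits, `HHMMSS`-style:

```python
import pandas as pd

df = pd.DataFrame({"time": [5947, 5948, 235949, 235950]})

# Vectorized: cast to string, then left-pad with zeros to length 6.
df["time_str"] = df["time"].astype(str).str.zfill(6)
print(df["time_str"].tolist())  # → ['005947', '005948', '235949', '235950']
```

Dask dataframes expose the same `.str` accessor, so `ddf["time"].astype(str).str.zfill(6)` applies the identical operation per partition.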
3 votes · 1 answer

Dask is slow with many disk-read and disk-write blocks showing up in the status page

My Dask computation is slow. When I look at the status page of the diagnostics dashboard I see that most of the time is spent in disk-read-* and disk-write-* tasks. What does this mean? How do I diagnose this issue?
MRocklin · 55,641
3 votes · 1 answer

Directly running a task on a dedicated dask worker

A simple code snippet follows; comments marked with ### are important. from dask.distributed import Client ### this code-piece will get executed on a dask worker. def task_to_perform(): print("task in progress.") ## do something…
TheCodeCache · 820
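`Client.submit` accepts a `workers=` argument that pins a task to specific worker addresses. A sketch using a throwaway local cluster to discover an address (the function body is a hypothetical stand-in for the asker's task):

```python
from dask.distributed import Client, LocalCluster

def task_to_perform(x):
    # stand-in for the worker-side task
    return x * x

cluster = LocalCluster(n_workers=2, processes=False, dashboard_address=None)
client = Client(cluster)

# Pick one worker's address and pin the task to it.
addr = list(client.scheduler_info()["workers"])[0]
future = client.submit(task_to_perform, 10, workers=[addr])
result = future.result()
print(result)  # → 100

client.close()
cluster.close()
```

Adding `allow_other_workers=False` (the default) keeps the task strictly on the listed workers rather than treating them as a preference.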
3 votes · 1 answer

How do I convert a list of Pandas futures to a Dask Dataframe?

I have a list of Dask futures that point to Pandas dataframes: from dask.distributed import Client client = Client() import pandas as pd futures = client.map(pd.read_csv, filenames) How do I convert these to a Dask dataframe? note, I know that…
MRocklin · 55,641
3 votes · 0 answers

Dask: isin with further use of index to another dask dataframe

The order of the row.txt.gz and matrix.txt.gz files is identical. My purpose is to extract some rows from the dask dataframe built from 'row.txt.gz' and then extract rows from matrix.txt.gz using exactly the same index. # ROWS rows =…
chupvl · 1,258
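Since the two files share identical row order, one approach is to build a boolean mask with `isin` on the first frame and apply it positionally to the second. A pandas sketch with hypothetical column names (with dask, the same idea requires identically partitioned frames, e.g. via `map_partitions`):

```python
import pandas as pd

rows = pd.DataFrame({"id": ["r1", "r2", "r3", "r4"]})
matrix = pd.DataFrame({"v1": [1, 2, 3, 4], "v2": [5, 6, 7, 8]})

wanted = {"r2", "r4"}
mask = rows["id"].isin(wanted)

# Same row order in both frames, so the mask applies positionally.
subset = matrix[mask.to_numpy()]
print(subset["v1"].tolist())  # → [2, 4]
```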
3 votes · 2 answers

Store a Dask DataFrame as a pickle

I have a Dask DataFrame constructed as follows: import dask.dataframe as dd df = dd.read_csv('matrix.txt', header=None) type(df) # dask.dataframe.core.DataFrame Is there a way to save this DataFrame as a pickle? For…
Arjun · 817
3 votes · 1 answer

How can I compare two large CSV files using Dask

I have two CSV files (approx. 4GB each) and I want to check the difference between the entries of these two files. Suppose row 1 in 1.csv doesn't match row 1 of 2.csv but is identical to row 100 of 2.csv; then it shouldn't show any…
Saikat · 403
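An order-independent comparison of this kind can be expressed as a full outer merge with `indicator=True`, which labels each row as present in the left file, the right file, or both. A pandas sketch with hypothetical columns (`dask.dataframe.merge` takes the same arguments when the files are too large for memory):

```python
import pandas as pd

# Stand-ins for pd.read_csv('1.csv') and pd.read_csv('2.csv')
a = pd.DataFrame({"k": [1, 2, 3], "v": ["x", "y", "z"]})
b = pd.DataFrame({"k": [3, 1, 4], "v": ["z", "x", "w"]})

# Merge on all shared columns; _merge marks each row's origin,
# regardless of row order in either file.
diff = a.merge(b, how="outer", indicator=True)
only_in_one = diff[diff["_merge"] != "both"]
print(sorted(only_in_one["k"].tolist()))  # → [2, 4]
```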
3 votes · 1 answer

How to read a single parquet file from s3 into a dask dataframe?

I'm trying to read a single parquet file with snappy compression from s3 into a Dask Dataframe. There is no metadata directory, since this file was written using Spark 2.1. It does not work locally with fastparquet: import dask.dataframe as…
arinarmo · 375
3 votes · 0 answers

how to combine dask and classes?

I am trying to rewrite an entire project that has been developed with classes. Little by little, the heaviest computational chunks should be parallelized; clearly we have a lot of independent sequential loops. An example with classes that mimics…
Sergio Lucero · 862
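Bound methods work with `dask.delayed` just like plain functions, so independent loop iterations over a method can be parallelized without restructuring the class. A minimal sketch with a hypothetical class:

```python
import dask

class Model:
    def __init__(self, base):
        self.base = base

    def heavy_step(self, x):
        # stand-in for an expensive, independent computation
        return self.base + x

m = Model(100)

# Wrap each independent method call; dask builds a task graph
# and runs the calls in parallel on compute().
tasks = [dask.delayed(m.heavy_step)(i) for i in range(5)]
results = dask.compute(*tasks)
print(list(results))  # → [100, 101, 102, 103, 104]
```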
3 votes · 1 answer

Iterate sequentially over a dask bag

I need to submit the elements of a very large dask.bag to a non-threadsafe store, i.e. I need something like for x in dbag: store.add(x) I cannot use compute since the bag is too large to fit in memory. I need something more like…
Daniel Mahler · 7,653
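One pattern (a sketch, with a plain list standing in for the non-threadsafe store) is to convert the bag into per-partition delayed objects and compute them one at a time, so only a single partition is in memory while its elements are fed to the store sequentially:

```python
import dask.bag as db

store = []  # stand-in for the non-threadsafe store

bag = db.from_sequence(range(10), npartitions=4)

# Compute one partition at a time; elements stream through in order
# from a single consumer thread.
for part in bag.to_delayed():
    for x in part.compute():
        store.append(x)  # store.add(x) in the original question

print(store)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```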
3 votes · 1 answer

Set partitions on existing index in Dask dataframe

If I have an already indexed Dask dataframe with >>> A.divisions (None, None) >>> A.npartitions 1 and I want to set the divisions, so far I'm doing A.reset_index().set_index("index", divisions=sorted(divisions)) because…
astrojuanlu · 6,744
3 votes · 1 answer

With dask-distributed how to generate futures from long running tasks fed by queues

I'm using a dask-distributed long-running task along the lines of this example http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow where a long-running worker task gets its inputs from a queue as in the tensorflow example and delivers…
3 votes · 1 answer

Using dask delayed with functions returning lists

I am trying to use dask.delayed to build up a task graph. This mostly works quite nicely, but I regularly run into situations like this, where I have a number of delayed objects that have a method returning a list of objects of a length that is not…
tt293 · 500
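When a delayed call returns a list whose length is only known at run time, the graph cannot be fanned out into a fixed number of delayed items. One workaround (a sketch with hypothetical functions) is to keep the whole list as a single delayed value and have downstream delayed functions consume the entire list:

```python
import dask

@dask.delayed
def make_items(n):
    # the list's length is only known once this runs
    return list(range(n * 2))

@dask.delayed
def process_all(items):
    # consumes the whole runtime-sized list in one task
    return sum(items)

# The list never needs a known length at graph-construction time.
total = process_all(make_items(3)).compute()
print(total)  # → 15
```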