Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
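
A minimal sketch of how the two pieces fit together (this assumes a local installation of dask and dask.distributed; the array size and chunking are arbitrary):

    import dask.array as da
    from dask.distributed import Client

    client = Client()  # dynamic task scheduler (starts local workers by default)

    # "Big data" collection: a chunked array with a NumPy-like interface
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

    result = (x + x.T).mean(axis=0)  # only builds a lazy task graph
    print(result.compute())          # the scheduler executes the graph in parallel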

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

5 votes · 1 answer

Multiply chunked dask xarray by mask

I have a large (>100 GB) xarray Dataset holding weather forecast data (dimensions time, forecast step, latitude, longitude, with dask chunks over the time, latitude and longitude dimensions) and want to work out the average weather (for each time…
user7813790
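
A minimal sketch of one way to apply a mask to a dask-chunked xarray Dataset, assuming the forecast data and mask live in hypothetical NetCDF files, the dimension names follow the question, and the forecast-step dimension is called "step":

    import xarray as xr

    # chunks= makes xarray load the variables as dask arrays
    ds = xr.open_dataset("forecast.nc",
                         chunks={"time": 100, "latitude": 200, "longitude": 200})

    # 0/1 (or boolean) mask on the spatial dimensions; broadcasting aligns it
    mask = xr.open_dataarray("mask.nc",
                             chunks={"latitude": 200, "longitude": 200})

    masked = ds * mask                      # lazy, chunk-aligned multiplication
    mean_weather = masked.mean(dim="step")  # average over the forecast step
    mean_weather.to_netcdf("mean.nc")       # triggers the dask computation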

5 votes · 2 answers

Reading Parquet File with Array<…> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. an array<…> type). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
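
A hedged sketch of reading such a file with the pyarrow engine, which generally handles nested (list/struct) Parquet types better than fastparquet; the path and column names are made up:

    import dask.dataframe as dd

    # Nested columns written by PySpark are usually easier to read with pyarrow
    df = dd.read_parquet("output_from_spark/", engine="pyarrow")

    # If the nested column is not actually needed, selecting only the flat
    # columns avoids deserializing it at all
    flat = dd.read_parquet("output_from_spark/", engine="pyarrow",
                           columns=["id", "value"])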

5 votes · 3 answers

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this. All the dask-distributed examples/docs I see are populating the initial data load…
CoderOfTheNight
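
One way to pin pieces of data to particular machines is client.scatter with the workers= argument; a minimal sketch, with a made-up scheduler address, worker host, and data:

    import pandas as pd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")          # hypothetical scheduler

    part = pd.DataFrame({"x": range(1_000_000)})    # one subset of the data

    # Place this piece on a specific worker host; later tasks that use the
    # resulting future will prefer to run where the data already lives
    future = client.scatter(part, workers=["10.0.0.5"])
    length = client.submit(len, future, workers=["10.0.0.5"])
    print(length.result())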

5 votes · 1 answer

Trigger Dask workers to release memory

I'm distributing the computation of some functions using Dask. My general layout looks like this: from dask.distributed import Client, LocalCluster, as_completed cluster = LocalCluster(processes=config.use_dask_local_processes, …
gallamine
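
A sketch of the usual escalation for getting memory back from workers: drop the futures that pin the data, then garbage-collect or restart; the workload below is a stand-in:

    import gc
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    futures = client.map(pow, range(1000), range(1000))
    results = client.gather(futures)

    # Worker memory is tied to the futures; dropping every reference frees it
    client.cancel(futures)
    del futures

    # If the processes still hold on to memory:
    client.run(gc.collect)   # force a garbage-collection pass on every worker
    client.restart()         # last resort: restart all worker processes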

5 votes · 1 answer

Dask DataFrame calculate mean within multi-column groupings

I have a data frame as shown in the image; what I want to do is take the mean along the column 'trial'. That is, for every subject, condition, and sample (when all three of these columns have value one), take the average of the data along the 'trial' column (100 rows). What…
Talha Anwar
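
A minimal sketch of a multi-column groupby mean in dask (the input file and the 'data' column name are assumptions; the grouping columns follow the question):

    import dask.dataframe as dd

    df = dd.read_csv("trials.csv")   # hypothetical input

    # Mean of 'data' within each (subject, condition, sample) group,
    # i.e. averaging over the ~100 trial rows in every group
    means = df.groupby(["subject", "condition", "sample"])["data"].mean().compute()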

5 votes · 1 answer

Many distributed dask workers idle after one evaluation, or never receive any work, when there are more tasks

We’re using dask to optimize deep-learner (DL) architectures by generating designs and then sending them to dask workers that, in turn, use pytorch for training. We observe some of the workers do not appear to start, and those that do complete…
Mark Coletti
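
One common pattern for keeping many workers busy with independent training jobs is client.map plus as_completed; a hedged sketch with placeholder functions and addresses (it shows the submission pattern only, not a fix for the scheduling issue described above):

    from dask.distributed import Client, as_completed

    client = Client("scheduler-host:8786")     # hypothetical scheduler

    def train(design):
        # stand-in for the real PyTorch training routine
        return design, 0.0

    designs = [{"layers": n} for n in range(2, 50)]
    futures = client.map(train, designs)

    # Results stream back as soon as any worker finishes, which also makes it
    # easy to notice workers that never pick up work
    for future in as_completed(futures):
        design, score = future.result()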

5 votes · 1 answer

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames and then into a final dask.dataframe. I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df)…
CFabry
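
A minimal sketch of going file → delayed dict → delayed DataFrame → dask DataFrame; make_dict and make_df below are simplified stand-ins for the question's functions:

    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    def make_dict(path):                 # really a complex nested dictionary
        return {"file": path, "value": 1}

    def make_df(d):                      # flatten one dict into a pandas DataFrame
        return pd.DataFrame([d])

    files = ["a.json", "b.json"]         # hypothetical inputs

    parts = [delayed(make_df)(delayed(make_dict)(f)) for f in files]
    ddf = dd.from_delayed(parts)         # stitch the pieces into one dask DataFrame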

5 votes · 1 answer

How and when to use client.scatter correctly in Dask

When executing a "large" number of tasks I am receiving this error: "Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers". I am also getting a bunch of messages like…
muammar
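
The warning usually means a large object is being embedded in many task definitions; scattering it once and passing the future instead keeps it off the scheduler. A minimal sketch with made-up data:

    import numpy as np
    from dask.distributed import Client

    client = Client()

    big_lookup = np.random.random((10_000, 1_000))   # large object reused by many tasks

    # Ship it to the workers once; broadcast=True copies it to every worker
    lookup_future = client.scatter(big_lookup, broadcast=True)

    def process(i, lookup):
        return lookup[i % len(lookup)].sum()

    futures = [client.submit(process, i, lookup_future) for i in range(1000)]
    results = client.gather(futures)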

5 votes · 2 answers

Dask store/read a sparse matrix that doesn't fit in memory

I'm using sparse to construct, store, and read a large sparse matrix. I'd like to use Dask arrays to use its blocked algorithms features. Here's a simplified version of what I'm trying to do: file_path = './{}'.format('myfile.npz') if…
Diego Castillo
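
A hedged sketch of one way to keep the full matrix out of memory: store one .npz per chunk with the sparse library and load each piece lazily into a dask array (file names and shapes are made up):

    import dask.array as da
    import sparse
    from dask import delayed

    files = ["chunk-0.npz", "chunk-1.npz", "chunk-2.npz"]   # one file per chunk
    chunk_shape = (100_000, 10_000)

    pieces = [
        da.from_delayed(delayed(sparse.load_npz)(f),
                        shape=chunk_shape, dtype="float64")
        for f in files
    ]
    x = da.concatenate(pieces, axis=0)      # larger-than-memory sparse dask array
    print(x.sum(axis=0).compute())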

5 votes · 2 answers

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide. In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes). In my case, I have an 82GB file and 288 workers (12 physical nodes;…
jonathan
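
A minimal sketch of reading CSVs straight from HDFS with dask.distributed; the scheduler address, path, and column names are placeholders, and the HDFS filesystem bindings (e.g. pyarrow) must be available on every worker:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")              # hypothetical scheduler

    df = dd.read_csv("hdfs:///data/big-file-*.csv",     # hypothetical path
                     blocksize="128MB")                  # controls partition count

    print(df.groupby("key")["value"].mean().compute())  # 'key'/'value' are made up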

5 votes · 1 answer

Why does Dask fill in "foo" and 1 in my Dataframe

I've read in around 15 csv files: df = dd.read_csv("gs://project/*.csv", blocksize=25e6, storage_options={'token': fs.session.credentials}) Then I persisted the Dataframe (it uses 7.33 GB memory): df = df.persist() I set a new…
Stanko
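
The "foo" and 1 values are typically dask's schema-inference placeholders: when it cannot infer the output of an operation, it runs the function on a tiny dummy frame filled with "foo" (for strings) and 1 (for numbers), and those values can surface when that inferred metadata is shown or used. Passing meta explicitly avoids the guess. A sketch with hypothetical columns:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.read_csv("gs://project/*.csv")    # credentials/storage_options omitted

    def add_ratio(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.copy()
        pdf["ratio"] = pdf["a"] / pdf["b"]     # 'a' and 'b' are hypothetical columns
        return pdf

    # Without meta=, dask would call add_ratio on a dummy "foo"/1 frame
    # to infer the result's columns and dtypes; df._meta is the empty
    # metadata frame dask already holds for df
    meta = df._meta.assign(ratio=0.0)
    result = df.map_partitions(add_ratio, meta=meta)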

5 votes · 1 answer

How to subset one row in dask.dataframe?

I am trying to select only one row from a dask.dataframe using the command x.loc[0].compute(). It returns 4 rows, all of which have index=0. I tried reset_index, but there are still 4 rows having index=0 after resetting. (I think I did reset…
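
After read_csv (and after dask's reset_index) every partition starts its own index at 0, so .loc[0] legitimately matches one row per partition. A hedged sketch of two common alternatives (file and column names are made up):

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")

    first_row = df.head(1)                      # first row of the first partition

    # For an arbitrary single row, filter on a column that really is unique
    one_row = df[df["id"] == 12345].compute()   # 'id' is a hypothetical unique key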

5 votes · 2 answers

Dask dataframe - split column into multiple rows based on delimiter

What is an efficient way of splitting a column into multiple rows using a dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

    id  var1  var2
    1   A     Z,Y
    2   B     X
    3   C     W,U,V

I…
ltt
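
A hedged sketch of doing the split with plain pandas inside each partition via map_partitions (recent dask versions also expose DataFrame.explode directly); the input file is hypothetical and the columns follow the question:

    import pandas as pd
    import dask.dataframe as dd

    df = dd.read_csv("input.csv")     # columns: id, var1, var2

    def split_var2(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.assign(var2=pdf["var2"].str.split(","))
        return pdf.explode("var2")    # one row per delimited value

    result = df.map_partitions(split_var2)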

5 votes · 3 answers

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public. df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False}) This is not recommended for obvious reasons. How do I load the data…
shantanuo
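
Instead of making the bucket public, credentials can be passed through storage_options (they are forwarded to s3fs); the key/secret values below are placeholders, and s3fs also picks up the usual AWS environment variables and ~/.aws/credentials on its own:

    import dask.dataframe as dd

    df = dd.read_csv(
        "s3://mybucket/some-big.csv",
        storage_options={"key": "PLACEHOLDER_KEY", "secret": "PLACEHOLDER_SECRET"},
    )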

5 votes · 1 answer

Get unique rows of dask array without using dask dataframe

Is there a way of getting unique rows of a dask array that is larger than the available memory? Ideally, without converting it to a dask DataFrame? I currently use this approach: import dask.array as da import dask.dataframe as dd dx =…
Edgar H
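
A hedged sketch of one block-wise approach that stays in dask.array: deduplicate each block with np.unique(axis=0), then make a final pass over the combined result (this assumes the set of distinct rows fits in memory; the array here is synthetic):

    import numpy as np
    import dask.array as da
    from dask import delayed

    x = da.random.randint(0, 5, size=(1_000_000, 3), chunks=(100_000, 3))

    # per-block deduplication, evaluated lazily
    parts = [delayed(np.unique)(block, axis=0) for block in x.to_delayed().ravel()]
    combined = delayed(np.concatenate)(parts, axis=0)

    unique_rows = np.unique(combined.compute(), axis=0)   # final pass in memory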