Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
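
A minimal sketch of how the two pieces fit together (this assumes a local installation of dask and dask.distributed; the array size and chunking are arbitrary):

    import dask.array as da
    from dask.distributed import Client

    client = Client()  # dynamic task scheduler (starts local workers by default)

    # "Big data" collection: a chunked array with a NumPy-like interface
    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

    result = (x + x.T).mean(axis=0)  # only builds a lazy task graph
    print(result.compute())          # the scheduler executes the graph in parallel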

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

5 votes · 1 answer

Multiply chunked dask xarray by mask

I have a large (>100 GB) xarray Dataset holding weather forecast data (dimensions time, forecast step, latitude, longitude, with dask chunks over the time, latitude and longitude dimensions) and want to work out the average weather (for each time…
user7813790
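
A minimal sketch of one way to apply a mask to a dask-chunked xarray Dataset, assuming the forecast data and mask live in hypothetical NetCDF files, the dimension names follow the question, and the forecast-step dimension is called "step":

    import xarray as xr

    # chunks= makes xarray load the variables as dask arrays
    ds = xr.open_dataset("forecast.nc",
                         chunks={"time": 100, "latitude": 200, "longitude": 200})

    # 0/1 (or boolean) mask on the spatial dimensions; broadcasting aligns it
    mask = xr.open_dataarray("mask.nc",
                             chunks={"latitude": 200, "longitude": 200})

    masked = ds * mask                      # lazy, chunk-aligned multiplication
    mean_weather = masked.mean(dim="step")  # average over the forecast step
    mean_weather.to_netcdf("mean.nc")       # triggers the dask computation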

5 votes · 2 answers

Reading Parquet File with Array<…> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. an array<…> type). An example of the df would be: import pandas as pd df = pd.DataFrame.from_records([ (1,…
Jon.H
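
A hedged sketch of reading such a file with the pyarrow engine, which generally handles nested (list/struct) Parquet types better than fastparquet; the path and column names are made up:

    import dask.dataframe as dd

    # Nested columns written by PySpark are usually easier to read with pyarrow
    df = dd.read_parquet("output_from_spark/", engine="pyarrow")

    # If the nested column is not actually needed, selecting only the flat
    # columns avoids deserializing it at all
    flat = dd.read_parquet("output_from_spark/", engine="pyarrow",
                           columns=["id", "value"])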

5 votes · 3 answers

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this. All the dask-distributed examples/docs I see are populating the initial data load…
CoderOfTheNight
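
One way to pin pieces of data to particular machines is client.scatter with the workers= argument; a minimal sketch, with a made-up scheduler address, worker host, and data:

    import pandas as pd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")          # hypothetical scheduler

    part = pd.DataFrame({"x": range(1_000_000)})    # one subset of the data

    # Place this piece on a specific worker host; later tasks that use the
    # resulting future will prefer to run where the data already lives
    future = client.scatter(part, workers=["10.0.0.5"])
    length = client.submit(len, future, workers=["10.0.0.5"])
    print(length.result())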

5 votes · 1 answer

Trigger Dask workers to release memory

I'm distributing the computation of some functions using Dask. My general layout looks like this: from dask.distributed import Client, LocalCluster, as_completed cluster = LocalCluster(processes=config.use_dask_local_processes, …
gallamine
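
A sketch of the usual escalation for getting memory back from workers: drop the futures that pin the data, then garbage-collect or restart; the workload below is a stand-in:

    import gc
    from dask.distributed import Client, LocalCluster

    cluster = LocalCluster(processes=True)
    client = Client(cluster)

    futures = client.map(pow, range(1000), range(1000))
    results = client.gather(futures)

    # Worker memory is tied to the futures; dropping every reference frees it
    client.cancel(futures)
    del futures

    # If the processes still hold on to memory:
    client.run(gc.collect)   # force a garbage-collection pass on every worker
    client.restart()         # last resort: restart all worker processes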

5 votes · 1 answer

Dask DataFrame calculate mean within multi-column groupings

I have a data frame as shown in the image; what I want to do is take the mean along the column 'trial'. That is, for every subject, condition, and sample (when all three of these columns have value one), take the average of the data along the 'trial' column (100 rows). What…
Talha Anwar
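
A minimal sketch of a multi-column groupby mean in dask (the input file and the 'data' column name are assumptions; the grouping columns follow the question):

    import dask.dataframe as dd

    df = dd.read_csv("trials.csv")   # hypothetical input

    # Mean of 'data' within each (subject, condition, sample) group,
    # i.e. averaging over the ~100 trial rows in every group
    means = df.groupby(["subject", "condition", "sample"])["data"].mean().compute()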

5 votes · 1 answer

Many distributed dask workers idle after one evaluation, or never receive any work, when there are more tasks

We’re using dask to optimize deep-learner (DL) architectures by generating designs and then sending them to dask workers that, in turn, use pytorch for training. We observe some of the workers do not appear to start, and those that do complete…
Mark Coletti
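
One common pattern for keeping many workers busy with independent training jobs is client.map plus as_completed; a hedged sketch with placeholder functions and addresses (it shows the submission pattern only, not a fix for the scheduling issue described above):

    from dask.distributed import Client, as_completed

    client = Client("scheduler-host:8786")     # hypothetical scheduler

    def train(design):
        # stand-in for the real PyTorch training routine
        return design, 0.0

    designs = [{"layers": n} for n in range(2, 50)]
    futures = client.map(train, designs)

    # Results stream back as soon as any worker finishes, which also makes it
    # easy to notice workers that never pick up work
    for future in as_completed(futures):
        design, score = future.result()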

5 votes · 1 answer

convert dask.bag of dictionaries to dask.dataframe using dask.delayed and pandas.DataFrame

I am struggling to convert a dask.bag of dictionaries into dask.delayed pandas.DataFrames and then into a final dask.dataframe. I have one function (make_dict) that reads files into a rather complex nested dictionary structure and another function (make_df)…
CFabry
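
A minimal sketch of going file → delayed dict → delayed DataFrame → dask DataFrame; make_dict and make_df below are simplified stand-ins for the question's functions:

    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed

    def make_dict(path):                 # really a complex nested dictionary
        return {"file": path, "value": 1}

    def make_df(d):                      # flatten one dict into a pandas DataFrame
        return pd.DataFrame([d])

    files = ["a.json", "b.json"]         # hypothetical inputs

    parts = [delayed(make_df)(delayed(make_dict)(f)) for f in files]
    ddf = dd.from_delayed(parts)         # stitch the pieces into one dask DataFrame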

5 votes · 1 answer

How and when to use client.scatter correctly in Dask

When executing a "large" number of tasks I am receiving this error: "Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers". I am also getting a bunch of messages like…
muammar
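
The warning usually means a large object is being embedded in many task definitions; scattering it once and passing the future instead keeps it off the scheduler. A minimal sketch with made-up data:

    import numpy as np
    from dask.distributed import Client

    client = Client()

    big_lookup = np.random.random((10_000, 1_000))   # large object reused by many tasks

    # Ship it to the workers once; broadcast=True copies it to every worker
    lookup_future = client.scatter(big_lookup, broadcast=True)

    def process(i, lookup):
        return lookup[i % len(lookup)].sum()

    futures = [client.submit(process, i, lookup_future) for i in range(1000)]
    results = client.gather(futures)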

5 votes · 2 answers

Dask store/read a sparse matrix that doesn't fit in memory

I'm using sparse to construct, store, and read a large sparse matrix. I'd like to use Dask arrays to use its blocked algorithms features. Here's a simplified version of what I'm trying to do: file_path = './{}'.format('myfile.npz') if…
Diego Castillo
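
A hedged sketch of one way to keep the full matrix out of memory: store one .npz per chunk with the sparse library and load each piece lazily into a dask array (file names and shapes are made up):

    import dask.array as da
    import sparse
    from dask import delayed

    files = ["chunk-0.npz", "chunk-1.npz", "chunk-2.npz"]   # one file per chunk
    chunk_shape = (100_000, 10_000)

    pieces = [
        da.from_delayed(delayed(sparse.load_npz)(f),
                        shape=chunk_shape, dtype="float64")
        for f in files
    ]
    x = da.concatenate(pieces, axis=0)      # larger-than-memory sparse dask array
    print(x.sum(axis=0).compute())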

5 votes · 2 answers

Dask Distributed: Reading .csv from HDFS

I'm performance testing Dask using "Distributed Pandas on a Cluster with Dask DataFrames" as a guide. In Matthew's example, he has a 20GB file and 64 workers (8 physical nodes). In my case, I have an 82GB file and 288 workers (12 physical nodes;…
jonathan
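
A minimal sketch of reading CSVs straight from HDFS with dask.distributed; the scheduler address, path, and column names are placeholders, and the HDFS filesystem bindings (e.g. pyarrow) must be available on every worker:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")              # hypothetical scheduler

    df = dd.read_csv("hdfs:///data/big-file-*.csv",     # hypothetical path
                     blocksize="128MB")                  # controls partition count

    print(df.groupby("key")["value"].mean().compute())  # 'key'/'value' are made up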

5 votes · 1 answer

Why does Dask fill in "foo" and 1 in my Dataframe

I've read in around 15 csv files: df = dd.read_csv("gs://project/*.csv", blocksize=25e6, storage_options={'token': fs.session.credentials}) Then I persisted the Dataframe (it uses 7.33 GB memory): df = df.persist() I set a new…
Stanko
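
The "foo" and 1 values are typically dask's schema-inference placeholders: when it cannot infer the output of an operation, it runs the function on a tiny dummy frame filled with "foo" (for strings) and 1 (for numbers), and those values can surface when that inferred metadata is shown or used. Passing meta explicitly avoids the guess. A sketch with hypothetical columns:

    import dask.dataframe as dd
    import pandas as pd

    df = dd.read_csv("gs://project/*.csv")    # credentials/storage_options omitted

    def add_ratio(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.copy()
        pdf["ratio"] = pdf["a"] / pdf["b"]     # 'a' and 'b' are hypothetical columns
        return pdf

    # Without meta=, dask would call add_ratio on a dummy "foo"/1 frame
    # to infer the result's columns and dtypes; df._meta is the empty
    # metadata frame dask already holds for df
    meta = df._meta.assign(ratio=0.0)
    result = df.map_partitions(add_ratio, meta=meta)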

5 votes · 1 answer

How to subset one row in dask.dataframe?

I am trying to select only one row from a dask.dataframe using the command x.loc[0].compute(). It returns 4 rows, all of which have index=0. I tried reset_index, but there are still 4 rows having index=0 after resetting. (I think I did reset…
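
After read_csv (and after dask's reset_index) every partition starts its own index at 0, so .loc[0] legitimately matches one row per partition. A hedged sketch of two common alternatives (file and column names are made up):

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")

    first_row = df.head(1)                      # first row of the first partition

    # For an arbitrary single row, filter on a column that really is unique
    one_row = df[df["id"] == 12345].compute()   # 'id' is a hypothetical unique key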

5 votes · 2 answers

Dask dataframe - split column into multiple rows based on delimiter

What is an efficient way of splitting a column into multiple rows using a dask dataframe? For example, let's say I have a csv file which I read using dask to produce the following dask dataframe:

    id  var1  var2
    1   A     Z,Y
    2   B     X
    3   C     W,U,V

I…
ltt
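
A hedged sketch of doing the split with plain pandas inside each partition via map_partitions (recent dask versions also expose DataFrame.explode directly); the input file is hypothetical and the columns follow the question:

    import pandas as pd
    import dask.dataframe as dd

    df = dd.read_csv("input.csv")     # columns: id, var1, var2

    def split_var2(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.assign(var2=pdf["var2"].str.split(","))
        return pdf.explode("var2")    # one row per delimited value

    result = df.map_partitions(split_var2)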

5 votes · 3 answers

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public. df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False}) This is not recommended for obvious reasons. How do I load the data…
shantanuo
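
Instead of making the bucket public, credentials can be passed through storage_options (they are forwarded to s3fs); the key/secret values below are placeholders, and s3fs also picks up the usual AWS environment variables and ~/.aws/credentials on its own:

    import dask.dataframe as dd

    df = dd.read_csv(
        "s3://mybucket/some-big.csv",
        storage_options={"key": "PLACEHOLDER_KEY", "secret": "PLACEHOLDER_SECRET"},
    )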

5 votes · 1 answer

Get unique rows of dask array without using dask dataframe

Is there a way of getting unique rows of a dask array that is larger than the available memory? Ideally, without converting it to a dask DataFrame? I currently use this approach: import dask.array as da import dask.dataframe as dd dx =…
Edgar H
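
A hedged sketch of one block-wise approach that stays in dask.array: deduplicate each block with np.unique(axis=0), then make a final pass over the combined result (this assumes the set of distinct rows fits in memory; the array here is synthetic):

    import numpy as np
    import dask.array as da
    from dask import delayed

    x = da.random.randint(0, 5, size=(1_000_000, 3), chunks=(100_000, 3))

    # per-block deduplication, evaluated lazily
    parts = [delayed(np.unique)(block, axis=0) for block in x.to_delayed().ravel()]
    combined = delayed(np.concatenate)(parts, axis=0)

    unique_rows = np.unique(combined.compute(), axis=0)   # final pass in memory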