Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
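
A minimal sketch of both pieces together, using only public Dask APIs (the data here is a placeholder):

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Collection side: a Dask DataFrame mirrors the pandas API but is split into
# partitions and evaluated lazily.
pdf = pd.DataFrame({"x": range(1_000)})
ddf = dd.from_pandas(pdf, npartitions=4)
print(ddf["x"].mean().compute())

# Scheduling side: dask.delayed turns plain functions into lazy tasks whose
# dependency graph the scheduler executes in parallel.
@dask.delayed
def inc(i):
    return i + 1

total = dask.delayed(sum)([inc(i) for i in range(5)])
print(total.compute())  # 15
```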

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4 votes, 3 answers

Merging pandas Data frames uses way too much memory

I'm working on this Kaggle competition as the final project for the course I'm taking, and for that, I was trying to replicate this notebook but there is a function he uses to get the lagged features that is just using way too much memory for me.…
João Areias
  • 1,192
  • 11
  • 41
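
Not the notebook's own code, but a small sketch of the usual workaround, with made-up column names: keep the join lazy with dask.dataframe so the merge runs partition by partition instead of materialising everything at once.

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical stand-ins for the competition tables.
sales = pd.DataFrame({"item_id": [1, 2, 3, 1], "qty": [10, 20, 30, 5]})
items = pd.DataFrame({"item_id": [1, 2, 3], "category": ["a", "b", "a"]})

# Wrap both frames so the merge is planned lazily and executed per partition.
sales_dd = dd.from_pandas(sales, npartitions=4)
items_dd = dd.from_pandas(items, npartitions=1)

lagged_base = sales_dd.merge(items_dd, on="item_id", how="left")
print(lagged_base.compute())  # work (and memory use) only happens here
```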
4 votes, 1 answer

Truth of Delayed objects is not Supported

I'm using dask to delay computation of some functions that return series in my code-base. Most operations seem to behave as expected so far - apart from my use of np.average. The function I have returns a pd.Series which I then want to compute a…
freebie
  • 2,161
  • 2
  • 19
  • 36
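
A sketch of one way around the error, assuming the series comes back from a delayed function: defer the np.average call itself rather than handing NumPy a live Delayed object.

```python
import numpy as np
import pandas as pd
import dask

@dask.delayed
def load_series():
    # stand-in for the real function that returns a pd.Series
    return pd.Series([1.0, 2.0, 3.0])

s = load_series()

# np.average(s) would inspect the Delayed eagerly and can raise
# "Truth of Delayed objects is not supported"; wrapping the call keeps it lazy.
avg = dask.delayed(np.average)(s)
print(avg.compute())  # 2.0
```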
4 votes, 1 answer

Can't convert dataframe column data types

After processing a big data set using Pandas/Dask, I saved the resulting data frame to a csv file. When I try to read the output CSV using Dask, the data types are all objects by default. Whenever I try to convert them using conventional methods…
GRoutar
  • 1,311
  • 1
  • 15
  • 38
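
A minimal sketch, with invented file and column names: declaring dtypes (and date columns) at read time usually works better than converting object columns after the fact.

```python
import dask.dataframe as dd

df = dd.read_csv(
    "output.csv",                                    # hypothetical path
    dtype={"user_id": "int64", "score": "float64"},  # assumed columns
    parse_dates=["created_at"],                      # assumed datetime column
)

# Any remaining object columns can still be converted lazily afterwards.
df["score"] = df["score"].astype("float32")
print(df.dtypes)
```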
4 votes, 1 answer

Tornado unexpected exception in Future after timeout

I have set up a dask cluster. I can access the web dashboard, but when I try to connect to the scheduler: from dask.distributed import Client client = Client('192.168.0.10:8786') I get the following error: tornado.application - ERROR - Exception…
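
A sketch of the usual first checks when the dashboard is reachable but Client() still fails: give the connection more time and confirm that client, scheduler, and workers run matching versions (the address is the one from the question).

```python
from dask.distributed import Client

# A longer timeout separates "slow to respond" from "unreachable", and
# get_versions(check=True) raises if the installed versions disagree, which
# is a common cause of tornado errors right after connecting.
client = Client("tcp://192.168.0.10:8786", timeout="30s")
print(client.get_versions(check=True))
```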
4 votes, 1 answer

Efficient way to stack Dask Arrays generated from Xarray

So I am trying to read a large number of relatively large NetCDF files containing hydrologic data. The NetCDF files all look like this: Dimensions: (feature_id: 2729077, reference_time: 1, time: 1) Coordinates: * time …
pythonweb
  • 1,024
  • 2
  • 11
  • 26
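
Rather than stacking per-file arrays by hand, one common approach is to let xarray open and concatenate all the NetCDF files lazily as dask-backed arrays; the filename pattern, variable name, and chunk sizes below are assumptions.

```python
import xarray as xr

ds = xr.open_mfdataset(
    "nwm_output_*.nc",              # hypothetical file pattern
    combine="nested",
    concat_dim="time",              # stack the per-file time slices
    chunks={"feature_id": 500_000}  # keep each dask chunk a manageable size
)
streamflow = ds["streamflow"]       # assumed variable name; backed by a dask array
print(streamflow)
```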
4 votes, 3 answers

Importing a Dask dataframe gives an error cannot import name 'is_datetime64tz_dtype'

I installed Dask in my Jupyter notebook using the command below: !pip install "dask[complete]" After this, when I run the import command import dask.dataframe as dd I get the error below. ImportError Traceback (most…
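
That ImportError usually points to a dask/pandas version mismatch rather than a broken install; a sketch of the usual fix is to upgrade both together in the same environment and restart the kernel.

```python
# In a Jupyter cell, upgrade both packages together (note the straight quotes):
# !pip install --upgrade "dask[complete]" pandas
# ...then restart the kernel before importing again.

import dask
import pandas as pd
print(dask.__version__, pd.__version__)

import dask.dataframe as dd  # should now import without the ImportError
```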
4 votes, 2 answers

Efficient pairwise comparison of rows in pandas DataFrame

I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory. What I would like to do is compare each row with other…
Fred Byrd
  • 233
  • 2
  • 6
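
A sketch of one way to spread the pairwise work out with dask.bag; the compare function here is a made-up placeholder for whatever row similarity the question actually needs.

```python
import itertools
import pandas as pd
import dask.bag as db

df = pd.DataFrame({"a": ["x", "y", "x", "z"], "b": ["p", "p", "q", "q"]})

def compare(pair):
    i, j = pair
    # placeholder metric: number of columns where the two rows agree
    return i, j, int((df.iloc[i] == df.iloc[j]).sum())

# Partition the index pairs so comparisons run in parallel across threads
# (threads avoid copying the frame to worker processes).
pairs = db.from_sequence(itertools.combinations(range(len(df)), 2), npartitions=4)
print(pairs.map(compare).compute(scheduler="threads")[:5])
```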
4 votes, 1 answer

write dask dataframe to one file

I can write a massive dask data frame to disk like so: raw_data.to_csv(r'C:\Bla\SubFolder\*.csv') This produces chunked data of the original (massaged) dataset in the subfolder: C:\Bla\SubFolder\ Just wondering, can I force dask to write the data…
cs0815
  • 16,751
  • 45
  • 136
  • 299
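
A sketch of two options, assuming the result is still wanted as a single CSV: newer Dask versions can write one file directly, and if the frame fits in memory, computing to pandas and writing once also works.

```python
import pandas as pd
import dask.dataframe as dd

raw_data = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=3)

# Option 1: supported in recent Dask releases.
raw_data.to_csv("combined.csv", single_file=True, index=False)

# Option 2: only if the result comfortably fits in memory.
raw_data.compute().to_csv("combined_pandas.csv", index=False)
```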
4 votes, 1 answer

Dask DataFrame .head() very slow after indexing

Not reproducible, but can someone fill in why a .head() call is greatly slowed after indexing? import dask.dataframe as dd df = dd.read_parquet("Filepath") df.head() # takes 10 seconds df = df.set_index('id') df.head() # takes 10 minutes +
AZhao
  • 13,617
  • 7
  • 31
  • 54
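
A sketch of why this happens and two ways around it: set_index normally implies a full shuffle, and head() has to finish that shuffle before it can return anything. The path and column name are taken from the question.

```python
import dask.dataframe as dd

df = dd.read_parquet("Filepath")

# If 'id' is already sorted on disk, say so and the shuffle is skipped.
df = df.set_index("id", sorted=True)

# Otherwise, pay for the shuffle once and keep the result in memory:
# df = df.set_index("id").persist()

print(df.head())
```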
4 votes, 1 answer

Switching from multiprocess to multithreaded Dask.DataFrame

I have a question about using dask to parallelize my code. I have a pandas dataframe and an 8-core CPU, so I want to apply some function row-wise. Here is an example: import dask.dataframe as dd from dask.multiprocessing import get # o - is pandas…
zhc
  • 53
  • 1
  • 5
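
A sketch of the current way to switch schedulers, with a placeholder row function: the old dask.multiprocessing.get import is no longer needed, since the scheduler is picked at compute time.

```python
import pandas as pd
import dask.dataframe as dd

o = pd.DataFrame({"a": range(100)})      # stand-in for the pandas frame
ddf = dd.from_pandas(o, npartitions=8)   # roughly one partition per core

def row_fn(row):
    return row["a"] * 2                  # placeholder row-wise function

result = ddf.apply(row_fn, axis=1, meta=("a", "int64"))

# Threads share memory (good when the work releases the GIL, e.g. NumPy);
# processes sidestep the GIL at the cost of serialising data between workers.
print(result.compute(scheduler="threads").head())
# print(result.compute(scheduler="processes").head())
```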
4 votes, 1 answer

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be 'map'ed to various partitions of a dask dataframe which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I…
PhaKuDi
  • 141
  • 8
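
A sketch of the map_partitions pattern with a made-up "fold" that changes the shape: returning a differently shaped frame is fine as long as meta describes the new columns and dtypes.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"t": pd.date_range("2024-01-01", periods=96, freq="h"),
                    "value": range(96)})
ddf = dd.from_pandas(pdf, npartitions=4)

def daily_fold(part):
    # collapse each partition down to one row per calendar day
    return (part.groupby(part["t"].dt.date)["value"]
                .sum()
                .reset_index(name="total"))

meta = pd.DataFrame({"t": pd.Series(dtype="object"),
                     "total": pd.Series(dtype="int64")})
folded = ddf.map_partitions(daily_fold, meta=meta)
print(folded.compute())
```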
4 votes, 3 answers

dask read_parquet with pyarrow memory blow up

I am using dask to write and read parquet. I am writing using the fastparquet engine and reading using the pyarrow engine. My worker has 1 GB of memory. With fastparquet the memory usage is fine, but when I switch to pyarrow, it just blows up and causes the…
pranav kohli
  • 123
  • 2
  • 6
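
A sketch of the usual knobs for keeping the pyarrow read small, with an invented path and column subset: read only the columns actually needed and split large row groups into more partitions.

```python
import dask.dataframe as dd

df = dd.read_parquet(
    "data.parquet",               # hypothetical path
    engine="pyarrow",
    columns=["item_id", "qty"],   # assumed subset of columns actually needed
    split_row_groups=True,        # available in recent Dask versions
)
print(df.memory_usage(deep=True).compute())
```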
4 votes, 1 answer

Creating a dask bag from a generator

I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory. delayed_array = [delayed(generator) for generator in list_of_generators] my_bag =…
danodonovan
  • 19,636
  • 10
  • 70
  • 78
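
A sketch of the usual pattern: a Delayed has to return a concrete partition (e.g. a list), not a live generator, so wrap a small function that exhausts one generator at a time and feed those to dask.bag.from_delayed.

```python
import dask
import dask.bag as db

def make_generator(n):
    # stand-in for one of the too-large-for-memory generators
    return (i * i for i in range(n))

@dask.delayed
def one_partition(n):
    return list(make_generator(n))  # materialised only when the task runs

partitions = [one_partition(n) for n in (10, 20, 30)]
my_bag = db.from_delayed(partitions)
print(my_bag.take(5))
```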
4 votes, 2 answers

Including keyword arguments (kwargs) in custom Dask graphs

I'm building a custom graph for one operation with Dask. I'm familiar with how to pass arguments to a function in a Dask graph and have read up on the docs. However, I still seem to be missing something. One of the functions used in the Dask graph takes…
jakirkham
  • 685
  • 5
  • 18
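
A sketch of one portable way to get keyword arguments into a hand-written graph, using functools.partial so the task tuple stays positional (the function and key names are made up); recent Dask versions also document an apply helper for the same purpose.

```python
from functools import partial
from dask.threaded import get

def scale(x, factor=1):
    return x * factor

# Bake the kwarg into the callable; the graph itself only sees positional args.
dsk = {
    "a": 10,
    "b": (partial(scale, factor=3), "a"),
}
print(get(dsk, "b"))  # 30
```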
4 votes, 3 answers

parallelize conversion of a single 16M row csv to Parquet with dask

The following operation works, but takes nearly 2h: from dask import dataframe as ddf ddf.read_csv('data.csv').to_parquet('data.pq') Is there a way to parallelize this? The file data.csv is ~2G uncompressed with 16 million rows by 22 columns.
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
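
A sketch of the usual lever, assuming the slowness comes from the CSV being parsed as a single partition: a smaller blocksize splits it into many partitions so parsing and writing run on all cores (the paths are the ones from the question).

```python
from dask import dataframe as ddf
from dask.distributed import Client

if __name__ == "__main__":
    client = Client()  # local cluster, roughly one worker per core

    # ~2 GB / 64 MB blocks -> a few dozen partitions to parse in parallel.
    df = ddf.read_csv("data.csv", blocksize="64MB")
    df.to_parquet("data.pq", engine="pyarrow", write_index=False)
```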