Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to convert MultiIndex Pandas dataframe to Dask Dataframe

I am trying to convert a pandas dataframe that is MultiIndexed on two variables (an ID and a DateTime variable) to a dask dataframe; however, I get the following error: "NotImplementedError: Dask does not support MultiIndex Dataframes". I am using the…
Sher Afghan
  • 101
  • 1
  • 11
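The usual workaround for the error above is to flatten the MultiIndex into ordinary columns before handing the frame to dask. A minimal sketch with a hypothetical ID/timestamp frame (the dask step is shown only as a comment):

```python
import pandas as pd

# Hypothetical example: a frame MultiIndexed on an ID and a timestamp.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product(
        [["a", "b"], pd.to_datetime(["2021-01-01", "2021-01-02"])],
        names=["id", "ts"],
    ),
)

# Dask does not support MultiIndexes, so turn the index levels into
# ordinary columns first, then (optionally) set a single-level index.
flat = df.reset_index()
# A dask frame could then be built with:
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(flat.set_index("ts"), npartitions=4)
print(flat.columns.tolist())  # ['id', 'ts', 'value']
```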
4
votes
2 answers

How to use Dask on Databricks

I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens: either I get an ImportError, or, when I install distributed to solve this, Databricks just says Cancelled without throwing any…
SARose
  • 3,558
  • 5
  • 39
  • 49
4
votes
0 answers

Pycharm debugger throws Bad file descriptor error when using dask distributed

I am using the most lightweight/simple dask multiprocessing setup, the non-cluster local Client: from distributed import Client; client = Client(). Even so, the first instance of invoking dask.bag.compute() results in the following: Connected to…
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
4
votes
1 answer

sort dask dataframes by multiple columns some ascending, some descending

I'm converting pandas to dask; the main problem so far is sorting. For simple sorts I'm using nlargest; for complex sorting, like: df = df.sort_values( by=['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6',…
Carlos P Ceballos
  • 384
  • 1
  • 7
  • 20
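The pandas semantics being replicated here take a list of booleans in `ascending`, one per sort column. A small sketch of that call (dask's sort support is more limited, so a common fallback is computing to pandas when the result fits in memory, or applying this same call per partition via map_partitions, which only sorts within partitions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]})

# Sort by 'a' ascending and 'b' descending: 'ascending' takes one
# boolean per column in 'by'.
out = df.sort_values(by=["a", "b"], ascending=[True, False])
print(out["b"].tolist())  # [4, 3, 6, 5]
```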
4
votes
1 answer

Fast and efficient pandas groupby sum operation

I have a huge lod dataset of around 10 million rows and serious problems regarding performance and speed. I tried to use pandas, numpy (also using the numba library), and dask. However, I wasn't able to achieve sufficient success. Raw Data (minimal…
Mike_H
  • 1,343
  • 1
  • 14
  • 31
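For groupby-sum on millions of rows, two pandas knobs that commonly help are casting the key to a categorical dtype and passing sort=False (plus observed=True with categoricals) so pandas skips sorting group labels and empty categories. A sketch on synthetic data (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=100_000),
    "val": rng.random(100_000),
})

# Categorical keys hash faster and sort=False avoids ordering groups;
# observed=True restricts output to categories actually present.
df["key"] = df["key"].astype("category")
sums = df.groupby("key", sort=False, observed=True)["val"].sum()

# The per-group sums must add back up to the column total.
print(np.isclose(float(sums.sum()), float(df["val"].sum())))  # True
```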
4
votes
1 answer

Str split with expand in Dask Dataframe

I have 34 million rows and only one column. I want to split the string into 4 columns. Here is my sample dataset (df): Log 0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648: 1 Apr 4 20:30:33 100.51.100.254…
OctavianWR
  • 217
  • 1
  • 16
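A sketch of the split on a single sample log line, using a hypothetical one-row frame. In pandas, n=3 caps the split at 4 output columns; in dask, str.split with expand=True additionally requires n= so the number of output columns is known up front (e.g. ddf["Log"].str.split(" ", n=3, expand=True)):

```python
import pandas as pd

df = pd.DataFrame({"Log": [
    "Apr 4 20:30:33 100.51.100.254 dns,packet user",
]})

# n=3 performs at most 3 splits, yielding 4 columns:
# month, day, time, and the remainder of the line.
parts = df["Log"].str.split(" ", n=3, expand=True)
print(parts.shape)  # (1, 4)
```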
4
votes
1 answer

dask.array.apply_gufunc with multiple outputs of different shapes

I am trying to apply a ufunc to chunked broadcastable dask arrays which produce several outputs of different shapes: import dask.array as da # dask.__version__ is 1.2.0 import numpy as np def func(A3, A2): return A3+A2, A2**2 A3 =…
François
  • 7,988
  • 2
  • 21
  • 17
4
votes
1 answer

Merge multiple DataFrames

This question is referring to the previous post. The solutions proposed worked very well for a smaller data set; here I'm manipulating 7 .txt files with a total memory of 750 MB, which shouldn't be too big, so I must be doing something wrong in…
PEBKAC
  • 748
  • 1
  • 9
  • 28
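One idiomatic way to merge a list of frames is folding pairwise merges with functools.reduce; dask.dataframe.merge takes the same arguments, so the identical reduce is commonly applied to dask frames too. A minimal sketch with three made-up frames sharing an "id" key:

```python
from functools import reduce

import pandas as pd

frames = [
    pd.DataFrame({"id": [1, 2], "a": [10, 20]}),
    pd.DataFrame({"id": [1, 2], "b": [30, 40]}),
    pd.DataFrame({"id": [1, 2], "c": [50, 60]}),
]

# Fold left-to-right: ((f0 merge f1) merge f2) ...
merged = reduce(lambda l, r: pd.merge(l, r, on="id"), frames)
print(merged.columns.tolist())  # ['id', 'a', 'b', 'c']
```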
4
votes
0 answers

How do I get xarray.interp() to work in parallel?

I'm using xarray.interp on a large 3D DataArray (weather data: lat, lon, time) to map the values (wind speed) to new values based on a discrete mapping function f. The interpolation method seems to only utilise one core for computation, making the…
euronion
  • 1,142
  • 6
  • 14
4
votes
0 answers

Efficient dask method for reading sql table with where clause for 5 million rows

I have a 55-million-row table in MSSQL and I only need 5 million of those rows to pull into a dask dataframe. Currently, dask doesn't support SQL queries, but it does support SQLAlchemy statements; however, there's some issue with that as described here:…
msolomon87
  • 56
  • 1
  • 6
4
votes
1 answer

How to do groupby filter in Dask

I am attempting to take a dask dataframe, group by column 'A' and remove the groups where there are fewer than MIN_SAMPLE_COUNT rows. For example, the following code works in pandas: import pandas as pd import dask as da MIN_SAMPLE_COUNT = 1 x =…
user1549
  • 63
  • 6
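pandas expresses this directly as df.groupby("A").filter(lambda g: len(g) >= MIN_SAMPLE_COUNT), but dask has no GroupBy.filter. A size-based boolean mask expresses the same thing and translates to dask with little change (in dask, one would typically compute the per-group sizes and map or merge them back). A sketch on a tiny made-up frame:

```python
import pandas as pd

MIN_SAMPLE_COUNT = 2
df = pd.DataFrame({"A": ["x", "x", "y"], "val": [1, 2, 3]})

# Broadcast each group's row count back onto its rows, then keep
# only rows whose group is large enough.
sizes = df.groupby("A")["val"].transform("size")
kept = df[sizes >= MIN_SAMPLE_COUNT]
print(kept["A"].tolist())  # ['x', 'x']
```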
4
votes
1 answer

NotImplementedError is thrown when I use isin with Dask data frames

Let's say I have two dask data frames: import dask.dataframe as dd import pandas as pd dd_1 = dd.from_pandas(pd.DataFrame({'a': [1, 2,3], 'b': [6, 7, 8]}), npartitions=1) dd_2 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 5], 'b': [3, 7, 1]}),…
amarchin
  • 2,044
  • 1
  • 16
  • 32
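The NotImplementedError here arises because dask's isin does not accept another dask object as its argument; a common workaround is materialising the lookup values first (e.g. dd_2['a'].compute().tolist() or a plain list) and passing those. The pandas sketch below shows the intended semantics with concrete values:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 5], "b": [3, 7, 1]})

# Pass a concrete list of values, not a lazy collection.  In dask
# this would be dd_1["a"].isin(dd_2["a"].compute().tolist()).
mask = df1["a"].isin(df2["a"].tolist())
print(mask.tolist())  # [True, True, False]
```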
4
votes
3 answers

Dask add_done_callback with other args?

I'm looking to add a callback to a future once it is finished. Per the documentation: Call callback on future when callback has finished. The callback fn should take the future as its only argument. This will be called regardless of if the future…
wolfblade87
  • 173
  • 10
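Since the callback receives only the future, extra arguments are usually bound in advance with functools.partial. A sketch using the standard-library executor, whose add_done_callback API dask's distributed futures mirror:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

results = []

def on_done(future, label):
    # The future is always passed as the first positional argument;
    # 'label' was bound in advance with functools.partial.
    results.append((label, future.result()))

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(lambda: 42)
    fut.add_done_callback(partial(on_done, label="job-1"))

print(results)  # [('job-1', 42)]
```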
4
votes
2 answers

Writing huge dask dataframes to parquet fails out of memory

I am basically converting some csv files to parquet. To do so, I decided to use dask, read the csv with dask, and write it back to parquet. I am using a big blocksize as the customer requested (500 MB). The csv's are 15 GB and bigger (up to 50 GB), the…
Jonatan Aponte
  • 429
  • 1
  • 5
  • 10
4
votes
2 answers

How to update the shape, chunks and chunksize metadata of a dask array with nan dimensions

Suppose I generate an array with a shape that depends on some computation, such as: >>> import dask.array as da >>> a = da.random.normal(size=(int(1e6), 10)) >>> a = a[a.mean(axis=1) > 0] >>> a.shape (nan, 10) >>> a.chunks ((nan, nan, nan, nan,…
ogrisel
  • 39,309
  • 12
  • 116
  • 125