Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

How to convert MultiIndex Pandas dataframe to Dask Dataframe

I am trying to convert a pandas dataframe that is MultiIndexed on two variables (an ID and a DateTime variable) to a dask dataframe; however, I get the following error: "NotImplementedError: Dask does not support MultiIndex Dataframes". I am using the…
Sher Afghan
  • 101
  • 1
  • 11
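The usual workaround for the error above is to flatten the MultiIndex into ordinary columns before handing the frame to dask. A minimal sketch with a hypothetical ID/timestamp frame (the dask step is shown only as a comment):

```python
import pandas as pd

# Hypothetical example: a frame MultiIndexed on an ID and a timestamp.
df = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_product(
        [["a", "b"], pd.to_datetime(["2021-01-01", "2021-01-02"])],
        names=["id", "ts"],
    ),
)

# Dask does not support MultiIndexes, so turn the index levels into
# ordinary columns first, then (optionally) set a single-level index.
flat = df.reset_index()
# A dask frame could then be built with:
#   import dask.dataframe as dd
#   ddf = dd.from_pandas(flat.set_index("ts"), npartitions=4)
print(flat.columns.tolist())  # ['id', 'ts', 'value']
```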
4
votes
2 answers

How to use Dask on Databricks

I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens: either I get an ImportError, or, when I install distributed to solve this, Databricks just says Cancelled without throwing any…
SARose
  • 3,558
  • 5
  • 39
  • 49
4
votes
0 answers

Pycharm debugger throws Bad file descriptor error when using dask distributed

I am using the most lightweight/simple dask multiprocessing setup, the non-cluster local Client: from distributed import Client; client = Client(). Even so, the first instance of invoking dask.bag.compute() results in the following: Connected to…
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
4
votes
1 answer

sort dask dataframes by multiple columns some ascending, some descending

I'm converting pandas to dask; the main problem so far is sorting. For simple sorts I'm using nlargest; for complex sorting, like: df = df.sort_values( by=['column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6',…
Carlos P Ceballos
  • 384
  • 1
  • 7
  • 20
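The pandas semantics being replicated here take a list of booleans in `ascending`, one per sort column. A small sketch of that call (dask's sort support is more limited, so a common fallback is computing to pandas when the result fits in memory, or applying this same call per partition via map_partitions, which only sorts within partitions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 4, 5, 6]})

# Sort by 'a' ascending and 'b' descending: 'ascending' takes one
# boolean per column in 'by'.
out = df.sort_values(by=["a", "b"], ascending=[True, False])
print(out["b"].tolist())  # [4, 3, 6, 5]
```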
4
votes
1 answer

Fast and efficient pandas groupby sum operation

I have a huge lod dataset of around 10 million rows and serious problems regarding performance and speed. I tried to use pandas, numpy (also using the numba library), and dask. However, I wasn't able to achieve sufficient success. Raw Data (minimal…
Mike_H
  • 1,343
  • 1
  • 14
  • 31
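For groupby-sum on millions of rows, two pandas knobs that commonly help are casting the key to a categorical dtype and passing sort=False (plus observed=True with categoricals) so pandas skips sorting group labels and empty categories. A sketch on synthetic data (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=100_000),
    "val": rng.random(100_000),
})

# Categorical keys hash faster and sort=False avoids ordering groups;
# observed=True restricts output to categories actually present.
df["key"] = df["key"].astype("category")
sums = df.groupby("key", sort=False, observed=True)["val"].sum()

# The per-group sums must add back up to the column total.
print(np.isclose(float(sums.sum()), float(df["val"].sum())))  # True
```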
4
votes
1 answer

Str split with expand in Dask Dataframe

I have 34 million rows and only one column. I want to split the string into 4 columns. Here is my sample dataset (df): Log 0 Apr 4 20:30:33 100.51.100.254 dns,packet user: --- got query from 10.5.14.243:30648: 1 Apr 4 20:30:33 100.51.100.254…
OctavianWR
  • 217
  • 1
  • 16
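A sketch of the split on a single sample log line, using a hypothetical one-row frame. In pandas, n=3 caps the split at 4 output columns; in dask, str.split with expand=True additionally requires n= so the number of output columns is known up front (e.g. ddf["Log"].str.split(" ", n=3, expand=True)):

```python
import pandas as pd

df = pd.DataFrame({"Log": [
    "Apr 4 20:30:33 100.51.100.254 dns,packet user",
]})

# n=3 performs at most 3 splits, yielding 4 columns:
# month, day, time, and the remainder of the line.
parts = df["Log"].str.split(" ", n=3, expand=True)
print(parts.shape)  # (1, 4)
```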
4
votes
1 answer

dask.array.apply_gufunc with multiple outputs of different shapes

I am trying to apply a ufunc to chunked broadcastable dask arrays which produce several outputs of different shapes: import dask.array as da # dask.__version__ is 1.2.0 import numpy as np def func(A3, A2): return A3+A2, A2**2 A3 =…
François
  • 7,988
  • 2
  • 21
  • 17
4
votes
1 answer

Merge multiple DataFrames

This question is referring to the previous post. The solutions proposed worked very well for a smaller data set; here I'm manipulating 7 .txt files with a total memory of 750 MB, which shouldn't be too big, so I must be doing something wrong in…
PEBKAC
  • 748
  • 1
  • 9
  • 28
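One idiomatic way to merge a list of frames is folding pairwise merges with functools.reduce; dask.dataframe.merge takes the same arguments, so the identical reduce is commonly applied to dask frames too. A minimal sketch with three made-up frames sharing an "id" key:

```python
from functools import reduce

import pandas as pd

frames = [
    pd.DataFrame({"id": [1, 2], "a": [10, 20]}),
    pd.DataFrame({"id": [1, 2], "b": [30, 40]}),
    pd.DataFrame({"id": [1, 2], "c": [50, 60]}),
]

# Fold left-to-right: ((f0 merge f1) merge f2) ...
merged = reduce(lambda l, r: pd.merge(l, r, on="id"), frames)
print(merged.columns.tolist())  # ['id', 'a', 'b', 'c']
```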
4
votes
0 answers

How do I get xarray.interp() to work in parallel?

I'm using xarray.interp on a large 3D DataArray (weather data: lat, lon, time) to map the values (wind speed) to new values based on a discrete mapping function f. The interpolation method seems to only utilise one core for computation, making the…
euronion
  • 1,142
  • 6
  • 14
4
votes
0 answers

Efficient dask method for reading sql table with where clause for 5 million rows

I have a 55-million-row table in MSSQL and I only need 5 million of those rows to pull into a dask dataframe. Currently, dask doesn't support SQL queries, but it does support SQLAlchemy statements; however, there's some issue with that as described here:…
msolomon87
  • 56
  • 1
  • 6
4
votes
1 answer

How to do groupby filter in Dask

I am attempting to take a dask dataframe, group by column 'A' and remove the groups where there are fewer than MIN_SAMPLE_COUNT rows. For example, the following code works in pandas: import pandas as pd import dask as da MIN_SAMPLE_COUNT = 1 x =…
user1549
  • 63
  • 6
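pandas expresses this directly as df.groupby("A").filter(lambda g: len(g) >= MIN_SAMPLE_COUNT), but dask has no GroupBy.filter. A size-based boolean mask expresses the same thing and translates to dask with little change (in dask, one would typically compute the per-group sizes and map or merge them back). A sketch on a tiny made-up frame:

```python
import pandas as pd

MIN_SAMPLE_COUNT = 2
df = pd.DataFrame({"A": ["x", "x", "y"], "val": [1, 2, 3]})

# Broadcast each group's row count back onto its rows, then keep
# only rows whose group is large enough.
sizes = df.groupby("A")["val"].transform("size")
kept = df[sizes >= MIN_SAMPLE_COUNT]
print(kept["A"].tolist())  # ['x', 'x']
```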
4
votes
1 answer

NotImplementedError is thrown when I use isin with Dask data frames

Let's say I have two dask data frames: import dask.dataframe as dd import pandas as pd dd_1 = dd.from_pandas(pd.DataFrame({'a': [1, 2,3], 'b': [6, 7, 8]}), npartitions=1) dd_2 = dd.from_pandas(pd.DataFrame({'a': [1, 2, 5], 'b': [3, 7, 1]}),…
amarchin
  • 2,044
  • 1
  • 16
  • 32
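The NotImplementedError here arises because dask's isin does not accept another dask object as its argument; a common workaround is materialising the lookup values first (e.g. dd_2['a'].compute().tolist() or a plain list) and passing those. The pandas sketch below shows the intended semantics with concrete values:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 5], "b": [3, 7, 1]})

# Pass a concrete list of values, not a lazy collection.  In dask
# this would be dd_1["a"].isin(dd_2["a"].compute().tolist()).
mask = df1["a"].isin(df2["a"].tolist())
print(mask.tolist())  # [True, True, False]
```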
4
votes
3 answers

Dask add_done_callback with other args?

I'm looking to add a callback to a future once it is finished. Per the documentation: Call callback on future when callback has finished. The callback fn should take the future as its only argument. This will be called regardless of if the future…
wolfblade87
  • 173
  • 10
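Since the callback receives only the future, extra arguments are usually bound in advance with functools.partial. A sketch using the standard-library executor, whose add_done_callback API dask's distributed futures mirror:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

results = []

def on_done(future, label):
    # The future is always passed as the first positional argument;
    # 'label' was bound in advance with functools.partial.
    results.append((label, future.result()))

with ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(lambda: 42)
    fut.add_done_callback(partial(on_done, label="job-1"))

print(results)  # [('job-1', 42)]
```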
4
votes
2 answers

Writing huge dask dataframes to parquet fails out of memory

I am basically converting some csv files to parquet. To do so, I decided to use dask, read the csv with dask, and write it back to parquet. I am using a big blocksize as the customer requested (500 MB). The csv's are 15 GB and bigger (up to 50 GB), the…
Jonatan Aponte
  • 429
  • 1
  • 5
  • 10
4
votes
2 answers

How to update the shape, chunks and chunksize metadata of a dask array with nan dimensions

Suppose I generate an array with a shape that depends on some computation, such as: >>> import dask.array as da >>> a = da.random.normal(size=(int(1e6), 10)) >>> a = a[a.mean(axis=1) > 0] >>> a.shape (nan, 10) >>> a.chunks ((nan, nan, nan, nan,…
ogrisel
  • 39,309
  • 12
  • 116
  • 125