Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
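These two components can be sketched in a few lines. A minimal example, assuming `dask` is installed:

```python
import dask
import dask.array as da

# Layer 1: dynamic task scheduling. dask.delayed builds a task graph
# lazily; nothing runs until .compute().
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))   # a graph, not a result
print(total.compute())        # -> 5

# Layer 2: a "Big Data" collection. A dask array mimics the NumPy
# interface but is split into chunks run by the same schedulers.
x = da.ones((1000, 1000), chunks=(100, 100))  # 100 blocks of 100x100
print(x.sum().compute())      # -> 1000000.0
```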

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
1 answer

How to "reindex" with Dask DataFrame

I'm looking into using dask for time-series research with large volumes of data. One common operation that I use is realignment of data to a different index (the reindex operation on pandas DataFrames). I noticed that the reindex function is not…
John
  • 31
  • 3
3
votes
2 answers

Unable to use distributed's LocalCluster in a subprocess in Python 3

I get an error when using distribute's LocalCluster in a subprocess with python 3 (python 2 works fine). I have the following minimal example (I am using python 3.6, distributed 1.23.3, tornado 5.1.1): import multiprocessing from distributed import…
Joerg
  • 669
  • 1
  • 6
  • 10
3
votes
1 answer

What threads do Dask Workers have active?

When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask Worker and what are they doing?
MRocklin
  • 55,641
  • 23
  • 163
  • 235
3
votes
0 answers

Split a BigQuery dataframe into chunks using dask

I searched and tested different ways to split a BigQuery dataframe into chunks of 75 rows, but couldn't find a way to do so. Here is the scenario: I got a very large BigQuery dataframe (millions of rows) using Python and GCP…
MT467
  • 668
  • 2
  • 15
  • 31
3
votes
0 answers

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about 30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF…
Rowan_Gaffney
  • 452
  • 5
  • 17
3
votes
0 answers

Airflow + Dask: Can we specify resources?

How can one specify a resource like a GPU for a dask-worker, and use this so Airflow jobs that need such a resource are allocated correctly?
OddNorg
  • 868
  • 1
  • 6
  • 18
3
votes
1 answer

How should I write multiple CSV files efficiently using dask.dataframe?

Here is the summary of what I'm doing: At first, I do this with normal multiprocessing and the pandas package: Step 1. Get the list of file names which I'm going to read import os files = os.listdir(DATA_PATH + product) Step 2. Loop over the…
TianYu Jiang
  • 31
  • 1
  • 2
3
votes
1 answer

Using dask.bag vs normal python list?

When I run the parallel dask.bag code below, I see much slower computation than with the sequential Python code. Any insights into why? import dask.bag as db def is_even(x): return not x % 2 Dask code: %%timeit b =…
max
  • 4,141
  • 5
  • 26
  • 55
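A sketch of the likely explanation: every dask task carries scheduler overhead, which dwarfs a trivial body like `x % 2`. Fewer, larger partitions amortize that overhead; for work this small, a plain list comprehension usually wins outright.

```python
import dask.bag as db

def is_even(x):
    return not x % 2

data = list(range(10_000))
seq = [is_even(x) for x in data]          # sequential baseline

# Four partitions means four tasks, not ten thousand; per-task overhead
# is amortized across 2500 elements each.
b = db.from_sequence(data, npartitions=4)
par = b.map(is_even).compute(scheduler="threads")

print(par == seq)  # identical results; only the timing differs
```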
3
votes
1 answer

How to apply a function to multiple columns of a Dask Data Frame in parallel?

I have a Dask Dataframe for which I would like to compute skewness for a list of columns and if this skewness exceeds a certain threshold, I correct it using log transformation. I am wondering whether there is a more efficient way of making…
andersy005
  • 33
  • 1
  • 6
3
votes
1 answer

Jupyter Lab: open an iframe in a tab for monitoring the Dask scheduler

I am developing with dask distributed and this package provides a very useful debugging view as a bokeh application. I want to have this application next to my notebook in a JupyterLab tab. I have managed to do so by opening the jupyter lab…
3
votes
0 answers

dask concat fails for unequal sized dataframes

I am experiencing strange behavior when performing a concatenation of two dask dataframes (lazy objects) that have different numbers of columns/rows. The dataframes are read from hdf5 files using: df1 = dd.read_hdf( f1, 'hf', mode='r' ) the final…
Kostas Markakis
  • 143
  • 2
  • 11
3
votes
2 answers

Dask with Cython in Jupyter: ModuleNotFoundError: No module named '_cython_magic

I am getting: KilledWorker: ("('from_pandas-1445321946b8a22fc0ada720fb002544', 4)", 'tcp://127.0.0.1:45940') I've read the explanation about the latter error message, but this is all confusing coming together with the error message at the top of…
matanster
  • 15,072
  • 19
  • 88
  • 167
3
votes
1 answer

Dask DummyEncoder not returning all the columns

I tried using Dask's DummyEncoder for one-hot encoding my data, but the results are not as expected. Dask's DummyEncoder example: from dask_ml.preprocessing import DummyEncoder import pandas as pd data = pd.DataFrame({ 'B': ['a', 'a',…
Asif Ali
  • 1,422
  • 2
  • 12
  • 28
3
votes
1 answer

Adding columns in a Dask DataFrame overloads one worker

I'm trying Dask just for the fun of it, and to grasp good practice. After some trial and error, I got the hang of Dask Array. Now with Dask DataFrame, I don't seem to be able to extend the DataFrame in a balanced distributed scheme. Here's an…
Megamini
  • 313
  • 2
  • 9
3
votes
2 answers

Element-wise operations of arrays of different size

What would be the fastest and most Pythonic way to perform element-wise operations on arrays of different sizes without oversampling the smaller array? For example: I have a large array A, 1000x1000, and a small array B, 10x10. I want each element in B…
user2821
  • 1,568
  • 2
  • 12
  • 16
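A sketch of one pure-NumPy approach, scaled down to a 6x6 array and a 3x3 array so the blocks are easy to check: reshape A into a grid of blocks and broadcast B over it, so each element of B multiplies one block of A without ever tiling B.

```python
import numpy as np

A = np.arange(36.0).reshape(6, 6)   # stand-in for the 1000x1000 array
B = np.arange(9.0).reshape(3, 3)    # stand-in for the 10x10 array

# View A as a 3x3 grid of 2x2 blocks: blocks[i, r, j, c] == A[2*i+r, 2*j+c].
blocks = A.reshape(3, 2, 3, 2)

# B[:, None, :, None] has shape (3, 1, 3, 1) and broadcasts across each
# block, so B[i, j] scales block (i, j) of A with no copies of B.
out = (blocks * B[:, None, :, None]).reshape(6, 6)
print(out)
```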