Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
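These two components can be sketched in a few lines. A minimal example, assuming `dask` is installed:

```python
import dask
import dask.array as da

# Layer 1: dynamic task scheduling. dask.delayed builds a task graph
# lazily; nothing runs until .compute().
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

total = add(inc(1), inc(2))   # a graph, not a result
print(total.compute())        # -> 5

# Layer 2: a "Big Data" collection. A dask array mimics the NumPy
# interface but is split into chunks run by the same schedulers.
x = da.ones((1000, 1000), chunks=(100, 100))  # 100 blocks of 100x100
print(x.sum().compute())      # -> 1000000.0
```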

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3
votes
1 answer

How to "reindex" with Dask DataFrame

I'm looking into using dask for time-series research with large volumes of data. One common operation that I use is realignment of data to a different index (the reindex operation on pandas DataFrames). I noticed that the reindex function is not…
John
  • 31
  • 3
3
votes
2 answers

Unable to use distributed's LocalCluster in a subprocess in Python 3

I get an error when using distribute's LocalCluster in a subprocess with python 3 (python 2 works fine). I have the following minimal example (I am using python 3.6, distributed 1.23.3, tornado 5.1.1): import multiprocessing from distributed import…
Joerg
  • 669
  • 1
  • 6
  • 10
3
votes
1 answer

What threads do Dask Workers have active?

When running a Dask worker I notice that there are a few extra threads beyond what I was expecting. How many threads should I expect to see running from a Dask Worker and what are they doing?
MRocklin
  • 55,641
  • 23
  • 163
  • 235
3
votes
0 answers

Split a BigQuery dataframe into chunks using dask

I searched and tested different ways to split a BigQuery dataframe into chunks of 75 rows, but couldn't find a way to do so. Here is the scenario: I got a very large BigQuery dataframe (millions of rows) using Python and GCP…
MT467
  • 668
  • 2
  • 15
  • 31
3
votes
0 answers

Writing Dask/XArray to NetCDF - Parallel IO

I am using Dask/Xarray with a ~150 GB dataset on a distributed cluster on an HPC system. I have the computation component complete, which takes about 30 minutes. I want to save the final result to a NetCDF4 file, but writing the data to a NetCDF…
Rowan_Gaffney
  • 452
  • 5
  • 17
3
votes
0 answers

Airflow + Dask: Can we specify resources?

How can one specify a resource like a GPU for a dask-worker, and use this so Airflow jobs that need such a resource are allocated correctly?
OddNorg
  • 868
  • 1
  • 6
  • 18
3
votes
1 answer

How should I write multiple CSV files efficiently using dask.dataframe?

Here is the summary of what I'm doing: At first, I do this with normal multiprocessing and the pandas package: Step 1. Get the list of file names which I'm going to read import os files = os.listdir(DATA_PATH + product) Step 2. Loop over the…
TianYu Jiang
  • 31
  • 1
  • 2
3
votes
1 answer

Using dask.bag vs normal python list?

When I run the parallel dask.bag code below, I see much slower computation than with the sequential Python code. Any insights into why? import dask.bag as db def is_even(x): return not x % 2 Dask code: %%timeit b =…
max
  • 4,141
  • 5
  • 26
  • 55
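A sketch of the likely explanation: every dask task carries scheduler overhead, which dwarfs a trivial body like `x % 2`. Fewer, larger partitions amortize that overhead; for work this small, a plain list comprehension usually wins outright.

```python
import dask.bag as db

def is_even(x):
    return not x % 2

data = list(range(10_000))
seq = [is_even(x) for x in data]          # sequential baseline

# Four partitions means four tasks, not ten thousand; per-task overhead
# is amortized across 2500 elements each.
b = db.from_sequence(data, npartitions=4)
par = b.map(is_even).compute(scheduler="threads")

print(par == seq)  # identical results; only the timing differs
```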
3
votes
1 answer

How to apply a function to multiple columns of a Dask Data Frame in parallel?

I have a Dask Dataframe for which I would like to compute skewness for a list of columns and if this skewness exceeds a certain threshold, I correct it using log transformation. I am wondering whether there is a more efficient way of making…
andersy005
  • 33
  • 1
  • 6
3
votes
1 answer

Jupyter Lab: open an iframe in a tab for monitoring the Dask scheduler

I am developing with dask distributed and this package provides a very useful debugging view as a bokeh application. I want to have this application next to my notebook in a JupyterLab tab. I have managed to do so by opening the jupyter lab…
3
votes
0 answers

dask concat fails for unequal sized dataframes

I am experiencing strange behavior when performing a concatenation of two dask dataframes (lazy objects) that have different numbers of columns/rows. The dataframes are read from hdf5 files using: df1 = dd.read_hdf( f1, 'hf', mode='r' ) the final…
Kostas Markakis
  • 143
  • 2
  • 11
3
votes
2 answers

Dask with Cython in Jupyter: ModuleNotFoundError: No module named '_cython_magic

I am getting: KilledWorker: ("('from_pandas-1445321946b8a22fc0ada720fb002544', 4)", 'tcp://127.0.0.1:45940') I've read the explanation about the latter error message, but this is all confusing coming together with the error message at the top of…
matanster
  • 15,072
  • 19
  • 88
  • 167
3
votes
1 answer

Dask DummyEncoder not returning all the columns

I tried using Dask's DummyEncoder for one-hot encoding my data, but the results are not as expected. Dask's DummyEncoder example: from dask_ml.preprocessing import DummyEncoder import pandas as pd data = pd.DataFrame({ 'B': ['a', 'a',…
Asif Ali
  • 1,422
  • 2
  • 12
  • 28
3
votes
1 answer

Adding columns in a Dask DataFrame overloads one worker

I'm trying Dask just for the fun of it, and to grasp good practice. After some trial and error, I got the hang of Dask Array. Now with Dask DataFrame, I don't seem to be able to extend the DataFrame in a balanced distributed scheme. Here's an…
Megamini
  • 313
  • 2
  • 9
3
votes
2 answers

Element-wise operations of arrays of different size

What would be the fastest and most Pythonic way to perform element-wise operations on arrays of different sizes without oversampling the smaller array? For example: I have a large array A, 1000x1000, and a small array B, 10x10. I want each element in B…
user2821
  • 1,568
  • 2
  • 12
  • 16
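A sketch of one pure-NumPy approach, scaled down to a 6x6 array and a 3x3 array so the blocks are easy to check: reshape A into a grid of blocks and broadcast B over it, so each element of B multiplies one block of A without ever tiling B.

```python
import numpy as np

A = np.arange(36.0).reshape(6, 6)   # stand-in for the 1000x1000 array
B = np.arange(9.0).reshape(3, 3)    # stand-in for the 10x10 array

# View A as a 3x3 grid of 2x2 blocks: blocks[i, r, j, c] == A[2*i+r, 2*j+c].
blocks = A.reshape(3, 2, 3, 2)

# B[:, None, :, None] has shape (3, 1, 3, 1) and broadcasts across each
# block, so B[i, j] scales block (i, j) of A with no copies of B.
out = (blocks * B[:, None, :, None]).reshape(6, 6)
print(out)
```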