Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes • 1 answer

Dask read_csv fails to read from BytesIO

I have the following code to read a gzipped CSV file from bytes. It works with pandas.read_csv; however, it fails with dask (dd.read_csv). The file in d['urls'][0] is a link to a file on Amazon S3 provided by a third-party service. import io import…
Porada Kev • 503 • 11 • 24
3 votes • 1 answer

resample and groupby on big dask array with xarray - using map_blocks?

I have a custom workflow that requires using resample to get to a higher temporal frequency, applying a ufunc, and groupby + mean to compute the final result. I would like to apply this to a big xarray dataset, which is backed by a chunked dask…
Val • 6,585 • 5 • 22 • 52
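The question mixes xarray's resample/groupby with dask chunking; the underlying map_blocks contract (the function runs once per chunk and the results are stitched back together) can be shown on a plain dask array. A sketch, not the asker's actual workflow:

```python
import dask.array as da

x = da.arange(10, chunks=5)  # two chunks of five elements

# map_blocks calls the function once per chunk and concatenates the results;
# the per-chunk function must not need data from neighbouring chunks.
y = x.map_blocks(lambda block: block * 2)

print(y.compute())  # [ 0  2  4  6  8 10 12 14 16 18]
```

xarray offers xr.map_blocks with the same chunk-at-a-time contract, which is why a resample or groupby that crosses chunk boundaries is exactly the part it cannot express per block.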
3 votes • 1 answer

Writing a dask dataframe to parquet: 'TypeError'

I am trying to use Dask to write parquet files. The goal is to use its repartition feature, but it appears I am not able to write out even a simple parquet file without reaching the repartition step... Here is the code I use to create a parquet file from…
pierre_j • 895 • 2 • 11 • 26
3 votes • 0 answers

How to view the Dask Dashboard in Dask Gateway when using a private IP address/VPC?

We deployed Dask Gateway on Kubernetes on Google Cloud Platform. We are currently using an internal TCP load balancer to expose the traefik proxy for security purposes. Our users are able to create a client connection to the cluster generated…
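One common way to reach a dashboard that sits behind a private IP/VPC is to tunnel through kubectl; the namespace and service name below are guesses that depend on the Helm release, and the proxy path is a sketch of Dask Gateway's usual routing:

```shell
# Forward the Traefik proxy (service name is deployment-specific) to
# localhost, keeping the load balancer internal:
kubectl port-forward --namespace dask-gateway service/traefik-dask-gateway 8000:80

# The per-cluster dashboard is then served through the gateway's proxy,
# e.g. http://localhost:8000/clusters/<cluster-name>/status
```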
3 votes • 0 answers

Dask - how to efficiently execute the right number of tasks

I am trying to mask and then apply a unique operation on one column. A simplified version of the code I am using is reported below: import numpy as np import pandas as pd import dask.dataframe as dd data = np.random.randint(0,100,(1000,2)) ddf =…
Guido Muscioni • 1,203 • 3 • 15 • 37
3 votes • 3 answers

Splitting 250GB JSON file containing multiple tables into parquet

I have a JSON file with the following exemplified format, { "Table1": { "Records": [ { "Key1Tab1": "SomeVal", "Key2Tab1": "AnotherVal" }, { "Key1Tab1":…
Nicolai Iversen • 349 • 1 • 4 • 17
3 votes • 2 answers

dask distributed: How to increase timeout for worker connections? connect() didn't finish in time

OSError: Timed out trying to connect to 'tcp://127.0.0.1:40475' after 10 s: Timed out trying to connect to 'tcp:// 8.56.11:40475' after 10 s: connect() didn't finish in time Having some huge operations running, I would like to increase the timeout…
gies0r • 4,723 • 4 • 39 • 50
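The connect timeout is an ordinary dask config value (distributed.comm.timeouts.connect, default "10s") and can be raised before the Client is created:

```python
import dask

# Raise the connect timeout before creating any Client or workers.
dask.config.set({"distributed.comm.timeouts.connect": "60s"})

# Equivalent environment variable (set before the process starts):
#   DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s
print(dask.config.get("distributed.comm.timeouts.connect"))  # 60s
```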
3 votes • 0 answers

Dask Groupby Multi-index Level

I want to groupby dask multi-index data frame by its level. I want to do the following pandas equivalent in dask: df.groupby(level=0)['TARGET']\ .apply(lambda x: x.shift().rolling(min_periods=1, window=7).sum()).fillna(0)\ …
Krishnang K Dalal • 2,322 • 9 • 34 • 55
3 votes • 0 answers

Can Dask map_partitions return multiple outputs?

Background: train.csv has over 100M records; I tried on the first 1M. I wrote two functions: func1 applies to the partitions of train and returns 1 new dataframe; func2 applies to the partitions of train and returns 2 new…
Argos.LEE • 139 • 2 • 6
3 votes • 1 answer

ImportError: No module named 'dask.dataframe';

I'm trying to run a standalone Python script via "Anaconda Prompt". I keep getting the error ImportError: No module named 'dask.dataframe'. I have installed dask using conda install dask. I have also installed dask via: python -m pip install…
Sql_Pete_Belfast • 570 • 4 • 23
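This error usually means dask was installed without its optional dataframe dependencies, or into a different environment than the one running the script. Installing the dataframe extra into the exact interpreter that runs the script is the usual fix:

```shell
# Install dask together with its dataframe dependencies:
python -m pip install "dask[dataframe]"

# or, in an Anaconda Prompt environment:
conda install dask

# quick sanity check in the same interpreter
python -c "import dask.dataframe as dd; print(dd.__name__)"
```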
3 votes • 0 answers

Dask tasks failing because they timed out trying to connect

I am trying to perform some calculations on xarray data. The data has lat, lon and time coordinates, and multiple data variables. My calculation is performed on a single timestep. In an attempt to parallelize this I am using the dask distributed…
phrasper • 41 • 4
3 votes • 2 answers

Dask dataframe larger than memory

I'm new to Dask and I'm finding it quite useful, but I have a problem that I haven't been able to solve yet. I have a data set larger than memory, and I want to remove duplicate values from a column. The problem is that after this removal the data…
Klaifer Garcia • 340 • 1 • 12
3 votes • 2 answers

Dask - Find duplicate values

I need to find duplicates in a column in a dask DataFrame. In pandas there is the duplicated() method for this, but it is not supported in dask. Q: What is the best way of getting all duplicated values in dask? My idea: make the column I'm checking as…
Vladislav Varslavans • 2,775 • 4 • 18 • 33
3 votes • 1 answer

How to configure Dask to connect to a remote SSH server?

from dask.distributed import Client, SSHCluster cluster = SSHCluster(["localhost", "192.168.x.x"], connect_options={"known_hosts": None, "username": "xxxx", "client_keys": "~/.ssh/dask" }, worker_options={"nthreads": 2}, scheduler_options={"port":…
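A fuller version of the truncated call. The hostnames, username and key path are the question's placeholders, and the port and dashboard values are hypothetical guesses at what the cut-off options were setting; SSHCluster merely passes connect_options through to asyncssh:

```python
from dask.distributed import Client, SSHCluster

# All hosts must be reachable over SSH with key auth; the first entry
# runs the scheduler, the rest run workers.
cluster = SSHCluster(
    ["localhost", "192.168.x.x"],
    connect_options={
        "known_hosts": None,        # skip host-key checking
        "username": "xxxx",
        "client_keys": "~/.ssh/dask",
    },
    worker_options={"nthreads": 2},
    scheduler_options={"port": 8786,              # hypothetical values
                       "dashboard_address": ":8787"},
)
client = Client(cluster)
```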
3 votes • 1 answer

Growing memory usage (leak?) in Dask Distributed profiler

I have a longish running task that I submit to a Dask cluster (worker is running 1 process and 1 thread) and I use tracemalloc to track memory usage. The task can run long enough that memory usage builds up and has caused all sorts of problems. …
Alex P • 71 • 5
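The distributed worker profiler polls frequently by default and keeps its results in worker memory; slowing it down via the config keys distributed.worker.profile.interval and distributed.worker.profile.cycle is a commonly suggested mitigation, assuming the profiler really is what tracemalloc is seeing:

```python
import dask

# Poll far less often, and roll profile results over less frequently.
dask.config.set({
    "distributed.worker.profile.interval": "10s",  # default: 10ms
    "distributed.worker.profile.cycle": "60s",     # default: 1s
})
print(dask.config.get("distributed.worker.profile.interval"))  # 10s
```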