Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
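The two components above can be seen side by side in a minimal sketch (not from the tag wiki; assumes dask is installed):

```python
import dask
import dask.array as da

# Collection interface: a chunked array that mimics NumPy.
x = da.arange(10, chunks=5)      # two 5-element blocks
total = x.sum()                  # builds a task graph; nothing runs yet

# Task-scheduling interface: delayed wraps arbitrary functions lazily.
@dask.delayed
def double(n):
    return 2 * n

# One scheduler run evaluates both graphs together.
result = dask.compute(total, double(21))
print(result)
```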

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
4
votes
1 answer

memory error in dask when using dummy encoder

I am about to dummy encode a dask dataframe train_final[categorical_var]. However, when I run the code I get a memory error. Why would this happen, since Dask is supposed to load the data chunk by chunk? The code is below: from…
4
votes
4 answers

Dask read csv versus pandas read csv

I have the following problem. I have a huge CSV file and want to load it with multiprocessing. Pandas needs 19 seconds for an example file with 500000 rows and 130 columns of mixed dtypes. I tried dask because I want to multiprocess the reading.…
Varlor
  • 1,421
  • 3
  • 22
  • 46
4
votes
1 answer

How to merge dataframes with dask without running out of memory?

Merging multiple dask dataframes crashes my computer. Hi, I am attempting to merge a long list of csv files with dask. Each csv file contains a list of timestamps when a variable has changed its value, together with the value; e.g. for variable1…
4
votes
1 answer

Initializing state on dask-distributed workers

I am trying to do something like resource = MyResource() def fn(x): something = dosomething(x, resource) return something client = Client() results = client.map(fn, data) The issue is that resource is not serializable and is expensive to…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
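One pattern for per-worker state (a sketch; the dict resource is a placeholder for the expensive, unserializable object, and newer distributed versions also offer WorkerPlugin for this):

```python
from distributed import Client, get_worker

def setup():
    # Runs once per worker; stash the expensive resource on the worker
    # object instead of shipping it inside every task.
    get_worker().my_resource = {"handle": "expensive resource"}  # placeholder

def fn(x):
    resource = get_worker().my_resource
    return (x, resource["handle"])

client = Client(processes=False)          # in-process cluster for the sketch
client.register_worker_callbacks(setup)   # run setup on current and future workers
results = client.gather(client.map(fn, [1, 2]))
client.close()
```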
4
votes
1 answer

What is the default number of workers in a dask compute?

@delayed def do_something(): # Does some work pass futures = [do_something() for x in range(100)] compute(*futures) Does the default number of workers depend on our CPU cores? Or does it run all 100 in parallel (I assume this is not…
Geethanadh
  • 313
  • 5
  • 17
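The short answer (pool size tracks CPU count, not task count) can be made explicit with num_workers; a sketch:

```python
import dask
from dask import delayed

@delayed
def do_something(x):
    return x * x

futures = [do_something(x) for x in range(100)]

# With the default threaded scheduler, delayed tasks run in a thread
# pool sized roughly to the CPU count -- not one thread per task.
# num_workers pins the pool size explicitly.
results = dask.compute(*futures, scheduler="threads", num_workers=4)
print(len(results))
```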
4
votes
1 answer

Purpose of compute() in Dask

What're the logistics behind having the extra .compute() in the NumPy- and pandas-mimicking functionality? Is it just to support some kind of lazy evaluation? Example from the Dask documentation below: import pandas as pd import…
Ryan McCormick
  • 246
  • 4
  • 14
4
votes
1 answer

Python Dask dataframe separation based on column value

I'm a complete newbie to Python dask (a little experience with pandas). I have a large Dask DataFrame (~10 to 20 million rows) that I have to separate based on a unique column value. For example, if I have the following DataFrame with columns C1 to…
pichlbaer
  • 923
  • 1
  • 10
  • 18
4
votes
1 answer

dask read_csv timeout on Amazon s3 with big files

dask read_csv times out on S3 for big files s3fs.S3FileSystem.read_timeout = 5184000 # one day s3fs.S3FileSystem.connect_timeout = 5184000 # one day client = Client('a_remote_scheduler_ip_here:8786') df =…
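A configuration sketch (bucket and values are placeholders): rather than patching s3fs class attributes, timeouts can be passed per-call through storage_options, which s3fs forwards via config_kwargs to botocore's Config:

```python
import dask.dataframe as dd

# storage_options is handed to s3fs; config_kwargs reaches botocore's
# Config object, which owns the read/connect timeouts.
df = dd.read_csv(
    "s3://my-bucket/big-*.csv",           # hypothetical path
    blocksize="64MB",
    storage_options={
        "config_kwargs": {"read_timeout": 600, "connect_timeout": 60},
    },
)
```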
4
votes
1 answer

How to achieve `groupby` rolling mean in dask?

I have a dataframe, and I want to groupby some attributes and calculate the rolling mean of a numerical column in Dask. I know there is no implementation in Dask for groupby rolling but I read an SO question which shows it was possible. Dask rolling…
pissall
  • 7,109
  • 2
  • 25
  • 45
4
votes
1 answer

dask: How do I avoid timeout for a task?

In my dask-based application (using the distributed scheduler), I'm seeing failures that start with this error text: tornado.application - ERROR - Exception in Future after timeout Traceback (most recent call last): File…
Stuart Berg
  • 17,026
  • 12
  • 67
  • 99
4
votes
1 answer

How do I use dask to efficiently calculate many simple statistics

Problem: I want to calculate a bunch of "easy to gather" statistics using Dask. Speed is my primary objective, so I am looking to throw a wide cluster at the problem. Ideally I would like to finish the described problem in less than…
bluecoconut
  • 63
  • 1
  • 5
4
votes
0 answers

Dealing with large grib files using xarray and dask

I'm reading some (apparently) large grib files using xarray. I say 'apparently' because they're ~100MB each, which doesn't seem too big to me. However, running import xarray as xr ds = xr.open_dataset("gribfile.grib", engine="cfgrib") takes a…
jezza
  • 331
  • 2
  • 13
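A sketch of the standard fix (file name and chunk sizes are assumptions; cfgrib must be installed): pass chunks= so xarray wraps the file in lazy dask arrays instead of reading it eagerly:

```python
import xarray as xr

# chunks= makes xarray return dask-backed arrays; without it, the
# cfgrib engine may decode far more data up front than expected.
ds = xr.open_dataset("gribfile.grib", engine="cfgrib",
                     chunks={"time": 10})  # chunk sizes are an assumption
mean = ds.mean()          # still lazy
result = mean.compute()   # triggers the actual read, chunk by chunk
```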
4
votes
0 answers

Dask Distributed client takes too long to initialize in Jupyter Lab

Trying to initialize a client with a local cluster in Jupyter Lab, but it hangs. This behaviour happens with Python 3.5 and JupyterLab 0.35. import dask.dataframe as dd from dask import delayed from distributed import Client from distributed import…
Apostolos
  • 7,763
  • 17
  • 80
  • 150
4
votes
2 answers

How do Dask threads interact with OpenBLAS/MKL/…?

According to What threads do Dask Workers have active?, a dask worker has A pool of threads in which to run tasks. The documentation says If your computations are mostly numeric in nature (for example NumPy and Pandas computations) and release…
Labo
  • 2,482
  • 2
  • 18
  • 38
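The practical takeaway from that thread (a sketch, not the answer's code) is to avoid oversubscription: when Dask supplies the parallelism, pin BLAS libraries to one thread each via environment variables, set before NumPy is imported:

```python
import os

# With many dask threads, let each BLAS call run single-threaded;
# these must be set before numpy (and its BLAS) is first imported.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

x = np.random.random((200, 200))
y = x @ x  # runs single-threaded in BLAS; dask provides the parallelism
```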
4
votes
1 answer

dask dataframe from python list of tuples

I am really new to dask. I want to create a dask dataframe from a Python list of tuples. In pandas, you can use DataFrame.from_records to convert a list of tuples to a dataframe. What function gives me the same functionality in dask? My data looks a…
Ali. K
  • 147
  • 1
  • 3
  • 8