Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
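
For example, the dataframe collection exposes a pandas-like API over data that may not fit in memory. A minimal sketch (the file pattern and column names are illustrative):

```python
import dask.dataframe as dd

# Lazily point at many CSVs; nothing is read yet.
df = dd.read_csv("data/2021-*.csv")           # hypothetical file pattern

# Familiar pandas-style operations build a task graph instead of running eagerly.
result = df.groupby("name")["amount"].mean()

# compute() hands the graph to a scheduler, which runs it in parallel.
print(result.compute())
```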

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 1 answer

Is it possible to get the intersection of sets using dask?

I have a big dataset (50 million rows) in which I need to do some row-wise computations, like getting the intersection of two sets (each in a different column), e.g. col_1:{1587004, 1587005, 1587006, 1587007} col_2:{1587004,…
Ivo Leist
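
One possible approach, sketched under the assumption that the two columns hold Python sets inside a Dask dataframe (the example frame and column names are illustrative): run a plain pandas apply inside each partition via map_partitions.

```python
import pandas as pd
import dask.dataframe as dd

# Illustrative frame with two object columns holding Python sets.
pdf = pd.DataFrame({
    "col_1": [{1587004, 1587005, 1587006, 1587007}, {1, 2, 3}],
    "col_2": [{1587004, 1587010}, {2, 4}],
})
ddf = dd.from_pandas(pdf, npartitions=2)

def intersect_rows(part):
    # Plain pandas inside each partition: row-wise set intersection.
    return part.apply(lambda r: r["col_1"] & r["col_2"], axis=1).rename("intersection")

result = ddf.map_partitions(intersect_rows, meta=("intersection", "object"))
print(result.compute())
```
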
3 votes, 1 answer

xarray with dask sel is slow

A series of about 90 netCDF files, each around 27 MB, opened with xarray's open_mfdataset takes a long time to load a small space-time selection. Chunking the dimensions yields only marginal gains. decode_cf=True either inside the function or separately has…
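
A hedged sketch of the usual mitigation: pass chunks to open_mfdataset so the selection only touches a few chunks, and load only the subset. The variable names, coordinates, and chunk sizes below are assumptions, not taken from the question:

```python
import xarray as xr

# Open ~90 files lazily with Dask; chunk sizes here are illustrative.
ds = xr.open_mfdataset(
    "data/*.nc",
    combine="by_coords",
    chunks={"time": 100},   # align chunks with how the data will be selected
    parallel=True,          # read file metadata in parallel via dask.delayed
)

# Select a small space-time window, then pull only that piece into memory.
subset = ds["temperature"].sel(
    time=slice("2015-01-01", "2015-01-10"),
    lat=slice(40, 45),
    lon=slice(-10, 0),
).load()
```
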
3 votes, 1 answer

Is Dask's read_json() parallel?

I have the code below. It uses dask.distributed to read 100 JSON files (Workers: 5, Cores: 5, Memory: 50.00 GB): from dask.distributed import Client import dask.dataframe as dd client = Client('xxxxxxxx:8786') df =…
MT467
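
dd.read_json reads a glob of files into separate partitions, so with a distributed client the files are parsed in parallel across the workers. A minimal sketch (the scheduler address, paths, and options are illustrative):

```python
from dask.distributed import Client
import dask.dataframe as dd

client = Client("scheduler-address:8786")   # hypothetical scheduler address

# One partition per file (or more); each partition is parsed on a worker.
df = dd.read_json("data/records-*.json", orient="records", lines=True)
print(df.npartitions)

result = df.describe().compute()
```
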
3 votes, 1 answer

How can I get the result of a Dask compute on a different machine than the one that submitted it?

I am using Dask behind a Django server and the basic setup I have is summarised here: https://github.com/MoonVision/django-dask-demo/ where the Dask client can be found here:…
Matt Nicolls
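
One pattern for this, sketched under assumptions (the scheduler address, input, and dataset name are made up): publish the future on the cluster from the submitting client, then fetch it by name from a client on the other machine.

```python
from dask.distributed import Client
import dask.dataframe as dd

# Machine A: submit the work and publish the resulting future under a name.
client_a = Client("scheduler-address:8786")       # hypothetical address
df = dd.read_csv("s3://bucket/data-*.csv")        # hypothetical input
future = client_a.compute(df.sum())
client_a.publish_dataset(daily_sum=future)        # result stays on the cluster

# Machine B: connect to the same scheduler and retrieve it by name.
client_b = Client("scheduler-address:8786")
result = client_b.get_dataset("daily_sum").result()
print(result)
```
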
3 votes, 1 answer

How to get the name of a csv file causing an error in dask.read_csv?

My objective is to parallelize reading many (500+) CSV files containing measurement data. To do so I pass a list of paths (source_files) to a synchronous client. Additionally, I have specified dtypes and column names (order_list). df =…
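
One way to surface the offending file name, sketched with dask.delayed. The paths, dtypes, and column names below are placeholders standing in for the question's source_files, dtypes, and order_list:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

source_files = ["data/one.csv", "data/two.csv"]        # placeholder paths
dtypes = {"sensor": "object", "value": "float64"}      # placeholder dtypes
order_list = ["sensor", "value"]                       # placeholder column names

@delayed
def read_one(path):
    try:
        return pd.read_csv(path, dtype=dtypes, names=order_list)
    except Exception as exc:
        # Re-raise with the offending file name attached.
        raise RuntimeError(f"failed to read {path}") from exc

df = dd.from_delayed([read_one(p) for p in source_files])
```
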
3 votes, 0 answers

Dask hangs when using dask_xgboost train method

I am trying to reproduce the dask xgboost example from the dask-ml docs at http://ml.dask.org/examples/xgboost.html. Unfortunately, Dask doesn't seem to complete the training and I'm having a hard time tracking down the meaning of the errors and…
chicagoson
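
For reference, a hedged sketch of the training call from that example (the scheduler address, file pattern, and column names are assumptions); hangs of this kind are often connectivity problems between the workers rather than a problem with the call itself:

```python
from dask.distributed import Client
import dask.dataframe as dd
import dask_xgboost as dxgb

client = Client("scheduler-address:8786")    # hypothetical scheduler
df = dd.read_csv("data/train-*.csv")         # hypothetical training data
labels = df["target"]
data = df.drop(columns=["target"])

params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}

# dask_xgboost starts an XGBoost worker alongside each Dask worker and
# hands each one its local partitions of `data` and `labels`.
bst = dxgb.train(client, params, data, labels, num_boost_round=50)
```
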
3 votes, 1 answer

Using Dask on an apply returning several columns (i.e. a DataFrame)

I'm trying to use dask on an apply with a function that outputs 5 floats. I'll simplify with an example here. def func1(row, param): return float(row.Val1) * param, float(row.Val1) * np.power(param, 2) data = pd.DataFrame(np.array([["A01", 12],…
mbahin
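
A sketch of one way to get named columns back from the apply: return a pandas Series from the function and give Dask an explicit meta. The output column names and param value here are illustrative:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

def func1(row, param):
    # Returning a Series yields named output columns instead of a tuple.
    return pd.Series({
        "out1": float(row.Val1) * param,
        "out2": float(row.Val1) * np.power(param, 2),
    })

data = pd.DataFrame(np.array([["A01", 12], ["A02", 15]]),
                    columns=["Key", "Val1"])
ddf = dd.from_pandas(data, npartitions=2)

# meta tells Dask the columns/dtypes of the result so the graph is built lazily.
out = ddf.apply(func1, axis=1, param=3, meta={"out1": "f8", "out2": "f8"})
print(out.compute())
```
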
3 votes, 1 answer

DASK Metadata mismatch found in 'from_delayed' JSON file

I'm just starting my adventure with Dask and I'm learning on an example dataset in JSON format. I know that this is not the easiest data format in the world for a beginner :) I have a dataset in JSON format. I loaded the data via…
AWL
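
A hedged sketch of the usual fix: normalise every file to the same columns/dtypes inside the delayed loader and pass an explicit meta to from_delayed, so a deviating file fails loudly instead of surfacing later as a metadata mismatch. The file paths and schema here are illustrative:

```python
import pandas as pd
import dask.dataframe as dd
from dask import delayed

SCHEMA = {"id": "int64", "name": "object", "value": "float64"}  # illustrative

@delayed
def load_json(path):
    df = pd.read_json(path, lines=True)
    # Force every partition into the same column order and dtypes.
    return df.reindex(columns=list(SCHEMA)).astype(SCHEMA)

paths = ["data/part-0.json", "data/part-1.json"]   # placeholder files
meta = pd.DataFrame({c: pd.Series(dtype=t) for c, t in SCHEMA.items()})
ddf = dd.from_delayed([load_json(p) for p in paths], meta=meta)
```
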
3 votes, 1 answer

Actors and dask-workers

client = Client('127.0.0.1:8786', direct_to_workers=True) future1 = client.submit(Counter, workers='ninja', actor=True) counter1 = future1.result() print(counter1) All is well, but what if the client gets restarted? How do I…
chak
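
For context, a minimal sketch of the actor pattern the snippet uses, with the Counter class (which the question omits) defined along the lines of the Dask documentation; it does not address re-acquiring the actor after a client restart:

```python
from dask.distributed import Client

class Counter:
    """Tiny stateful actor; its state lives on a single worker."""
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

client = Client("127.0.0.1:8786", direct_to_workers=True)

future = client.submit(Counter, actor=True)   # create the actor on a worker
counter = future.result()                     # an Actor proxy
print(counter.increment().result())           # method calls return ActorFutures
```
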
3 votes, 1 answer

How can I read many large .7z files containing many CSV files?

I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially with pandas and Dask dataframes)? Should I change the compression format to something else?
Eghbal
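
Since pandas and Dask cannot read .7z archives directly, one workable sketch is to decompress first (here with the third-party py7zr package) and then hand the extracted CSVs to Dask; paths and block size are illustrative:

```python
import glob
import pathlib

import py7zr                     # pip install py7zr
import dask.dataframe as dd

out_dir = pathlib.Path("extracted")
out_dir.mkdir(exist_ok=True)

# Extract every archive once; this part is plain, sequential Python.
for archive in glob.glob("archives/*.7z"):          # hypothetical location
    with py7zr.SevenZipFile(archive, mode="r") as z:
        z.extractall(path=out_dir)

# Dask then reads the CSVs in parallel, splitting large files into blocks.
df = dd.read_csv(str(out_dir / "*.csv"), blocksize="128MB")
```

Re-compressing as gzip or zip (which pandas and Dask read natively via the compression= option) avoids the extraction step, at the cost of losing the ability to split compressed files into blocks.
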
3 votes, 1 answer

Multiple merge in Dask and field names

I am trying to merge multiple pandas dataframes onto a large Dask dataframe with fields ["a_id", "b_id", "c_id"]. Each pandas dataframe "A", "B", and "C" has a unique field ("a_id", "b_id", and "c_id") that joins it to the Dask dataframe. "B" and…
triphook
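
A sketch of one way to keep the field names apart: rename the value columns before merging, then chain the merges on their respective keys. The frames below are illustrative stand-ins for the question's A, B, C and the large Dask frame:

```python
import pandas as pd
import dask.dataframe as dd

big = dd.from_pandas(
    pd.DataFrame({"a_id": [1, 2], "b_id": [10, 20], "c_id": [100, 200]}),
    npartitions=2,
)
A = pd.DataFrame({"a_id": [1, 2], "val": [0.1, 0.2]}).rename(columns={"val": "val_a"})
B = pd.DataFrame({"b_id": [10, 20], "val": [1.1, 1.2]}).rename(columns={"val": "val_b"})
C = pd.DataFrame({"c_id": [100, 200], "val": [2.1, 2.2]}).rename(columns={"val": "val_c"})

# Renaming up front avoids the automatic _x/_y suffixes on colliding columns.
out = (
    big.merge(A, on="a_id", how="left")
       .merge(B, on="b_id", how="left")
       .merge(C, on="c_id", how="left")
)
print(out.compute())
```
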
3 votes, 1 answer

How do I share a large read-only object across Dask distributed workers

The problem: I'm trying to send a 2 GB CPython read-only object (which can be pickled) to Dask distributed workers via apply(). This ends up consuming a lot of memory for processes/threads (14+ GB). Is there a way to load the object only once into memory…
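
One commonly suggested pattern, sketched under assumptions (the scheduler address, object, and frame are placeholders): scatter the object once with broadcast=True and pass the resulting future into the tasks instead of the object itself, so it is not re-serialised into every task.

```python
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client("scheduler-address:8786")        # hypothetical scheduler

big_object = {"lookup": set(range(10_000))}      # stand-in for the 2 GB object

# One copy per worker; tasks receive the future, not a fresh pickle each time.
big_future = client.scatter(big_object, broadcast=True)

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

def use_it(part, shared):
    # `shared` arrives as the already-deserialised object on the worker.
    return part["x"].isin(shared["lookup"])

result = ddf.map_partitions(use_it, big_future, meta=("x", "bool")).compute()
```
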
3 votes, 1 answer

Dask dataframe in requirements.txt?

I need to install a few python packages in a Docker container via requirements.txt, using pip install. One of the packages is dask. However, when installing it, it throws an error because it cannot find the package toolz. The question has been…
giosans
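
This typically happens when the bare dask package is installed without the extras that the dataframe module needs (toolz, partd, pandas, ...). A hedged sketch of the requirements.txt line that pulls them in:

```
# requirements.txt -- the extra installs the dataframe dependencies (toolz, partd, pandas, ...)
dask[dataframe]
# or, for everything including the distributed scheduler:
# dask[complete]
```
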
3 votes, 0 answers

How to read csv in parallel and write in Cassandra in parallel for achieving high throughput?

I have tried using execute, execute_async and execute_concurrent in Cassandra, but for reading 10M rows I could index them into Cassandra in no less than 55 minutes. Note that I had set the concurrent threads to 1000, tuned the YAML file's…
aviral sanjay
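
A hedged sketch of one way to parallelise both sides with Dask plus the DataStax driver (the contact point, keyspace, table, and column names are made up): read the CSV in blocks and have each partition write its own rows with execute_concurrent_with_args.

```python
import pandas as pd
import dask.dataframe as dd
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

def write_partition(part):
    # One connection per partition; each Dask worker writes its rows concurrently.
    cluster = Cluster(["cassandra-host"])            # hypothetical contact point
    session = cluster.connect("my_keyspace")         # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO measurements (id, value) VALUES (?, ?)"
    )
    rows = list(part[["id", "value"]].itertuples(index=False, name=None))
    execute_concurrent_with_args(session, insert, rows, concurrency=100)
    cluster.shutdown()
    return pd.Series([len(part)], name="written")

ddf = dd.read_csv("data/rows-*.csv", blocksize="64MB")   # hypothetical input
written = ddf.map_partitions(write_partition, meta=("written", "int64")).compute()
print(int(written.sum()))
```
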
3 votes, 2 answers

Euclidean distance calculation using Python and Dask

I'm attempting to identify elements in a Euclidean distance matrix that fall under a certain threshold. I then take the positional arguments for this search and use them to compare elements in a second array (for the sake of demonstration this array…
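
A sketch of the thresholding step with dask.array (the point set, chunking, and threshold are illustrative): build the pairwise distance matrix lazily by broadcasting and let argwhere return the positions below the threshold.

```python
import dask.array as da

# Illustrative 2-D point set, chunked so the pairwise matrix is built blockwise.
x = da.random.random((10_000, 3), chunks=(2_000, 3))

# Pairwise Euclidean distances via broadcasting: shape (10_000, 10_000), lazy.
diff = x[:, None, :] - x[None, :, :]
dist = da.sqrt((diff ** 2).sum(axis=-1))

# (i, j) index pairs where the distance falls under the threshold.
threshold = 0.05
positions = da.argwhere(dist < threshold).compute()
```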