Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers, as the short sketch after this list illustrates.
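
A minimal sketch showing both layers at once with dask.array (the array shape and chunking are arbitrary choices for illustration):

    import dask.array as da

    # Collection layer: a 10,000 x 10,000 array split into 1,000 x 1,000 blocks.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # Operations build a task graph rather than executing immediately...
    y = (x + x.T).mean(axis=0)

    # ...and the scheduling layer runs that graph across local cores.
    print(y.compute())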

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

6 votes, 1 answer

How to load a directory of GRIB files into a Dask array

Suppose I have a directory with thousands of GRIB files. I want to load those files into a dask array so I can query them. How can I go about doing this? The attempt below seems to work, but it requires each GRIB file to be opened, and it takes a…
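
One common approach (a sketch, not taken from the question): let xarray build the dask-backed array lazily. This assumes the cfgrib engine is installed, one time step per file, and the directory path and variable name "t2m" are placeholders:

    import glob
    import xarray as xr

    paths = sorted(glob.glob("grib_dir/*.grib"))  # placeholder directory

    # open_mfdataset builds a lazy, dask-backed dataset; parallel=True
    # opens the files concurrently instead of one at a time.
    ds = xr.open_mfdataset(
        paths,
        engine="cfgrib",      # GRIB support via the cfgrib package
        combine="nested",
        concat_dim="time",    # assumes one time step per file
        parallel=True,
    )
    arr = ds["t2m"].data      # "t2m" is a placeholder; arr is a dask.array
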
6 votes, 2 answers

Loading local file from client onto dask distributed cluster

A bit of a beginner question, but I was not able to find a relevant answer on this. Essentially my data (about 7 GB) is located on my local machine. I have a distributed cluster running on the local network. How can I get this file onto the cluster?…
asked by Bot Man
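
A sketch of two usual options; the scheduler address and file name are placeholders:

    import pandas as pd
    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")   # placeholder address

    # Option 1: load locally, then ship the object to the workers.
    df = pd.read_csv("local_data.csv")                # placeholder file
    remote_df = client.scatter(df)                    # returns a Future on the cluster

    # Option 2 (usually better for ~7 GB): copy the file to storage every
    # worker can reach (NFS, S3, HDFS, ...) and read it there lazily:
    #   import dask.dataframe as dd
    #   ddf = dd.read_csv("/shared/local_data.csv")
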
6 votes, 2 answers

On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

In the code snippet below, I would expect the logs to print the numbers 0-4. I understand that the numbers may not be in that order, as the task would be broken up into a number of parallel operations. Code snippet: from dask import dataframe as…
asked by kuanb
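
The usual explanation is that Dask first calls the function on a tiny dummy frame (filled with values like 1) to infer output types; passing meta= skips that pass. A minimal sketch with made-up data:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(5)}), npartitions=2)

    def log_row(row):
        print(row["x"])        # without meta=, this also fires on dummy rows of 1s
        return row["x"]

    # Supplying meta= tells Dask the output type up front, so it skips the
    # trial run on made-up data and only real rows are printed.
    result = ddf.apply(log_row, axis=1, meta=("x", "int64")).compute()
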
6 votes, 1 answer

Dask reading CSV, setting partition as CSV length

I'm trying to write code that will read a set of CSVs named my_file_*.csv into a Dask dataframe. Then I want to set the partitions based on the length of the CSV. I'm trying to map a function on each partition and in order to do that, each…
asked by abcdefg
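
A minimal sketch: read_csv with blocksize=None keeps one partition per input file (the glob pattern is taken from the question):

    import dask.dataframe as dd

    # blocksize=None disables splitting, so each input file becomes exactly
    # one partition and map_partitions sees whole CSVs.
    ddf = dd.read_csv("my_file_*.csv", blocksize=None)

    rows_per_file = ddf.map_partitions(len).compute()   # one length per CSV
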
6 votes, 1 answer

How can I see the data preview of Dask DataFrame?

I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using df.head(), it takes too much time. How can I view the dataframe?
asked by Hari
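
A sketch of one common remedy, persist(), which assumes the intermediate result fits in memory; the frame and applied functions here are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=10)
    ddf = ddf.assign(y=ddf.x * 2)    # stand-in for the applied functions

    # head() only needs the first partition, but every lazy step before it
    # still has to run.  persist() computes and caches the intermediate
    # result, so repeated previews are fast.
    ddf = ddf.persist()
    print(ddf.head())
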
6 votes, 2 answers

move from pandas to dask to utilize all local cpu cores

Recently I stumbled upon http://dask.pydata.org/en/latest/. As I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would dask work well to use all (local) CPU cores? If yes, how compatible is it to…
asked by Georg Heiler
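
A minimal sketch of the usual migration path; the file name, column names "key" and "value", and the partition count are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.read_csv("data.csv")                # the existing single-core load
    ddf = dd.from_pandas(pdf, npartitions=8)     # e.g. one partition per core

    # Most pandas expressions work unchanged; by default .compute() runs
    # the graph on a local thread pool, using the available cores.
    result = ddf.groupby("key")["value"].mean().compute()
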
6 votes, 1 answer

Slow Dask performance on CSV date parsing?

I've been doing a lot of text processing on a big pile of files, including large CSVs and lots and lots of little XML files. Sometimes I'm doing aggregate counts, but a lot of the time I'm doing NLP-type work to take deeper looks at what is in these files…
asked by Mike Shea
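
One commonly suggested speedup (a sketch; the file pattern, column name, and format string are placeholders): skip parse_dates= in read_csv and convert afterwards with an explicit format, which avoids per-row format inference:

    import dask.dataframe as dd

    # Read the column as plain strings first...
    ddf = dd.read_csv("big_*.csv", dtype={"timestamp": "object"})

    # ...then convert with a known format, which is usually much faster.
    ddf["timestamp"] = dd.to_datetime(ddf["timestamp"], format="%Y-%m-%d %H:%M:%S")
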
6 votes, 1 answer

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame? Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a…
asked by MRocklin
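
A minimal sketch of the usual pattern; the file pattern and column name are placeholders:

    import dask.dataframe as dd

    # One glob gives a single logical dataframe over all daily files.
    ddf = dd.read_csv("data/2000-*.csv", parse_dates=["timestamp"])

    # Sorting by time once makes later time-based slicing and resampling
    # efficient; set_index triggers a shuffle, so persist the result if it
    # will be reused.
    ddf = ddf.set_index("timestamp")
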
6 votes, 1 answer

Aggregation fails when using lambdas

I'm trying to port parts of my application from pandas to dask and I hit a roadblock when using a lambda function in a groupby on a dask DataFrame. import dask.dataframe as dd dask_df = dd.from_pandas(pandasDataFrame, npartitions=2) dask_df =…
asked by barney.balazs
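
A sketch of the dd.Aggregation route, which replaces a bare lambda with explicit per-partition and combine steps (the data here is made up):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # A bare lambda can't be applied blindly: Dask must know how to combine
    # per-partition results.  dd.Aggregation spells out both steps (here, a
    # distributed max).
    dask_max = dd.Aggregation(
        name="dask_max",
        chunk=lambda s: s.max(),   # within each partition
        agg=lambda s: s.max(),     # across the partition results
    )
    print(ddf.groupby("g")["x"].agg(dask_max).compute())
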
6 votes, 1 answer

Python Dask - dataframe.map_partitions() return value

So dask.dataframe.map_partitions() takes a func argument and the meta kwarg. How exactly does it decide its return type? As an example: Lots of csv's in ...\some_folder. ddf = dd.read_csv(r"...\some_folder\*", usecols=['ColA', 'ColB'], …
asked by StarFox
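
A minimal sketch of how meta= interacts with the return type; the column names follow the question, the dtypes are assumptions:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(
        pd.DataFrame({"ColA": [1, 2, 3, 4], "ColB": [5, 6, 7, 8]}), npartitions=2
    )

    # Without meta=, Dask calls the function on an empty dummy frame to
    # infer the output's columns and dtypes; meta= states them explicitly.
    out = ddf.map_partitions(
        lambda df: df.assign(ColC=df.ColA + df.ColB),
        meta={"ColA": "i8", "ColB": "i8", "ColC": "i8"},
    )
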
6 votes, 1 answer

How to execute a multi-threaded `merge()` with dask? How to use multiple cores via qsub?

I've just begun using dask, and I'm still fundamentally confused about how to do simple pandas tasks with multiple threads, or using a cluster. Let's take pandas.merge() with dask dataframes. import dask.dataframe as dd df1 =…
asked by ShanZhengYang
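
A minimal sketch of a threaded merge (the data is made up; for qsub, the separate dask-jobqueue project is the usual route):

    import pandas as pd
    import dask.dataframe as dd

    df1 = dd.from_pandas(pd.DataFrame({"k": range(100), "a": 1}), npartitions=4)
    df2 = dd.from_pandas(pd.DataFrame({"k": range(100), "b": 2}), npartitions=4)

    merged = dd.merge(df1, df2, on="k")            # lazy: builds the task graph
    result = merged.compute(scheduler="threads")   # run it on a local thread pool

    # For a qsub-managed cluster, dask-jobqueue can submit workers as
    # PBS/SGE jobs and hand back a distributed client.
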
6 votes, 1 answer

Lazy create Dask DataFrame from PostgreSQL / Cassandra

As I understand it, a Dask DataFrame is the proper way to handle table-like data. I have a table in PostgreSQL, and I know the way to load it into a pandas.DataFrame. I know odo can be used to convert a pandas.DataFrame to a dask.dataframe. But this is not lazy…
asked by Sklavit
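
A sketch using dd.read_sql_table, which is lazy by construction; the table name, connection URI, and index column are placeholders:

    import dask.dataframe as dd

    # read_sql_table is lazy: each partition issues its own bounded query
    # over the index column only when computed.
    ddf = dd.read_sql_table(
        "my_table",                                  # placeholder table name
        "postgresql://user:pass@host:5432/dbname",   # placeholder URI
        index_col="id",                              # an indexed, sortable column
        npartitions=16,
    )
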
6 votes, 1 answer

What is the return value of map_partitions?

The dask API says that map_partitions can be used to "apply a Python function on each DataFrame partition." From this description, and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like)…
asked by Arco Bast
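
A minimal sketch: map_partitions returns another lazy Dask collection, not a Python list, and compute() concatenates the per-partition results:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(4)}), npartitions=2)

    sizes = ddf.map_partitions(len)   # lazy dask Series, one element per partition
    print(sizes.compute())            # per-partition results, concatenated: 2, 2
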
6 votes, 1 answer

Dask dataframe: Memory error with merge

I'm playing with some GitHub user data and was trying to create a graph of all people in the same city. To do this I need to use the merge operation in dask. Unfortunately the GitHub user base size is 6M and it seems that the merge operation is…
asked by Prasanjit Prakash
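
A sketch of two common mitigations, smaller partitions and an indexed, disk-shuffled merge; the file pattern and column name are placeholders, and the shuffle= keyword name has varied across Dask versions:

    import dask.dataframe as dd

    users = dd.read_csv("users_*.csv", blocksize="64MB")   # smaller partitions

    # Merging on a sorted index is much cheaper than an unsorted join, and
    # shuffle="disk" spills intermediate shuffle data instead of holding it
    # all in memory.
    users = users.set_index("city", shuffle="disk")
    pairs = dd.merge(users, users, left_index=True, right_index=True)
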
6 votes, 1 answer

Multiplying large matrices with dask

I am working on a project which basically boils down to solving the matrix equation A.dot(x) = d where A is a matrix with dimensions roughly 10 000 000 by 2000 (I would like to increase this in both directions eventually). A obviously does not fit…
asked by sulkeh
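
A minimal sketch with dask.array, using the dimensions from the question; the chunk sizes are an arbitrary choice, and random data stands in for the real A and x:

    import dask.array as da

    # Chunked storage means only a few blocks of A are in memory at a time.
    A = da.random.random((10_000_000, 2000), chunks=(10_000, 2000))
    x = da.random.random(2000, chunks=2000)

    d = A.dot(x)      # lazy blocked matrix-vector product
    d = d.compute()   # the result (10M floats) easily fits in memory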