Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers, as the short sketch after this list illustrates.
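
A minimal sketch showing both layers at once with dask.array (the array shape and chunking are arbitrary choices for illustration):

    import dask.array as da

    # Collection layer: a 10,000 x 10,000 array split into 1,000 x 1,000 blocks.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # Operations build a task graph rather than executing immediately...
    y = (x + x.T).mean(axis=0)

    # ...and the scheduling layer runs that graph across local cores.
    print(y.compute())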

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

6 votes, 1 answer

How to load a directory of GRIB files into a Dask array

Suppose I have a directory with thousands of GRIB files. I want to load those files into a dask array so I can query them. How can I go about doing this? The attempt below seems to work, but it requires each GRIB file to be opened, and it takes a…
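
One common approach (a sketch, not taken from the question): let xarray build the dask-backed array lazily. This assumes the cfgrib engine is installed, one time step per file, and the directory path and variable name "t2m" are placeholders:

    import glob
    import xarray as xr

    paths = sorted(glob.glob("grib_dir/*.grib"))  # placeholder directory

    # open_mfdataset builds a lazy, dask-backed dataset; parallel=True
    # opens the files concurrently instead of one at a time.
    ds = xr.open_mfdataset(
        paths,
        engine="cfgrib",      # GRIB support via the cfgrib package
        combine="nested",
        concat_dim="time",    # assumes one time step per file
        parallel=True,
    )
    arr = ds["t2m"].data      # "t2m" is a placeholder; arr is a dask.array
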
6 votes, 2 answers

Loading local file from client onto dask distributed cluster

A bit of a beginner question, but I was not able to find a relevant answer on this. Essentially my data (about 7 GB) is located on my local machine. I have a distributed cluster running on the local network. How can I get this file onto the cluster?…
asked by Bot Man
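
A sketch of two usual options; the scheduler address and file name are placeholders:

    import pandas as pd
    from dask.distributed import Client

    client = Client("tcp://scheduler-address:8786")   # placeholder address

    # Option 1: load locally, then ship the object to the workers.
    df = pd.read_csv("local_data.csv")                # placeholder file
    remote_df = client.scatter(df)                    # returns a Future on the cluster

    # Option 2 (usually better for ~7 GB): copy the file to storage every
    # worker can reach (NFS, S3, HDFS, ...) and read it there lazily:
    #   import dask.dataframe as dd
    #   ddf = dd.read_csv("/shared/local_data.csv")
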
6 votes, 2 answers

On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

In the code snippet below, I would expect the logs to print the numbers 0-4. I understand that the numbers may not be in that order, as the task would be broken up into a number of parallel operations. Code snippet: from dask import dataframe as…
asked by kuanb
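
The usual explanation is that Dask first calls the function on a tiny dummy frame (filled with values like 1) to infer output types; passing meta= skips that pass. A minimal sketch with made-up data:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(5)}), npartitions=2)

    def log_row(row):
        print(row["x"])        # without meta=, this also fires on dummy rows of 1s
        return row["x"]

    # Supplying meta= tells Dask the output type up front, so it skips the
    # trial run on made-up data and only real rows are printed.
    result = ddf.apply(log_row, axis=1, meta=("x", "int64")).compute()
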
6 votes, 1 answer

Dask reading CSV, setting partition as CSV length

I'm trying to write code that will read a set of CSVs named my_file_*.csv into a Dask dataframe. Then I want to set the partitions based on the length of the CSV. I'm trying to map a function on each partition and in order to do that, each…
asked by abcdefg
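
A minimal sketch: read_csv with blocksize=None keeps one partition per input file (the glob pattern is taken from the question):

    import dask.dataframe as dd

    # blocksize=None disables splitting, so each input file becomes exactly
    # one partition and map_partitions sees whole CSVs.
    ddf = dd.read_csv("my_file_*.csv", blocksize=None)

    rows_per_file = ddf.map_partitions(len).compute()   # one length per CSV
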
6 votes, 1 answer

How can I see the data preview of Dask DataFrame?

I created a Dask DataFrame from a Pandas DataFrame and applied a few functions to it. When I try to view the data using df.head(), it takes too much time. How can I view the dataframe?
asked by Hari
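
A sketch of one common remedy, persist(), which assumes the intermediate result fits in memory; the frame and applied functions here are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=10)
    ddf = ddf.assign(y=ddf.x * 2)    # stand-in for the applied functions

    # head() only needs the first partition, but every lazy step before it
    # still has to run.  persist() computes and caches the intermediate
    # result, so repeated previews are fast.
    ddf = ddf.persist()
    print(ddf.head())
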
6 votes, 2 answers

move from pandas to dask to utilize all local cpu cores

Recently I stumbled upon http://dask.pydata.org/en/latest/. As I have some pandas code which only runs on a single core, I wonder how to make use of my other CPU cores. Would dask work well to use all (local) CPU cores? If yes, how compatible is it to…
asked by Georg Heiler
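
A minimal sketch of the usual migration path; the file name, column names "key" and "value", and the partition count are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.read_csv("data.csv")                # the existing single-core load
    ddf = dd.from_pandas(pdf, npartitions=8)     # e.g. one partition per core

    # Most pandas expressions work unchanged; by default .compute() runs
    # the graph on a local thread pool, using the available cores.
    result = ddf.groupby("key")["value"].mean().compute()
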
6 votes, 1 answer

Slow Dask performance on CSV date parsing?

I've been doing a lot of text processing on a big pile of files, including large CSVs and lots and lots of little XML files. Sometimes I'm doing aggregate counts, but a lot of the time I'm doing NLP-type work to take deeper looks at what is in these files…
asked by Mike Shea
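
One commonly suggested speedup (a sketch; the file pattern, column name, and format string are placeholders): skip parse_dates= in read_csv and convert afterwards with an explicit format, which avoids per-row format inference:

    import dask.dataframe as dd

    # Read the column as plain strings first...
    ddf = dd.read_csv("big_*.csv", dtype={"timestamp": "object"})

    # ...then convert with a known format, which is usually much faster.
    ddf["timestamp"] = dd.to_datetime(ddf["timestamp"], format="%Y-%m-%d %H:%M:%S")
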
6 votes, 1 answer

Read Directory of Timeseries CSV data efficiently with Dask DataFrame and Pandas

I have a directory of timeseries data stored as CSV files, one file per day. How do I load and process it efficiently with Dask DataFrame? Disclaimer: I maintain Dask. This question occurs often enough in other channels that I decided to add a…
asked by MRocklin
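
A minimal sketch of the usual pattern; the file pattern and column name are placeholders:

    import dask.dataframe as dd

    # One glob gives a single logical dataframe over all daily files.
    ddf = dd.read_csv("data/2000-*.csv", parse_dates=["timestamp"])

    # Sorting by time once makes later time-based slicing and resampling
    # efficient; set_index triggers a shuffle, so persist the result if it
    # will be reused.
    ddf = ddf.set_index("timestamp")
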
6 votes, 1 answer

Aggregation fails when using lambdas

I'm trying to port parts of my application from pandas to dask and I hit a roadblock when using a lambda function in a groupby on a dask DataFrame. import dask.dataframe as dd dask_df = dd.from_pandas(pandasDataFrame, npartitions=2) dask_df =…
asked by barney.balazs
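
A sketch of the dd.Aggregation route, which replaces a bare lambda with explicit per-partition and combine steps (the data here is made up):

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"g": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # A bare lambda can't be applied blindly: Dask must know how to combine
    # per-partition results.  dd.Aggregation spells out both steps (here, a
    # distributed max).
    dask_max = dd.Aggregation(
        name="dask_max",
        chunk=lambda s: s.max(),   # within each partition
        agg=lambda s: s.max(),     # across the partition results
    )
    print(ddf.groupby("g")["x"].agg(dask_max).compute())
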
6 votes, 1 answer

Python Dask - dataframe.map_partitions() return value

So dask.dataframe.map_partitions() takes a func argument and the meta kwarg. How exactly does it decide its return type? As an example: Lots of csv's in ...\some_folder. ddf = dd.read_csv(r"...\some_folder\*", usecols=['ColA', 'ColB'], …
asked by StarFox
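
A minimal sketch of how meta= interacts with the return type; the column names follow the question, the dtypes are assumptions:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(
        pd.DataFrame({"ColA": [1, 2, 3, 4], "ColB": [5, 6, 7, 8]}), npartitions=2
    )

    # Without meta=, Dask calls the function on an empty dummy frame to
    # infer the output's columns and dtypes; meta= states them explicitly.
    out = ddf.map_partitions(
        lambda df: df.assign(ColC=df.ColA + df.ColB),
        meta={"ColA": "i8", "ColB": "i8", "ColC": "i8"},
    )
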
6 votes, 1 answer

How to execute a multi-threaded `merge()` with dask? How to use multiple cores via qsub?

I've just begun using dask, and I'm still fundamentally confused about how to do simple pandas tasks with multiple threads, or using a cluster. Let's take pandas.merge() with dask dataframes. import dask.dataframe as dd df1 =…
asked by ShanZhengYang
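
A minimal sketch of a threaded merge (the data is made up; for qsub, the separate dask-jobqueue project is the usual route):

    import pandas as pd
    import dask.dataframe as dd

    df1 = dd.from_pandas(pd.DataFrame({"k": range(100), "a": 1}), npartitions=4)
    df2 = dd.from_pandas(pd.DataFrame({"k": range(100), "b": 2}), npartitions=4)

    merged = dd.merge(df1, df2, on="k")            # lazy: builds the task graph
    result = merged.compute(scheduler="threads")   # run it on a local thread pool

    # For a qsub-managed cluster, dask-jobqueue can submit workers as
    # PBS/SGE jobs and hand back a distributed client.
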
6 votes, 1 answer

Lazy create Dask DataFrame from PostgreSQL / Cassandra

As I understand it, a Dask DataFrame is the proper way to handle table-like data. I have a table in PostgreSQL, and I know the way to load it into a pandas.DataFrame. I know odo can be used to convert a pandas.DataFrame to a dask.dataframe. But this is not lazy…
asked by Sklavit
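
A sketch using dd.read_sql_table, which is lazy by construction; the table name, connection URI, and index column are placeholders:

    import dask.dataframe as dd

    # read_sql_table is lazy: each partition issues its own bounded query
    # over the index column only when computed.
    ddf = dd.read_sql_table(
        "my_table",                                  # placeholder table name
        "postgresql://user:pass@host:5432/dbname",   # placeholder URI
        index_col="id",                              # an indexed, sortable column
        npartitions=16,
    )
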
6 votes, 1 answer

What is the return value of map_partitions?

The dask API says that map_partitions can be used to "apply a Python function on each DataFrame partition." From this description, and according to the usual behaviour of "map", I would expect the return value of map_partitions to be (something like)…
asked by Arco Bast
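
A minimal sketch: map_partitions returns another lazy Dask collection, not a Python list, and compute() concatenates the per-partition results:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"x": range(4)}), npartitions=2)

    sizes = ddf.map_partitions(len)   # lazy dask Series, one element per partition
    print(sizes.compute())            # per-partition results, concatenated: 2, 2
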
6 votes, 1 answer

Dask dataframe: Memory error with merge

I'm playing with some GitHub user data and was trying to create a graph of all people in the same city. To do this I need to use the merge operation in dask. Unfortunately the GitHub user base size is 6M and it seems that the merge operation is…
asked by Prasanjit Prakash
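
A sketch of two common mitigations, smaller partitions and an indexed, disk-shuffled merge; the file pattern and column name are placeholders, and the shuffle= keyword name has varied across Dask versions:

    import dask.dataframe as dd

    users = dd.read_csv("users_*.csv", blocksize="64MB")   # smaller partitions

    # Merging on a sorted index is much cheaper than an unsorted join, and
    # shuffle="disk" spills intermediate shuffle data instead of holding it
    # all in memory.
    users = users.set_index("city", shuffle="disk")
    pairs = dd.merge(users, users, left_index=True, right_index=True)
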
6 votes, 1 answer

Multiplying large matrices with dask

I am working on a project which basically boils down to solving the matrix equation A.dot(x) = d where A is a matrix with dimensions roughly 10 000 000 by 2000 (I would like to increase this in both directions eventually). A obviously does not fit…
asked by sulkeh
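
A minimal sketch with dask.array, using the dimensions from the question; the chunk sizes are an arbitrary choice, and random data stands in for the real A and x:

    import dask.array as da

    # Chunked storage means only a few blocks of A are in memory at a time.
    A = da.random.random((10_000_000, 2000), chunks=(10_000, 2000))
    x = da.random.random(2000, chunks=2000)

    d = A.dot(x)      # lazy blocked matrix-vector product
    d = d.compute()   # the result (10M floats) easily fits in memory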