Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a short sketch of how they fit together follows this list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
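
A minimal sketch, with arbitrary names and sizes: the collection builds a task graph, and the dynamic scheduler executes it when compute() is called.

    import dask.array as da

    # a "Big Data" collection: a 10000x10000 array in 1000x1000 chunks,
    # mirroring the NumPy interface
    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # expressions only extend the task graph; nothing has run yet
    y = (x + x.T).mean(axis=0)

    # the dynamic task scheduler executes the graph in parallel
    print(y.compute())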

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2
votes
1 answer

How to select data with a list of indexes from a partitioned DF (non-unique indexes)?

Problem I have a dataframe df with indexes not monotonically increasing over 4 partitions, meaning every partition is indexed with [0..N]. I need to select rows based on an index list [0..M] where M > N. Using loc would yield an inconsistent…
w00dy
  • 748
  • 1
  • 6
  • 23
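
A sketch of one workaround (not from the question itself; the column name gid is made up): build a globally unique index with a cumulative sum, shuffle onto it with set_index, and loc then selects exactly one row per label.

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(8)})
    # reset_index(drop=True) numbers each partition 0..N-1 independently,
    # reproducing the non-unique index from the question
    ddf = dd.from_pandas(pdf, npartitions=4).reset_index(drop=True)

    # build a monotonically increasing global index and shuffle onto it
    ddf["gid"] = 1
    ddf["gid"] = ddf["gid"].cumsum() - 1
    ddf = ddf.set_index("gid")

    print(ddf.loc[[0, 5]].compute())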
2
votes
2 answers

Open large files from S3

When I try to open a large file from S3, I get a memory error. import dask.dataframe as dd df = dd.read_csv('s3://xxxx/test_0001_part_03.gz', storage_options={'anon': True}, compression='gzip', error_bad_lines=False) df.head() exception:…
shantanuo
  • 31,689
  • 78
  • 245
  • 403
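
The usual cause: gzip is not a splittable format, so the whole file must be loaded into a single partition. A sketch of the workaround, assuming a hypothetical uncompressed copy of the file that Dask can cut into blocks:

    import dask.dataframe as dd

    # an uncompressed copy can be split into many modest partitions
    df = dd.read_csv(
        "s3://xxxx/test_0001_part_03.csv",   # hypothetical uncompressed copy
        blocksize=64_000_000,                # ~64 MB per partition
        storage_options={"anon": True},
    )
    print(df.head())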
2
votes
1 answer

How to unregister all ProgressBars?

The dask documentation explains that a ProgressBar can be unregistered by calling pbar.unregister(), where pbar is the respective ProgressBar instance. However, this method only works if the user has access to that ProgressBar instance. Using…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
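
One possible approach, as a sketch relying on Dask's callback machinery: every registered callback, ProgressBar included, is tracked in the class-level set dask.callbacks.Callback.active, so clearing that set unregisters them all.

    from dask.callbacks import Callback

    # registered callbacks live here; emptying the set unregisters every
    # ProgressBar, even without a reference to the original instances
    Callback.active.clear()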
2
votes
1 answer

Hiding some parameters from the hashing in the delayed key (to_task_dask)

Consider the following usage: In [49]: class MyClass(dict): ...: def __init__(self,a): ...: self.a = a ...: def get(self): ...: return a ...: In [50]: a = MyClass(10) In [51]: @delayed(pure=True) …
julienl
  • 161
  • 12
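
One possible way around the hashing, as a sketch (the key name is arbitrary): pass dask_key_name when calling the delayed function, which fixes the task key explicitly instead of deriving it from the arguments.

    from dask import delayed

    class MyClass(dict):
        def __init__(self, a):
            self.a = a

    @delayed(pure=True)
    def get_a(obj):
        return obj.a

    # dask_key_name sets the key up front, so the argument is never
    # hashed to build it
    result = get_a(MyClass(10), dask_key_name="get_a-10")
    print(result.compute())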
2
votes
1 answer

Understanding How Partitions Work in Dask

I have a CSV with 17,850,209 rows, which is too large for Pandas to handle in my code, so I am trying to use Dask to operate on it. All of my code "works", but when I write a CSV to disk I don't get all of the 17,850,209 records. Instead I get N…
Frank B.
  • 1,813
  • 5
  • 24
  • 44
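
Worth noting: to_csv writes one file per partition by default, so inspecting a single part file looks like missing rows. A sketch, with a hypothetical input path, assuming a recent Dask version for the single_file option:

    import dask.dataframe as dd

    df = dd.read_csv("big.csv")        # hypothetical 17M-row input

    # default behaviour: one output file per partition via the * pattern,
    # so any single part file holds only a fraction of the rows
    df.to_csv("out-*.csv")

    # newer Dask versions can write everything to one file instead
    df.to_csv("out.csv", single_file=True)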
2
votes
1 answer

dask s3 access on ec2 workers

I am trying to read many CSV files from S3 with workers running on EC2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command: client =…
zseder
  • 1,099
  • 2
  • 12
  • 15
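
A sketch (bucket, address, and keys are placeholders) of the point that usually matters here: credentials are resolved on the workers at read time, not on the submitting machine, and can be passed explicitly via storage_options.

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")      # hypothetical address

    # reads happen on the workers, so their credentials are what count;
    # pass keys explicitly if the instances' IAM role lacks the bucket
    df = dd.read_csv(
        "s3://private-bucket/data-*.csv",       # hypothetical bucket
        storage_options={"key": "AKIA...", "secret": "..."},
    )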
2
votes
1 answer

Dask array to HDF5 parallel write fails with multiprocessing scheduler

Dask, being a well-documented, scalable library for parallel processing using graph-based workflows, is extremely useful for writing many applications that have inherent parallelism. However, while parallel writing to HDF5 files…
Suraj
  • 53
  • 5
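
A sketch of the single-process route: open h5py handles cannot be pickled, which is why the multiprocessing scheduler fails, while the default threaded scheduler keeps everything in one process.

    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # da.to_hdf5 guards the (unpicklable) h5py handle with a lock; with
    # the default threaded scheduler all writes stay in one process
    da.to_hdf5("out.h5", "/x", x)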
2
votes
1 answer

How to specify number of workers in Dask.array

Suppose you want to specify the number of workers in Dask.array. As the Dask documentation shows, you can set: dask.set_options(pool=ThreadPool(num_workers)). This works pretty well with some simulations I've run, for example Monte Carlo ones, but…
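
For reference, newer Dask releases replaced dask.set_options with dask.config.set, and the local schedulers also accept num_workers directly. A sketch:

    import dask
    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # cap the scheduler's pool for everything in this block
    with dask.config.set(num_workers=4):
        result = x.mean().compute()

    # or cap the pool for a single call
    result = x.mean().compute(num_workers=4)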
2
votes
2 answers

dask using delayed to construct a list of functions but specifying the number of processes to use

I have a function to do computation; here is a simple one as an example: def add(a,b): return a+b And then I execute this function 100 times in an embarrassingly parallel way: output = [delayed(add)(i,i+1) for i in…
tesla1060
  • 2,621
  • 6
  • 31
  • 43
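
A sketch of one way to do it: pick the scheduler and bound its pool when calling compute.

    from dask import compute, delayed

    def add(a, b):
        return a + b

    output = [delayed(add)(i, i + 1) for i in range(100)]

    # run the graph on the multiprocessing scheduler with 4 processes
    results = compute(*output, scheduler="processes", num_workers=4)
    print(results[:5])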
2
votes
1 answer

Grouping dask.bag items into distinct partitions

I was wondering if somebody could help me understand the way Bag objects handle partitions. Put simply, I am trying to group items currently in a Bag so that each group is in its own partition. What's confusing me is that the Bag.groupby() method…
ajmazurie
  • 509
  • 4
  • 8
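
For what it's worth, groupby takes an npartitions argument for the shuffled output. A sketch, with the caveat that groups are assigned to partitions by hash, so one group per partition is not strictly guaranteed:

    import dask.bag as db

    b = db.from_sequence(range(10), npartitions=3)

    # groupby shuffles; npartitions sets how many output partitions the
    # (key, group) pairs are hashed into
    grouped = b.groupby(lambda x: x % 2, npartitions=2)
    print(grouped.compute())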
2
votes
0 answers

Dask datetime Python3 dataframe read_csv

With a csv file laid out like this dtime,Ask,Bid,AskVolume,BidVolume 2003-08-04 00:01:06.430000,1.93273,1.93233,2400000,5100000 2003-08-04 00:01:15.419000,1.93256,1.93211,21900000,4000000 2003-08-04…
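
A sketch for a file of that shape (the path is hypothetical): read_csv forwards parse_dates to pandas, so the timestamp column comes in as datetime64 rather than object and can become the index directly.

    import dask.dataframe as dd

    # parse_dates is forwarded to pandas.read_csv for every block
    df = dd.read_csv("ticks.csv", parse_dates=["dtime"])   # hypothetical path
    df = df.set_index("dtime")
    print(df.dtypes)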
2
votes
1 answer

Scikit-Learn with Dask-Distributed using nested parallelism?

For example, suppose I have the code: vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace') classifier = OneVsRestClassifier(LinearSVC()) pipeline = Pipeline([ ('vect', vectorizer), ('clf', classifier)]) with…
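
A sketch of the pattern that usually answers this (cluster, files, and labels are placeholders; assumes a distributed installation recent enough to register the joblib backend): run the fit under joblib's dask backend, so the estimator's internal joblib parallelism fans out to the workers.

    import joblib
    from dask.distributed import Client
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    client = Client()                        # hypothetical local cluster

    pipeline = Pipeline([
        ("vect", CountVectorizer(input="filename", decode_error="replace")),
        ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=-1)),
    ])

    filenames = ["doc1.txt", "doc2.txt"]     # hypothetical training files
    labels = [0, 1]

    # inside this context, scikit-learn's internal joblib parallelism
    # (the n_jobs=-1 above) dispatches its fits to the dask workers
    with joblib.parallel_backend("dask"):
        pipeline.fit(filenames, labels)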
2
votes
1 answer

Dask rolling function by group syntax

I struggled for a while to get the syntax right for calculating a rolling function by group on a dask dataframe. The documentation is excellent, but in this case does not have an example. The working version I have is as follows, from a csv that…
J. Patanian
  • 71
  • 1
  • 5
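
One syntax that works, as a sketch with hypothetical column names group and value: a groupby-apply that runs a pandas rolling computation per group, with meta describing the output Dask cannot infer.

    import dask.dataframe as dd

    df = dd.read_csv("data.csv")   # hypothetical csv with group/value columns

    result = (
        df.groupby("group")["value"]
          # each group reaches the lambda as a pandas Series; meta gives
          # dask the output name and dtype
          .apply(lambda s: s.rolling(3).mean(), meta=("value", "f8"))
    )
    print(result.compute())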
2
votes
1 answer

Forking, sqlalchemy, and scoped sessions

I'm getting the following error (which I assume is because of the forking in my application): "This result object does not return rows". Traceback --------- File "/opt/miniconda/envs/analytical-engine/lib/python2.7/site-packages/dask/async.py",…
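
That error is characteristic of a database connection created before the fork and then reused by child processes. A sketch of the common fix (the DSN is hypothetical): create the engine inside the task that runs on the worker, so each process owns its connection.

    from sqlalchemy import create_engine, text

    def load_rows(sql):
        # building the engine inside the task means each forked dask
        # worker gets its own connection instead of inheriting the
        # parent's socket
        engine = create_engine("postgresql://localhost/mydb")  # hypothetical
        with engine.connect() as conn:
            return conn.execute(text(sql)).fetchall()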
2
votes
0 answers

dask set_index gives a different index type than from_pandas?

I am trying to read a csv using dask and then resample it based on its timestamp index. The csv file has content like: Time,data 2015-01-01,0 2015-01-02,1 2015-01-03,2 2015-01-04,3 ... Method 1: Using dask to load the data directly and then setup…
DigitalPig
  • 83
  • 6
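
A sketch of the usual reconciliation (the path is hypothetical): parse the column as a date at read time, so that set_index yields a DatetimeIndex, the same type from_pandas carries over, and resample works in both methods.

    import dask.dataframe as dd

    # parsing Time during the read makes set_index produce a DatetimeIndex
    df = dd.read_csv("data.csv", parse_dates=["Time"]).set_index("Time")
    print(df.resample("2D").mean().compute())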