Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components (a short sketch of how they fit together follows this list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
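
A minimal sketch, with arbitrary names and sizes: the collection builds a task graph, and the dynamic scheduler executes it when compute() is called.

    import dask.array as da

    # a "Big Data" collection: a 10000x10000 array in 1000x1000 chunks,
    # mirroring the NumPy interface
    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # expressions only extend the task graph; nothing has run yet
    y = (x + x.T).mean(axis=0)

    # the dynamic task scheduler executes the graph in parallel
    print(y.compute())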

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
2
votes
1 answer

How to select data with a list of indexes from a partitioned DF (non-unique indexes)?

Problem I have a dataframe df with indexes not monotonically increasing over 4 partitions, meaning every partition is indexed with [0..N]. I need to select rows based on an index list [0..M] where M > N. Using loc would yield an inconsistent…
w00dy
  • 748
  • 1
  • 6
  • 23
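
A sketch of one workaround (not from the question itself; the column name gid is made up): build a globally unique index with a cumulative sum, shuffle onto it with set_index, and loc then selects exactly one row per label.

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": range(8)})
    # reset_index(drop=True) numbers each partition 0..N-1 independently,
    # reproducing the non-unique index from the question
    ddf = dd.from_pandas(pdf, npartitions=4).reset_index(drop=True)

    # build a monotonically increasing global index and shuffle onto it
    ddf["gid"] = 1
    ddf["gid"] = ddf["gid"].cumsum() - 1
    ddf = ddf.set_index("gid")

    print(ddf.loc[[0, 5]].compute())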
2
votes
2 answers

Open large files from S3

When I try to open a large file from S3, I get a memory error. import dask.dataframe as dd df = dd.read_csv('s3://xxxx/test_0001_part_03.gz', storage_options={'anon': True}, compression='gzip', error_bad_lines=False) df.head() exception:…
shantanuo
  • 31,689
  • 78
  • 245
  • 403
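
The usual cause: gzip is not a splittable format, so the whole file must be loaded into a single partition. A sketch of the workaround, assuming a hypothetical uncompressed copy of the file that Dask can cut into blocks:

    import dask.dataframe as dd

    # an uncompressed copy can be split into many modest partitions
    df = dd.read_csv(
        "s3://xxxx/test_0001_part_03.csv",   # hypothetical uncompressed copy
        blocksize=64_000_000,                # ~64 MB per partition
        storage_options={"anon": True},
    )
    print(df.head())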
2
votes
1 answer

How to unregister all ProgressBars?

The dask documentation explains that a ProgressBar can be unregistered by calling pbar.unregister(), where pbar is the respective ProgressBar instance. However, this method only works if the user has access to that ProgressBar instance. Using…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
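
One possible approach, as a sketch relying on Dask's callback machinery: every registered callback, ProgressBar included, is tracked in the class-level set dask.callbacks.Callback.active, so clearing that set unregisters them all.

    from dask.callbacks import Callback

    # registered callbacks live here; emptying the set unregisters every
    # ProgressBar, even without a reference to the original instances
    Callback.active.clear()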
2
votes
1 answer

Hiding some parameters from the hashing in the delayed key (to_task_dask)

Consider the following usage: In [49]: class MyClass(dict): ...: def __init__(self,a): ...: self.a = a ...: def get(self): ...: return a ...: In [50]: a = MyClass(10) In [51]: @delayed(pure=True) …
julienl
  • 161
  • 12
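
One possible way around the hashing, as a sketch (the key name is arbitrary): pass dask_key_name when calling the delayed function, which fixes the task key explicitly instead of deriving it from the arguments.

    from dask import delayed

    class MyClass(dict):
        def __init__(self, a):
            self.a = a

    @delayed(pure=True)
    def get_a(obj):
        return obj.a

    # dask_key_name sets the key up front, so the argument is never
    # hashed to build it
    result = get_a(MyClass(10), dask_key_name="get_a-10")
    print(result.compute())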
2
votes
1 answer

Understanding How Partitions Work in Dask

I have a CSV with 17,850,209 rows, which is too large for Pandas to handle in my code, so I am trying to use Dask to operate on it. All of my code "works", but when I write a CSV to disk I don't get all of the 17,850,209 records. Instead I get N…
Frank B.
  • 1,813
  • 5
  • 24
  • 44
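
Worth noting: to_csv writes one file per partition by default, so inspecting a single part file looks like missing rows. A sketch, with a hypothetical input path, assuming a recent Dask version for the single_file option:

    import dask.dataframe as dd

    df = dd.read_csv("big.csv")        # hypothetical 17M-row input

    # default behaviour: one output file per partition via the * pattern,
    # so any single part file holds only a fraction of the rows
    df.to_csv("out-*.csv")

    # newer Dask versions can write everything to one file instead
    df.to_csv("out.csv", single_file=True)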
2
votes
1 answer

dask s3 access on ec2 workers

I am trying to read many CSV files from S3 with workers running on EC2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command: client =…
zseder
  • 1,099
  • 2
  • 12
  • 15
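
A sketch (bucket, address, and keys are placeholders) of the point that usually matters here: credentials are resolved on the workers at read time, not on the submitting machine, and can be passed explicitly via storage_options.

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client("scheduler-host:8786")      # hypothetical address

    # reads happen on the workers, so their credentials are what count;
    # pass keys explicitly if the instances' IAM role lacks the bucket
    df = dd.read_csv(
        "s3://private-bucket/data-*.csv",       # hypothetical bucket
        storage_options={"key": "AKIA...", "secret": "..."},
    )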
2
votes
1 answer

Dask array to HDF5 parallel write fails with multiprocessing scheduler

Dask, being a well-documented, scalable library for parallel processing using graph-based workflows, is extremely useful for writing many applications that have inherent parallelism. However, while parallel writing to HDF5 files…
Suraj
  • 53
  • 5
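
A sketch of the single-process route: open h5py handles cannot be pickled, which is why the multiprocessing scheduler fails, while the default threaded scheduler keeps everything in one process.

    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # da.to_hdf5 guards the (unpicklable) h5py handle with a lock; with
    # the default threaded scheduler all writes stay in one process
    da.to_hdf5("out.h5", "/x", x)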
2
votes
1 answer

How to specify number of workers in Dask.array

Suppose you want to specify the number of workers in Dask.array. As the Dask documentation shows, you can set: dask.set_options(pool=ThreadPool(num_workers)). This works pretty well with some simulations I've run, for example Monte Carlo ones, but…
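
For reference, newer Dask releases replaced dask.set_options with dask.config.set, and the local schedulers also accept num_workers directly. A sketch:

    import dask
    import dask.array as da

    x = da.random.random((10000, 10000), chunks=(1000, 1000))

    # cap the scheduler's pool for everything in this block
    with dask.config.set(num_workers=4):
        result = x.mean().compute()

    # or cap the pool for a single call
    result = x.mean().compute(num_workers=4)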
2
votes
2 answers

dask using delayed to construct a list of functions but specifying the number of processes to use

I have a function to do computation; here is a simple one as an example: def add(a,b): return a+b And then I execute this function 100 times in an embarrassingly parallel way: output = [delayed(add)(i,i+1) for i in…
tesla1060
  • 2,621
  • 6
  • 31
  • 43
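
A sketch of one way to do it: pick the scheduler and bound its pool when calling compute.

    from dask import compute, delayed

    def add(a, b):
        return a + b

    output = [delayed(add)(i, i + 1) for i in range(100)]

    # run the graph on the multiprocessing scheduler with 4 processes
    results = compute(*output, scheduler="processes", num_workers=4)
    print(results[:5])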
2
votes
1 answer

Grouping dask.bag items into distinct partitions

I was wondering if somebody could help me understand the way Bag objects handle partitions. Put simply, I am trying to group items currently in a Bag so that each group is in its own partition. What's confusing me is that the Bag.groupby() method…
ajmazurie
  • 509
  • 4
  • 8
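
For what it's worth, groupby takes an npartitions argument for the shuffled output. A sketch, with the caveat that groups are assigned to partitions by hash, so one group per partition is not strictly guaranteed:

    import dask.bag as db

    b = db.from_sequence(range(10), npartitions=3)

    # groupby shuffles; npartitions sets how many output partitions the
    # (key, group) pairs are hashed into
    grouped = b.groupby(lambda x: x % 2, npartitions=2)
    print(grouped.compute())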
2
votes
0 answers

Dask datetime Python3 dataframe read_csv

With a csv file laid out like this dtime,Ask,Bid,AskVolume,BidVolume 2003-08-04 00:01:06.430000,1.93273,1.93233,2400000,5100000 2003-08-04 00:01:15.419000,1.93256,1.93211,21900000,4000000 2003-08-04…
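
A sketch for a file of that shape (the path is hypothetical): read_csv forwards parse_dates to pandas, so the timestamp column comes in as datetime64 rather than object and can become the index directly.

    import dask.dataframe as dd

    # parse_dates is forwarded to pandas.read_csv for every block
    df = dd.read_csv("ticks.csv", parse_dates=["dtime"])   # hypothetical path
    df = df.set_index("dtime")
    print(df.dtypes)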
2
votes
1 answer

Scikit-Learn with Dask-Distributed using nested parallelism?

For example, suppose I have the code: vectorizer = CountVectorizer(input=u'filename', decode_error=u'replace') classifier = OneVsRestClassifier(LinearSVC()) pipeline = Pipeline([ ('vect', vectorizer), ('clf', classifier)]) with…
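
A sketch of the pattern that usually answers this (cluster, files, and labels are placeholders; assumes a distributed installation recent enough to register the joblib backend): run the fit under joblib's dask backend, so the estimator's internal joblib parallelism fans out to the workers.

    import joblib
    from dask.distributed import Client
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    client = Client()                        # hypothetical local cluster

    pipeline = Pipeline([
        ("vect", CountVectorizer(input="filename", decode_error="replace")),
        ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=-1)),
    ])

    filenames = ["doc1.txt", "doc2.txt"]     # hypothetical training files
    labels = [0, 1]

    # inside this context, scikit-learn's internal joblib parallelism
    # (the n_jobs=-1 above) dispatches its fits to the dask workers
    with joblib.parallel_backend("dask"):
        pipeline.fit(filenames, labels)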
2
votes
1 answer

Dask rolling function by group syntax

I struggled for a while to get the syntax right for calculating a rolling function by group on a dask dataframe. The documentation is excellent, but in this case does not have an example. The working version I have is as follows, from a csv that…
J. Patanian
  • 71
  • 1
  • 5
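
One syntax that works, as a sketch with hypothetical column names group and value: a groupby-apply that runs a pandas rolling computation per group, with meta describing the output Dask cannot infer.

    import dask.dataframe as dd

    df = dd.read_csv("data.csv")   # hypothetical csv with group/value columns

    result = (
        df.groupby("group")["value"]
          # each group reaches the lambda as a pandas Series; meta gives
          # dask the output name and dtype
          .apply(lambda s: s.rolling(3).mean(), meta=("value", "f8"))
    )
    print(result.compute())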
2
votes
1 answer

Forking, sqlalchemy, and scoped sessions

I'm getting the following error (which I assume is because of the forking in my application): "This result object does not return rows". Traceback --------- File "/opt/miniconda/envs/analytical-engine/lib/python2.7/site-packages/dask/async.py",…
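
That error is characteristic of a database connection created before the fork and then reused by child processes. A sketch of the common fix (the DSN is hypothetical): create the engine inside the task that runs on the worker, so each process owns its connection.

    from sqlalchemy import create_engine, text

    def load_rows(sql):
        # building the engine inside the task means each forked dask
        # worker gets its own connection instead of inheriting the
        # parent's socket
        engine = create_engine("postgresql://localhost/mydb")  # hypothetical
        with engine.connect() as conn:
            return conn.execute(text(sql)).fetchall()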
2
votes
0 answers

dask set_index gives a different index type than from_pandas?

I am trying to read a csv using dask and then resample it based on its timestamp index. The csv file has content like: Time,data 2015-01-01,0 2015-01-02,1 2015-01-03,2 2015-01-04,3 ... Method 1: Using dask to load the data directly and then setup…
DigitalPig
  • 83
  • 6
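
A sketch of the usual reconciliation (the path is hypothetical): parse the column as a date at read time, so that set_index yields a DatetimeIndex, the same type from_pandas carries over, and resample works in both methods.

    import dask.dataframe as dd

    # parsing Time during the read makes set_index produce a DatetimeIndex
    df = dd.read_csv("data.csv", parse_dates=["Time"]).set_index("Time")
    print(df.resample("2D").mean().compute())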