Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask · Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 1 answer

Converting numpy array into dask dataframe column?

I have a numpy array that I want to add as a column to an existing dask dataframe. enc = LabelEncoder() nparr = enc.fit_transform(X[['url']]) I have ddf of type dask dataframe. ddf['nurl'] = nparr ??? Is there any elegant way to achieve the above…
Irshad Ali · 1,153

3 votes · 1 answer

Using Dask from script

Is it possible to run dask from a Python script? In an interactive session I can just write from dask.distributed import Client client = Client() as described in all tutorials. If I write these lines however in a script.py file and execute it python…
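The usual answer (hedged, since the asker's script isn't shown) is that creating a Client at module level can fail because the default multiprocessing-based local cluster re-imports the script; guarding the entry point, or using an in-process thread-based client, avoids this:

```python
from dask.distributed import Client

def main():
    # processes=False keeps workers in-process (threads), which sidesteps
    # the re-import problem entirely; with the default processes=True the
    # __main__ guard below is what prevents workers re-running the script.
    client = Client(processes=False)
    result = client.submit(sum, [1, 2, 3]).result()
    client.close()
    return result

if __name__ == "__main__":
    print(main())
```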
DerWeh · 1,721

3 votes · 1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan · 183

3 votes · 1 answer

How to skip bad lines when reading with dask?

I am trying to read a .txt with dask (7 million rows approximately). However, there are like 4000 rows that mismatch the dtype of the column: +-----------------------------+--------+----------+ | Column | Found | Expected…
davidaap · 1,569

3 votes · 1 answer

Dask map_blocks - IndexError: tuple index out of range

I want to do the following with Dask: load a matrix from an HDF5 file, then parallelize the calculation of each entry. Here is my code: def blocked_func(x): return np.random.random() with h5py.File(file_path) as f: d = f['/data'] arr =…
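A likely cause (hedged, since the full traceback isn't shown) is that the function passed to map_blocks returns a bare scalar, so dask cannot infer output chunks from the block shape; returning an array shaped like the block avoids the error. A self-contained sketch without the HDF5 part:

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

def blocked_func(block):
    # Return an array shaped like the block, not a bare scalar,
    # so dask can infer the output chunk structure
    return np.random.random(block.shape)

y = x.map_blocks(blocked_func, dtype=float)
result = y.compute()
print(result.shape)
```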
Andy R · 1,339

3 votes · 1 answer

How does Dask handle external or global variables in function definitions?

If I have a function that depends on some global or other constant like the following: x = 123 def f(partition): return partition + x # note that x is defined outside this function df = df.map_partitions(f) Does this work? Or do I need to…
MRocklin · 55,641

3 votes · 1 answer

Why does first import of skimage fail, but second one succeed?

When I import skimage, I get an odd error message that seems to be connected to version mismatch issues with scikit-image, numpy and dask, but if I immediately try to import again, everything is fine -- i.e. (base) me@balin:~$ python Python 2.7.15…
user1245262 · 6,968

3 votes · 2 answers

How to read a large parquet file as multiple dataframes?

I am trying to convert a large parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. So is there any way to read the parquet into multiple dataframes over a loop?
Rahul · 161

3 votes · 2 answers

Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

Today I began using the Dask and Paramiko packages, partly as a learning exercise, and partly because I'm beginning a project that will require dealing with large datasets (10s of GB) that must be accessed from a remote VM only (i.e. cannot store…
digital_hen · 91

3 votes · 1 answer

reading a Dask DataFrame from CSVs in a deep S3 path hierarchy

I am trying to read a set of CSVs in S3 into a Dask DataFrame. The bucket has a deep hierarchy and contains some metadata files as well. The call looks like dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv') This causes Dask to hang. The real…
Daniel Mahler · 7,653

3 votes · 2 answers

How to set up (calculate) divisions in dask dataframe?

When loading data from parquet or csv files, the divisions are None. The Dask docs have no information about how to set or calculate them. How do I correctly set up the divisions of a Dask dataframe?
VadimCh · 71

3 votes · 1 answer

randomly mask/set nan x% of data points in huge xarray.DataArray

I have a huge (~ 2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where the probability for every data point to be chosen for deletion/masking is the same…
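With a dask-backed array (the same idea carries over to a dask-backed xarray.DataArray through its .data attribute), one approach is to draw a uniform random array with the same chunking and mask through da.where; a hedged sketch at a smaller size:

```python
import numpy as np
import dask.array as da

frac = 0.1  # fraction of points to mask
arr = da.random.random((1000, 1000), chunks=(500, 500))

# Each element is dropped independently with probability `frac`
mask = da.random.random(arr.shape, chunks=arr.chunks) < frac
masked = da.where(mask, np.nan, arr)

nan_frac = da.isnan(masked).mean().compute()
print(nan_frac)
```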
climachine · 55

3 votes · 1 answer

dask.dataframe.read_parquet takes too long

I tried to read parquet from s3 like this: import dask.dataframe as dd s3_path = "s3://my_bucket/my_table" times = dd.read_parquet( s3_path, storage_options={ "client_kwargs": { …
Sean Nguyen · 12,528

3 votes · 1 answer

clustering large data set using dask

I've installed dask. My main aim is clustering a large dataset, but before starting work on it, I want to make a few tests. However, whenever I run a piece of dask code, it takes too much time and a memory error appears at the end. I tried…
3 votes · 1 answer

Dask Distributed - Same persist data multiple clients

We are trying Dask Distributed to make some heavy computes and visualization for a frontend. Now we have one worker with gunicorn that connects to an existing Distributed Dask cluster, the worker uploads the data currently with read_csv and persist…
CValenzu · 31