Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask · Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes · 1 answer

Converting numpy array into dask dataframe column?

I have a numpy array that I want to add as a column to an existing dask dataframe. enc = LabelEncoder() nparr = enc.fit_transform(X[['url']]) I have ddf of type dask dataframe. ddf['nurl'] = nparr ??? Is there any elegant way to achieve the above…
Irshad Ali · 1,153

3 votes · 1 answer

Using Dask from script

Is it possible to run dask from a Python script? In an interactive session I can just write from dask.distributed import Client client = Client() as described in all tutorials. If I write these lines however in a script.py file and execute it python…
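The usual answer (hedged, since the asker's script isn't shown) is that creating a Client at module level can fail because the default multiprocessing-based local cluster re-imports the script; guarding the entry point, or using an in-process thread-based client, avoids this:

```python
from dask.distributed import Client

def main():
    # processes=False keeps workers in-process (threads), which sidesteps
    # the re-import problem entirely; with the default processes=True the
    # __main__ guard below is what prevents workers re-running the script.
    client = Client(processes=False)
    result = client.submit(sum, [1, 2, 3]).result()
    client.close()
    return result

if __name__ == "__main__":
    print(main())
```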
DerWeh · 1,721

3 votes · 1 answer

Dask - Quickest way to get row length of each partition in a Dask dataframe

I'd like to get the length of each partition in a number of dataframes. I'm presently getting each partition and then getting the size of the index for each partition. This is very, very slow. Is there a better way? Here's a simplified snippet of…
dan · 183

3 votes · 1 answer

How to skip bad lines when reading with dask?

I am trying to read a .txt with dask (7 million rows approximately). However, there are like 4000 rows that mismatch the dtype of the column: +-----------------------------+--------+----------+ | Column | Found | Expected…
davidaap · 1,569

3 votes · 1 answer

Dask map_blocks - IndexError: tuple index out of range

I want to do the following with Dask: load a matrix from an HDF5 file, then parallelize the calculation of each entry. Here is my code: def blocked_func(x): return np.random.random() with h5py.File(file_path) as f: d = f['/data'] arr =…
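A likely cause (hedged, since the full traceback isn't shown) is that the function passed to map_blocks returns a bare scalar, so dask cannot infer output chunks from the block shape; returning an array shaped like the block avoids the error. A self-contained sketch without the HDF5 part:

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))

def blocked_func(block):
    # Return an array shaped like the block, not a bare scalar,
    # so dask can infer the output chunk structure
    return np.random.random(block.shape)

y = x.map_blocks(blocked_func, dtype=float)
result = y.compute()
print(result.shape)
```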
Andy R · 1,339

3 votes · 1 answer

How does Dask handle external or global variables in function definitions?

If I have a function that depends on some global or other constant like the following: x = 123 def f(partition): return partition + x # note that x is defined outside this function df = df.map_partitions(f) Does this work? Or do I need to…
MRocklin · 55,641

3 votes · 1 answer

Why does first import of skimage fail, but second one succeed?

When I import skimage, I get an odd error message that seems to be connected to version mismatch issues with scikit-image, numpy and dask, but if I immediately try to import again, everything is fine -- i.e. (base) me@balin:~$ python Python 2.7.15…
user1245262 · 6,968

3 votes · 2 answers

How to read a large parquet file as multiple dataframes?

I am trying to convert a large parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. So is there any way to read the parquet into multiple dataframes over a loop?
Rahul · 161

3 votes · 2 answers

Is it possible to read a .csv from a remote server, using Paramiko and Dask's read_csv() method in conjunction?

Today I began using the Dask and Paramiko packages, partly as a learning exercise, and partly because I'm beginning a project that will require dealing with large datasets (10s of GB) that must be accessed from a remote VM only (i.e. cannot store…
digital_hen · 91

3 votes · 1 answer

reading a Dask DataFrame from CSVs in a deep S3 path hierarchy

I am trying to read a set of CSVs in S3 into a Dask DataFrame. The bucket has a deep hierarchy and contains some metadata files as well. The call looks like dd.read_csv('s3://mybucket/dataset/*/*/*/*/*/*.csv') This causes Dask to hang. The real…
Daniel Mahler · 7,653

3 votes · 2 answers

How to set up (calculate) divisions in dask dataframe?

When loading data from parquet or csv files, the divisions are None. The Dask docs have no information about how to set or calculate them. How do I correctly set up the divisions of a Dask dataframe?
VadimCh · 71

3 votes · 1 answer

randomly mask/set nan x% of data points in huge xarray.DataArray

I have a huge (~ 2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where the probability for every data point to be chosen for deletion/masking is the same…
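With a dask-backed array (the same idea carries over to a dask-backed xarray.DataArray through its .data attribute), one approach is to draw a uniform random array with the same chunking and mask through da.where; a hedged sketch at a smaller size:

```python
import numpy as np
import dask.array as da

frac = 0.1  # fraction of points to mask
arr = da.random.random((1000, 1000), chunks=(500, 500))

# Each element is dropped independently with probability `frac`
mask = da.random.random(arr.shape, chunks=arr.chunks) < frac
masked = da.where(mask, np.nan, arr)

nan_frac = da.isnan(masked).mean().compute()
print(nan_frac)
```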
climachine · 55

3 votes · 1 answer

dask.dataframe.read_parquet takes too long

I tried to read parquet from s3 like this: import dask.dataframe as dd s3_path = "s3://my_bucket/my_table" times = dd.read_parquet( s3_path, storage_options={ "client_kwargs": { …
Sean Nguyen · 12,528

3 votes · 1 answer

clustering large data set using dask

I've installed dask. My main aim is clustering a large dataset, but before starting work on it, I want to make a few tests. However, whenever I run a piece of dask code, it takes too much time and a memory error appears at the end. I tried…
3 votes · 1 answer

Dask Distributed - Same persist data multiple clients

We are trying Dask Distributed to make some heavy computes and visualization for a frontend. Now we have one worker with gunicorn that connects to an existing Distributed Dask cluster, the worker uploads the data currently with read_csv and persist…
CValenzu · 31