Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like Numpy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers; a minimal sketch of both pieces follows this list.
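
A minimal sketch of both components, assuming only that dask is installed; the CSV pattern and column names in the second half are placeholders:

    import dask
    import dask.dataframe as dd

    # Dynamic task scheduling: build a small task graph lazily, then run it.
    inc = dask.delayed(lambda x: x + 1)
    total = dask.delayed(sum)([inc(i) for i in range(10)])
    print(total.compute())  # executes the graph on the default scheduler

    # "Big Data" collection: a dask DataFrame mirrors the pandas interface but
    # is backed by many pandas DataFrames and the same task schedulers.
    ddf = dd.read_csv("data-*.csv")                     # placeholder file pattern
    print(ddf.groupby("key").value.mean().compute())    # placeholder column names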

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
19
votes
2 answers

What is the "right" way to close a Dask LocalCluster?

I am trying to use dask-distributed on my laptop using a LocalCluster, but I still have not found a way to let my application close without raising some warnings or triggering some strange interactions with matplotlib (I am using the tkAgg…
SteP
  • 262
  • 1
  • 2
  • 9
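
One way to avoid the shutdown warnings, sketched here as a common pattern rather than the canonical answer, is to manage the cluster and client as context managers (or call close() on both, client first) so everything is torn down before the interpreter exits:

    from dask.distributed import Client, LocalCluster

    # The context managers guarantee client.close() and cluster.close()
    # run in the right order before the process (and matplotlib) shut down.
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            future = client.submit(sum, range(10))
            print(future.result())
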
19
votes
1 answer

How to use all the cpu cores using Dask?

I have a pandas series with more than 35000 rows. I want to use dask to make it more efficient. However, both the dask code and the pandas code are taking the same time. Initially "ser" is a pandas series and fun1 and fun2 are basic functions…
ANKIT JHA
  • 359
  • 1
  • 3
  • 9
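
A sketch of one pattern for the question above: wrap the series in a dask collection with several partitions and compute on the multiprocessing scheduler, so CPU-bound pure-Python functions are not serialized by the GIL. fun1 is a stand-in for the poster's function:

    import pandas as pd
    import dask.dataframe as dd

    def fun1(x):                                  # placeholder for the real work
        return x * 2 + 1

    ser = pd.Series(range(35_000))
    dser = dd.from_pandas(ser, npartitions=8)     # split across 8 partitions
    result = (dser.map(fun1, meta=("x", "int64"))
                  .compute(scheduler="processes"))  # one task per partition, run in processes

For a series this small, scheduling and serialization overhead can easily outweigh the gain, which is often why the dask and pandas versions take the same time.
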
19
votes
2 answers

What is the role of npartitions in a Dask dataframe?

I see the parameter npartitions in many functions, but I don't understand what it is good for / used for. http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv head(...) Elements are only taken from the first npartitions, with…
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
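
A short illustration of what npartitions means: it is the number of pandas DataFrames that back the dask DataFrame, and methods such as head() only look at the first npartitions of them by default:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)    # 4 underlying pandas DataFrames

    print(ddf.npartitions)                      # -> 4
    print(ddf.map_partitions(len).compute())    # rows held by each partition
    print(ddf.head(10, npartitions=1))          # reads only the first partition
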
19
votes
3 answers

Nested data in Parquet with Python

I have a file that has one JSON per line. Here is a sample: { "product": { "id": "abcdef", "price": 19.99, "specs": { "voltage": "110v", "color": "white" } }, "user": "Daniel…
Daniel Severo
  • 1,768
  • 2
  • 15
  • 22
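
One possible approach for the question above, sketched rather than taken from the accepted answer: pyarrow can read newline-delimited JSON into an Arrow table whose nested objects become struct columns, and Parquet stores those natively. File names are placeholders:

    import pyarrow.json as pa_json
    import pyarrow.parquet as pq

    # Each line of records.jsonl is one JSON object like the sample above;
    # nested objects such as "specs" become Arrow struct columns.
    table = pa_json.read_json("records.jsonl")
    pq.write_table(table, "records.parquet")

    # Reading back shows the nesting is preserved in the schema.
    print(pq.read_table("records.parquet").schema)
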
19
votes
5 answers

How to convert a dask dataframe column with to_datetime

I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code: import dask.dataframe as dd df['time'].map_partitions(pd.to_datetime,…
dleal
  • 2,244
  • 6
  • 27
  • 49
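
A sketch of two spellings that work in current dask, assuming the column is called 'time'; the explicit meta in the second form avoids the dtype-inference warning map_partitions can raise:

    import pandas as pd
    import dask.dataframe as dd

    # Toy frame standing in for the poster's data
    pdf = pd.DataFrame({"time": ["2023-01-01", "2023-01-02"], "x": [1, 2]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Option 1: dask's own to_datetime
    ddf["time"] = dd.to_datetime(ddf["time"])

    # Option 2: map_partitions with an explicit meta
    ddf["time"] = ddf["time"].map_partitions(pd.to_datetime,
                                             meta=("time", "datetime64[ns]"))
    print(ddf.dtypes)
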
19
votes
2 answers

How to specify metadata for dask.dataframe

The docs provide good examples of how metadata can be provided. However I still feel unsure when it comes to picking the right dtypes for my dataframe. Could I do something like meta={'x': int, 'y': float, 'z': float} instead of meta={'x': 'i8', 'y':…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
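
A hedged sketch of the forms meta accepts: a dict of column name to dtype, where plain Python/NumPy types work as well as strings like 'i8'/'f8', or an empty pandas DataFrame carrying the desired dtypes:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": [1, 2], "y": [0.1, 0.2], "z": [0.3, 0.4]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    def shift_z(df):                      # toy partition-wise function
        return df.assign(z=df["z"] + 1)

    out1 = ddf.map_partitions(shift_z, meta={"x": "i8", "y": "f8", "z": "f8"})
    out2 = ddf.map_partitions(shift_z, meta={"x": int, "y": float, "z": np.float64})
    out3 = ddf.map_partitions(shift_z, meta=pdf.iloc[:0])   # empty frame with dtypes
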
19
votes
3 answers

Speeding up reading of very large netcdf file in python

I have a very large netCDF file that I am reading using netCDF4 in Python. I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to be in memory at once. The 1st dimension represents time, and…
user308827
  • 21,227
  • 87
  • 254
  • 417
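
One common pattern for the question above, sketched with xarray (which wraps netCDF4 and backs variables with dask arrays when chunks= is given); the file name, variable name, and chunk size are placeholders:

    import xarray as xr

    # chunks= makes the 1200 x 720 x 1440 variable lazy instead of loading it whole.
    ds = xr.open_dataset("huge_file.nc", chunks={"time": 100})

    # Operations only build a task graph; compute() streams through the file.
    mean_over_time = ds["precip"].mean(dim="time")      # "precip" is a placeholder
    result = mean_over_time.compute()
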
18
votes
1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot
  • 1,275
  • 1
  • 10
  • 19
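
A sketch of one way to pin the types when writing with pyarrow (an assumption about the poster's engine): declare an explicit Arrow schema so an all-null column keeps its intended type instead of being written as null:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "maybe_text": [None, None, None]})

    schema = pa.schema([
        ("id", pa.int64()),
        ("maybe_text", pa.string()),   # keep string even though every value is null
    ])

    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "typed.parquet")
    print(pq.read_schema("typed.parquet"))
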
18
votes
1 answer

How to set up logging on dask distributed workers?

After upgrading dask distributed to version 1.15.0, my logging stopped working. I've used logging.config.dictConfig to initialize Python logging facilities, and previously these settings propagated to all workers. But after the upgrade it doesn't work…
Alexander Reshytko
  • 2,126
  • 1
  • 20
  • 28
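
A sketch of one workaround rather than a definitive fix: because each worker is a separate process, push the logging configuration to all of them with Client.run after they start. The config dict here is a minimal placeholder:

    import logging.config
    from dask.distributed import Client

    LOGGING = {
        "version": 1,
        "handlers": {"console": {"class": "logging.StreamHandler"}},
        "root": {"handlers": ["console"], "level": "INFO"},
    }

    def init_logging(config=LOGGING):
        # runs once in every worker process
        logging.config.dictConfig(config)

    client = Client()              # or Client("scheduler-address:8786")
    client.run(init_logging)       # apply the config on all current workers
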
18
votes
1 answer

How to specify the number of threads/processes for the default dask scheduler

Is there a way to limit the number of cores used by the default threaded scheduler (default when using dask dataframes)? With compute, you can specify it by using: df.compute(get=dask.threaded.get, num_workers=20) But I was wondering if there is a…
joris
  • 133,120
  • 36
  • 247
  • 202
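
A sketch of the current knobs (the get=/num_workers= call in the excerpt is from an older dask API): set the scheduler and its pool globally with dask.config.set, or pass the options per compute() call:

    import dask
    import dask.dataframe as dd
    import pandas as pd
    from multiprocessing.pool import ThreadPool

    ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=10)

    # Global (or scoped) default: threaded scheduler drawing on a 20-thread pool
    with dask.config.set(scheduler="threads", pool=ThreadPool(20)):
        print(ddf.a.sum().compute())

    # Per call, without touching global configuration
    print(ddf.a.sum().compute(scheduler="threads", num_workers=20))
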
18
votes
2 answers

Error with OMP_NUM_THREADS when using dask distributed

I am using distributed, a framework to allow parallel computation. In this, my primary use case is with NumPy. When I include NumPy code that relies on np.linalg, I get an error with OMP_NUM_THREADS, which is related to the OpenMP library. An…
Scott
  • 2,568
  • 1
  • 27
  • 39
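
A sketch of the usual mitigation, assuming the error comes from an OpenMP-backed BLAS inside the workers: pin BLAS to one thread per worker so it does not oversubscribe the cores distributed is already managing:

    import os
    os.environ["OMP_NUM_THREADS"] = "1"     # must be set before NumPy/BLAS loads in the workers

    import numpy as np
    from dask.distributed import Client, LocalCluster

    with LocalCluster(n_workers=4, threads_per_worker=1) as cluster, Client(cluster) as client:
        future = client.submit(np.linalg.svd, np.random.random((500, 500)))
        print(future.result()[1][:5])        # first few singular values

The same effect can be had by exporting OMP_NUM_THREADS=1 in the environment that launches the workers.
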
17
votes
1 answer

How do I stop a running task in Dask?

When using Dask's distributed scheduler I have a task that is running on a remote worker that I want to stop. How do I stop it? I know about the cancel() method, but this doesn't seem to work if the task has already started executing.
MRocklin
  • 55,641
  • 23
  • 163
  • 235
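
Future.cancel() only prevents tasks that have not started yet. A commonly suggested pattern, sketched here rather than taken from the accepted answer, is to make long-running tasks poll a shared distributed.Variable and exit early when it is flipped:

    import time
    from dask.distributed import Client, Variable

    def long_task(stop):
        # `stop` is a distributed.Variable shared between client and workers
        for i in range(1_000):
            if stop.get():                  # cooperative cancellation point
                return f"stopped at step {i}"
            time.sleep(0.1)                 # stand-in for real work
        return "finished"

    client = Client()
    stop = Variable("stop-flag")
    stop.set(False)
    future = client.submit(long_task, stop)
    time.sleep(1)
    stop.set(True)                          # ask the running task to wind down
    print(future.result())
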
17
votes
4 answers

Simple way to Dask concatenate (horizontal, axis=1, columns)

Action: reading two csv files (data.csv and label.csv) into a single dataframe. df = dd.read_csv(data_files, delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b']) df_label = dd.read_csv(label_files, delimiter=' ', header=None,…
Tom Hemmes
  • 2,000
  • 2
  • 17
  • 23
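
A sketch of the direct spelling for the question above. One hedge: axis=1 concatenation assumes the two collections' partitions line up (same files read the same way, or known matching divisions), otherwise dask will warn that it cannot verify the alignment:

    import dask.dataframe as dd

    # Same read pattern as in the question; file names are placeholders
    df = dd.read_csv("data.csv", delimiter=" ", header=None,
                     names=["x", "y", "z", "intensity", "r", "g", "b"])
    df_label = dd.read_csv("label.csv", delimiter=" ", header=None,
                           names=["label"])

    # Horizontal (column-wise) concatenation
    combined = dd.concat([df, df_label], axis=1)
    print(combined.columns)
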
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a csv (tab-separated) file, and I successfully read them on a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
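
A sketch of the direct route, with the path and engine as assumptions: to_parquet writes one file per partition, so the frame is streamed out without ever being collected in one place:

    import dask.dataframe as dd

    # Re-read the tab-separated source lazily instead of holding it in RAM
    ddf = dd.read_csv("huge_table.tsv", sep="\t", blocksize="256MB")

    # Each partition becomes its own file under out_parquet/
    ddf.to_parquet("out_parquet/", engine="pyarrow", write_index=False)
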
16
votes
2 answers

Efficient way to read 15 M lines csv files in python

For my application, I need to read multiple files with 15 M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and…
Gabriel Dante
  • 392
  • 1
  • 12
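
A hedged sketch of the dask route for the question above: read every file lazily with explicit dtypes, filter or transform per partition, and write HDF5 out partition by partition so memory stays bounded. File names, column names, and dtypes are placeholders:

    import dask.dataframe as dd

    # The glob matches all of the 15M-line files; explicit dtypes skip costly inference
    ddf = dd.read_csv("input_*.csv", blocksize="128MB",
                      dtype={"id": "int64", "value": "float64"})

    ddf = ddf[ddf["value"] > 0]          # placeholder processing step

    # One HDF5 file per partition (the * is expanded per partition)
    ddf.to_hdf("out_*.h5", key="/data")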