Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like Numpy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers; a minimal sketch of both pieces follows this list.
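
A minimal sketch of both components, assuming only that dask is installed; the CSV pattern and column names in the second half are placeholders:

    import dask
    import dask.dataframe as dd

    # Dynamic task scheduling: build a small task graph lazily, then run it.
    inc = dask.delayed(lambda x: x + 1)
    total = dask.delayed(sum)([inc(i) for i in range(10)])
    print(total.compute())  # executes the graph on the default scheduler

    # "Big Data" collection: a dask DataFrame mirrors the pandas interface but
    # is backed by many pandas DataFrames and the same task schedulers.
    ddf = dd.read_csv("data-*.csv")                     # placeholder file pattern
    print(ddf.groupby("key").value.mean().compute())    # placeholder column names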

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
19
votes
2 answers

What is the "right" way to close a Dask LocalCluster?

I am trying to use dask-distributed on my laptop using a LocalCluster, but I still have not found a way to let my application close without raising some warnings or triggering some strange interactions with matplotlib (I am using the tkAgg…
SteP
  • 262
  • 1
  • 2
  • 9
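
One way to avoid the shutdown warnings, sketched here as a common pattern rather than the canonical answer, is to manage the cluster and client as context managers (or call close() on both, client first) so everything is torn down before the interpreter exits:

    from dask.distributed import Client, LocalCluster

    # The context managers guarantee client.close() and cluster.close()
    # run in the right order before the process (and matplotlib) shut down.
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            future = client.submit(sum, range(10))
            print(future.result())
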
19
votes
1 answer

How to use all the cpu cores using Dask?

I have a pandas series with more than 35000 rows. I want to use dask to make it more efficient. However, both the dask code and the pandas code are taking the same time. Initially "ser" is a pandas series and fun1 and fun2 are basic functions…
ANKIT JHA
  • 359
  • 1
  • 3
  • 9
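
A sketch of one pattern for the question above: wrap the series in a dask collection with several partitions and compute on the multiprocessing scheduler, so CPU-bound pure-Python functions are not serialized by the GIL. fun1 is a stand-in for the poster's function:

    import pandas as pd
    import dask.dataframe as dd

    def fun1(x):                                  # placeholder for the real work
        return x * 2 + 1

    ser = pd.Series(range(35_000))
    dser = dd.from_pandas(ser, npartitions=8)     # split across 8 partitions
    result = (dser.map(fun1, meta=("x", "int64"))
                  .compute(scheduler="processes"))  # one task per partition, run in processes

For a series this small, scheduling and serialization overhead can easily outweigh the gain, which is often why the dask and pandas versions take the same time.
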
19
votes
2 answers

What is the role of npartitions in a Dask dataframe?

I see the parameter npartitions in many functions, but I don't understand what it is good for / used for. http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_csv head(...) Elements are only taken from the first npartitions, with…
Martin Thoma
  • 124,992
  • 159
  • 614
  • 958
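
A short illustration of what npartitions means: it is the number of pandas DataFrames that back the dask DataFrame, and methods such as head() only look at the first npartitions of them by default:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"a": range(100)})
    ddf = dd.from_pandas(pdf, npartitions=4)    # 4 underlying pandas DataFrames

    print(ddf.npartitions)                      # -> 4
    print(ddf.map_partitions(len).compute())    # rows held by each partition
    print(ddf.head(10, npartitions=1))          # reads only the first partition
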
19
votes
3 answers

Nested data in Parquet with Python

I have a file that has one JSON per line. Here is a sample: { "product": { "id": "abcdef", "price": 19.99, "specs": { "voltage": "110v", "color": "white" } }, "user": "Daniel…
Daniel Severo
  • 1,768
  • 2
  • 15
  • 22
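
One possible approach for the question above, sketched rather than taken from the accepted answer: pyarrow can read newline-delimited JSON into an Arrow table whose nested objects become struct columns, and Parquet stores those natively. File names are placeholders:

    import pyarrow.json as pa_json
    import pyarrow.parquet as pq

    # Each line of records.jsonl is one JSON object like the sample above;
    # nested objects such as "specs" become Arrow struct columns.
    table = pa_json.read_json("records.jsonl")
    pq.write_table(table, "records.parquet")

    # Reading back shows the nesting is preserved in the schema.
    print(pq.read_table("records.parquet").schema)
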
19
votes
5 answers

How to convert a dask dataframe column with to_datetime

I am trying to convert one column of my dataframe to datetime. Following the discussion here https://github.com/dask/dask/issues/863 I tried the following code: import dask.dataframe as dd df['time'].map_partitions(pd.to_datetime,…
dleal
  • 2,244
  • 6
  • 27
  • 49
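
A sketch of two spellings that work in current dask, assuming the column is called 'time'; the explicit meta in the second form avoids the dtype-inference warning map_partitions can raise:

    import pandas as pd
    import dask.dataframe as dd

    # Toy frame standing in for the poster's data
    pdf = pd.DataFrame({"time": ["2023-01-01", "2023-01-02"], "x": [1, 2]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # Option 1: dask's own to_datetime
    ddf["time"] = dd.to_datetime(ddf["time"])

    # Option 2: map_partitions with an explicit meta
    ddf["time"] = ddf["time"].map_partitions(pd.to_datetime,
                                             meta=("time", "datetime64[ns]"))
    print(ddf.dtypes)
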
19
votes
2 answers

How to specify metadata for dask.dataframe

The docs provide good examples of how metadata can be provided. However I still feel unsure when it comes to picking the right dtypes for my dataframe. Could I do something like meta={'x': int, 'y': float, 'z': float} instead of meta={'x': 'i8', 'y':…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
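
A hedged sketch of the forms meta accepts: a dict of column name to dtype, where plain Python/NumPy types work as well as strings like 'i8'/'f8', or an empty pandas DataFrame carrying the desired dtypes:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": [1, 2], "y": [0.1, 0.2], "z": [0.3, 0.4]})
    ddf = dd.from_pandas(pdf, npartitions=1)

    def shift_z(df):                      # toy partition-wise function
        return df.assign(z=df["z"] + 1)

    out1 = ddf.map_partitions(shift_z, meta={"x": "i8", "y": "f8", "z": "f8"})
    out2 = ddf.map_partitions(shift_z, meta={"x": int, "y": float, "z": np.float64})
    out3 = ddf.map_partitions(shift_z, meta=pdf.iloc[:0])   # empty frame with dtypes
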
19
votes
3 answers

Speeding up reading of very large netcdf file in python

I have a very large netCDF file that I am reading using netCDF4 in Python. I cannot read this file all at once since its dimensions (1200 x 720 x 1440) are too big for the entire file to be in memory at once. The 1st dimension represents time, and…
user308827
  • 21,227
  • 87
  • 254
  • 417
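
One common pattern for the question above, sketched with xarray (which wraps netCDF4 and backs variables with dask arrays when chunks= is given); the file name, variable name, and chunk size are placeholders:

    import xarray as xr

    # chunks= makes the 1200 x 720 x 1440 variable lazy instead of loading it whole.
    ds = xr.open_dataset("huge_file.nc", chunks={"time": 100})

    # Operations only build a task graph; compute() streams through the file.
    mean_over_time = ds["precip"].mean(dim="time")      # "precip" is a placeholder
    result = mean_over_time.compute()
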
18
votes
1 answer

How to force parquet dtypes when saving pd.DataFrame?

Is there a way to force a parquet file to encode a pd.DataFrame column as a given type, even though all values for the column are null? The fact that parquet automatically assigns "null" in its schema is preventing me from loading many files into a…
HugoMailhot
  • 1,275
  • 1
  • 10
  • 19
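
A sketch of one way to pin the types when writing with pyarrow (an assumption about the poster's engine): declare an explicit Arrow schema so an all-null column keeps its intended type instead of being written as null:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"id": [1, 2, 3], "maybe_text": [None, None, None]})

    schema = pa.schema([
        ("id", pa.int64()),
        ("maybe_text", pa.string()),   # keep string even though every value is null
    ])

    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "typed.parquet")
    print(pq.read_schema("typed.parquet"))
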
18
votes
1 answer

How to set up logging on dask distributed workers?

After upgrading dask distributed to version 1.15.0, my logging stopped working. I've used logging.config.dictConfig to initialize Python logging facilities, and previously these settings propagated to all workers. But after the upgrade it doesn't work…
Alexander Reshytko
  • 2,126
  • 1
  • 20
  • 28
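
A sketch of one workaround rather than a definitive fix: because each worker is a separate process, push the logging configuration to all of them with Client.run after they start. The config dict here is a minimal placeholder:

    import logging.config
    from dask.distributed import Client

    LOGGING = {
        "version": 1,
        "handlers": {"console": {"class": "logging.StreamHandler"}},
        "root": {"handlers": ["console"], "level": "INFO"},
    }

    def init_logging(config=LOGGING):
        # runs once in every worker process
        logging.config.dictConfig(config)

    client = Client()              # or Client("scheduler-address:8786")
    client.run(init_logging)       # apply the config on all current workers
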
18
votes
1 answer

How to specify the number of threads/processes for the default dask scheduler

Is there a way to limit the number of cores used by the default threaded scheduler (default when using dask dataframes)? With compute, you can specify it by using: df.compute(get=dask.threaded.get, num_workers=20) But I was wondering if there is a…
joris
  • 133,120
  • 36
  • 247
  • 202
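
A sketch of the current knobs (the get=/num_workers= call in the excerpt is from an older dask API): set the scheduler and its pool globally with dask.config.set, or pass the options per compute() call:

    import dask
    import dask.dataframe as dd
    import pandas as pd
    from multiprocessing.pool import ThreadPool

    ddf = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=10)

    # Global (or scoped) default: threaded scheduler drawing on a 20-thread pool
    with dask.config.set(scheduler="threads", pool=ThreadPool(20)):
        print(ddf.a.sum().compute())

    # Per call, without touching global configuration
    print(ddf.a.sum().compute(scheduler="threads", num_workers=20))
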
18
votes
2 answers

Error with OMP_NUM_THREADS when using dask distributed

I am using distributed, a framework to allow parallel computation. In this, my primary use case is with NumPy. When I include NumPy code that relies on np.linalg, I get an error with OMP_NUM_THREADS, which is related to the OpenMP library. An…
Scott
  • 2,568
  • 1
  • 27
  • 39
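
A sketch of the usual mitigation, assuming the error comes from an OpenMP-backed BLAS inside the workers: pin BLAS to one thread per worker so it does not oversubscribe the cores distributed is already managing:

    import os
    os.environ["OMP_NUM_THREADS"] = "1"     # must be set before NumPy/BLAS loads in the workers

    import numpy as np
    from dask.distributed import Client, LocalCluster

    with LocalCluster(n_workers=4, threads_per_worker=1) as cluster, Client(cluster) as client:
        future = client.submit(np.linalg.svd, np.random.random((500, 500)))
        print(future.result()[1][:5])        # first few singular values

The same effect can be had by exporting OMP_NUM_THREADS=1 in the environment that launches the workers.
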
17
votes
1 answer

How do I stop a running task in Dask?

When using Dask's distributed scheduler I have a task that is running on a remote worker that I want to stop. How do I stop it? I know about the cancel() method, but this doesn't seem to work if the task has already started executing.
MRocklin
  • 55,641
  • 23
  • 163
  • 235
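
Future.cancel() only prevents tasks that have not started yet. A commonly suggested pattern, sketched here rather than taken from the accepted answer, is to make long-running tasks poll a shared distributed.Variable and exit early when it is flipped:

    import time
    from dask.distributed import Client, Variable

    def long_task(stop):
        # `stop` is a distributed.Variable shared between client and workers
        for i in range(1_000):
            if stop.get():                  # cooperative cancellation point
                return f"stopped at step {i}"
            time.sleep(0.1)                 # stand-in for real work
        return "finished"

    client = Client()
    stop = Variable("stop-flag")
    stop.set(False)
    future = client.submit(long_task, stop)
    time.sleep(1)
    stop.set(True)                          # ask the running task to wind down
    print(future.result())
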
17
votes
4 answers

Simple way to Dask concatenate (horizontal, axis=1, columns)

Action: reading two csv files (data.csv and label.csv) into a single dataframe. df = dd.read_csv(data_files, delimiter=' ', header=None, names=['x', 'y', 'z', 'intensity', 'r', 'g', 'b']) df_label = dd.read_csv(label_files, delimiter=' ', header=None,…
Tom Hemmes
  • 2,000
  • 2
  • 17
  • 23
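
A sketch of the direct spelling for the question above. One hedge: axis=1 concatenation assumes the two collections' partitions line up (same files read the same way, or known matching divisions), otherwise dask will warn that it cannot verify the alignment:

    import dask.dataframe as dd

    # Same read pattern as in the question; file names are placeholders
    df = dd.read_csv("data.csv", delimiter=" ", header=None,
                     names=["x", "y", "z", "intensity", "r", "g", "b"])
    df_label = dd.read_csv("label.csv", delimiter=" ", header=None,
                           names=["label"])

    # Horizontal (column-wise) concatenation
    combined = dd.concat([df, df_label], axis=1)
    print(combined.columns)
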
17
votes
1 answer

Is saving a HUGE dask dataframe into parquet possible?

I have a dataframe made up of 100,000+ rows, and each row has 100,000 columns, totaling 10,000,000,000 float values. I previously managed to read them in from a csv (tab-separated) file, and I successfully read them on a 50-core Xeon machine with…
alvas
  • 115,346
  • 109
  • 446
  • 738
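
A sketch of the direct route, with the path and engine as assumptions: to_parquet writes one file per partition, so the frame is streamed out without ever being collected in one place:

    import dask.dataframe as dd

    # Re-read the tab-separated source lazily instead of holding it in RAM
    ddf = dd.read_csv("huge_table.tsv", sep="\t", blocksize="256MB")

    # Each partition becomes its own file under out_parquet/
    ddf.to_parquet("out_parquet/", engine="pyarrow", write_index=False)
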
16
votes
2 answers

Efficient way to read 15 M lines csv files in python

For my application, I need to read multiple files with 15 M lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and…
Gabriel Dante
  • 392
  • 1
  • 12
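
A hedged sketch of the dask route for the question above: read every file lazily with explicit dtypes, filter or transform per partition, and write HDF5 out partition by partition so memory stays bounded. File names, column names, and dtypes are placeholders:

    import dask.dataframe as dd

    # The glob matches all of the 15M-line files; explicit dtypes skip costly inference
    ddf = dd.read_csv("input_*.csv", blocksize="128MB",
                      dtype={"id": "int64", "value": "float64"})

    ddf = ddf[ddf["value"] > 0]          # placeholder processing step

    # One HDF5 file per partition (the * is expanded per partition)
    ddf.to_hdf("out_*.h5", key="/data")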