Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
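The task-scheduling idea in the first bullet can be illustrated with a toy example. Dask represents a computation as a graph: a dictionary mapping keys to tasks, where a task is a tuple whose first element is a callable. The sketch below is not Dask's scheduler; `get` here is a hypothetical, serial, pure-Python stand-in that only shows how such a graph could be resolved. A real scheduler would run independent tasks in parallel.

```python
from operator import add, mul


def get(graph, key):
    """Toy, serial evaluator for a Dask-style task graph (illustration only)."""
    cache = {}  # each key is computed at most once

    def resolve(value):
        # A string that names a graph entry is a dependency: compute it first.
        if isinstance(value, str) and value in graph:
            if value not in cache:
                cache[value] = resolve(graph[value])
            return cache[value]
        # A tuple starting with a callable is a task: (func, arg1, arg2, ...).
        if isinstance(value, tuple) and value and callable(value[0]):
            func, *args = value
            return func(*[resolve(arg) for arg in args])
        # Anything else is a literal value.
        return value

    return resolve(key)


# Hypothetical graph: keys and values are made up for this example.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # depends on "x" and "y"
    "result": (mul, "sum", 10),  # depends on "sum"
}

result = get(graph, "result")
print(result)  # 30
```

The parallel collections in the second bullet build graphs of this shape and hand them to a scheduler.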

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
8
votes
2 answers

combining tqdm with delayed execution with dask in python

tqdm and dask are both amazing packages for iteration in Python. tqdm provides the progress bar, dask provides the multithreaded platform, and both can make the iteration process less frustrating. Yet - I'm having troubles to…
Dimgold
  • 2,748
  • 5
  • 26
  • 49
8
votes
1 answer

Slicing a Dask Dataframe

I have the following code where I would like to do a train/test split on a Dask dataframe: df = dd.read_csv(csv_filename, sep=',', encoding="latin-1", names=cols, header=0, dtype='str') But when I try to do slices like for train,…
Zubair Ahmed
  • 725
  • 8
  • 15
8
votes
1 answer

Dask DataFrame aggregate to median

I am trying to aggregate a dask dataframe to a set of metrics, including median, but it looks like median is not supported. Is there any way to aggregate and get the median? st_agg = df.groupby(['start station id', 'end station…
Philipp_Kats
  • 3,872
  • 3
  • 27
  • 44
8
votes
2 answers

Python Dask - vertical concatenation of 2 DataFrames

I am trying to vertically concatenate two Dask DataFrames. I have the following Dask DataFrame: d = [ ['A','B','C','D','E','F'], [1, 4, 8, 1, 3, 5], [6, 6, 2, 2, 0, 0], [9, 4, 5, 0, 6, 35], [0, 1, 7, 10, 9, 4], [0, 7, 2, 6, 1,…
edesz
  • 11,756
  • 22
  • 75
  • 123
8
votes
1 answer

Dask, create a dataframe from several dask arrays

Suppose I have a set of dask arrays such as: c1 = da.from_array(np.arange(100000, 190000), chunks=1000) c2 = da.from_array(np.arange(200000, 290000), chunks=1000) c3 = da.from_array(np.arange(300000, 390000), chunks=1000) is it possible to create a…
Jason Solack
  • 93
  • 1
  • 5
8
votes
2 answers

How to apply a function to a dask dataframe and return multiple values?

In pandas, I use the typical pattern below to apply a vectorized function to a df and return multiple values. This is really only necessary when the said function produces multiple independent outputs from a single task. See my overly trivial…
Jasper1918
  • 81
  • 1
  • 2
8
votes
1 answer

does npartitions influence the result of dask.dataframe.head()?

When running the following code, the result of dask.dataframe.head() depends on npartitions: import dask.dataframe as dd import pandas as pd df = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]}) ddf = dd.from_pandas(df, npartitions =…
Arco Bast
  • 3,595
  • 2
  • 26
  • 53
8
votes
1 answer

dask df.col.unique() vs df.col.drop_duplicates()

In dask, what is the difference between df.col.unique() and df.col.drop_duplicates()? Both return a series containing the unique elements of df.col. There is a difference in the index: the unique result is indexed by 1..N while drop_duplicates indexed…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
8
votes
1 answer

dask computation not executing in parallel

I have a directory of json files that I am trying to convert to a dask DataFrame and save it to castra. There are 200 files containing O(10**7) json records between them. The code is very simple largely following tutorial examples. import…
Daniel Mahler
  • 7,653
  • 5
  • 51
  • 90
8
votes
3 answers

dask bag not using all cores? alternatives?

I have a Python script which does the following: i. takes an input file of data (usually nested JSON format) ii. passes the data line by line to another function which manipulates the data into the desired format iii. and finally it writes the…
tamjd1
  • 876
  • 1
  • 10
  • 29
7
votes
6 answers

Keep indices in Pandas DataFrame with a certain number of non-NaN entries

Let's say I have the following dataframe: df1 = pd.DataFrame(data = [1,np.nan,np.nan,1,1,np.nan,1,1,1], columns = ['X'], index = ['a', 'a', 'a', 'b', 'b', 'b', …
hm8
  • 1,381
  • 3
  • 21
  • 41
7
votes
1 answer

How to execute a prefect Flow on a docker image?

My goal: I have a built docker image and want to run all my Flows on that image. Currently: I have the following task which is running on a Local Dask Executor. The server on which the agent is running is a different python environment from the one…
Newskooler
  • 3,973
  • 7
  • 46
  • 84
7
votes
3 answers

Using Dask's NEW to_sql for improved efficiency (memory/speed) or alternative to get data from dask dataframe into SQL Server Table

My ultimate goal is to use SQL/Python together for a project with too much data for pandas to handle (at least on my machine). So, I have gone with dask to: read in data from multiple sources (mostly SQL Server Tables/Views) manipulate/merge the…
David Erickson
  • 16,433
  • 2
  • 19
  • 35
7
votes
2 answers

Force dask to_parquet to write single file

When using dask.to_parquet(df, filename) a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a…
Christian
  • 372
  • 3
  • 13
7
votes
6 answers

DASK: TypeError: Column assignment doesn't support type numpy.ndarray whereas Pandas works fine

I'm using Dask to read in a 10m+ row CSV and perform some calculations. So far it's proving to be 10x faster than Pandas. I have a piece of code, below, that works fine with pandas but throws a type error with dask. I am unsure of how to…
anakaine
  • 1,188
  • 2
  • 14
  • 30