Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and scikit-learn.

Dask is composed of two components:

Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
“Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
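The two components described above can be illustrated with a minimal sketch, assuming dask and NumPy are installed. The function names `increment` and `add` and the chunk size of 250 are hypothetical choices made only for this illustration; `dask.delayed`, `dask.array.arange`, and `.compute()` are part of the Dask API.

```python
import dask
import dask.array as da


@dask.delayed
def increment(x):
    # Calling a delayed function only records a task; nothing runs yet.
    return x + 1


@dask.delayed
def add(x, y):
    return x + y


# 1. Dynamic task scheduling: build a small task graph, then execute it.
total = add(increment(1), increment(2))
task_result = total.compute()  # the scheduler runs the graph here

# 2. A "Big Data" collection: a chunked array with a NumPy-like interface.
#    The chunk size (250) is arbitrary and chosen only for illustration.
numbers = da.arange(1_000, chunks=250)
array_sum = int(numbers.sum().compute())

print(task_result, array_sum)
```

In both cases the work is deferred until `.compute()` is called, which is what lets the collections run on top of the same task schedulers.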

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

2 answers

combining tqdm with delayed execution with dask in python

tqdm and dask are both amazing packages for iteration in Python. While tqdm implements the needed progress bar, dask implements the multi-threading platform, and both can make the iteration process less frustrating. Yet I'm having trouble to…

python dask tqdm

asked Jun 11 '17 at 12:37

Dimgold

1 answer

Slicing a Dask Dataframe

I have the following code where I would like to do a train/test split on a Dask dataframe: df = dd.read_csv(csv_filename, sep=',', encoding="latin-1", names=cols, header=0, dtype='str'). But when I try to do slices like for train,…

python dataframe dask

asked Jun 10 '17 at 16:17

Zubair Ahmed

1 answer

Dask DataFrame aggregate to median

I am trying to aggregate a dask dataframe to a set of metrics, including median, but it looks like median is not supported. Is there any way to aggregate and get the median? st_agg = df.groupby(['start station id', 'end station…

python dask

asked May 05 '17 at 19:14

Philipp_Kats

2 answers

Python Dask - vertical concatenation of 2 DataFrames

I am trying to vertically concatenate two Dask DataFrames. I have the following Dask DataFrame: d = [ ['A','B','C','D','E','F'], [1, 4, 8, 1, 3, 5], [6, 6, 2, 2, 0, 0], [9, 4, 5, 0, 6, 35], [0, 1, 7, 10, 9, 4], [0, 7, 2, 6, 1,…

python-2.7 dataframe concatenation dask

asked May 05 '17 at 17:42

edesz

1 answer

Dask, create a dataframe from several dask arrays

Suppose I have a set of dask arrays such as: c1 = da.from_array(np.arange(100000, 190000), chunks=1000) c2 = da.from_array(np.arange(200000, 290000), chunks=1000) c3 = da.from_array(np.arange(300000, 390000), chunks=1000). Is it possible to create a…

python dask

asked Mar 28 '17 at 01:13

Jason Solack

2 answers

How to apply a function to a dask dataframe and return multiple values?

In pandas, I use the typical pattern below to apply a vectorized function to a df and return multiple values. This is really only necessary when the said function produces multiple independent outputs from a single task. See my overly trivial…

python pandas dask

asked Jan 18 '17 at 20:07

Jasper1918

1 answer

does npartitions influence the result of dask.dataframe.head()?

When running the following code, the result of dask.dataframe.head() depends on npartitions: import dask.dataframe as dd import pandas as pd df = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]}) ddf = dd.from_pandas(df, npartitions =…

python pandas dask

asked Jul 09 '16 at 03:58

Arco Bast

1 answer

dask df.col.unique() vs df.col.drop_duplicates()

In dask, what is the difference between df.col.unique() and df.col.drop_duplicates()? Both return a series containing the unique elements of df.col. There is a difference in the index: the unique result is indexed by 1..N while drop_duplicates is indexed…

dask

asked Mar 07 '16 at 06:12

Daniel Mahler

1 answer

dask computation not executing in parallel

I have a directory of JSON files that I am trying to convert to a dask DataFrame and save to castra. There are 200 files containing O(10**7) JSON records between them. The code is very simple, largely following tutorial examples. import…

python concurrency python-multiprocessing dask castra

asked Feb 19 '16 at 22:31

Daniel Mahler

3 answers

dask bag not using all cores? alternatives?

I have a Python script which does the following: i. takes an input file of data (usually in nested JSON format) ii. passes the data line by line to another function which manipulates the data into the desired format iii. and finally writes the…

python json parallel-processing export-to-csv dask

asked Dec 03 '15 at 19:09

tamjd1

6 answers

Keep indices in Pandas DataFrame with a certain number of non-NaN entries

Let's say I have the following dataframe: df1 = pd.DataFrame(data = [1,np.nan,np.nan,1,1,np.nan,1,1,1], columns = ['X'], index = ['a', 'a', 'a', 'b', 'b', 'b', …

python pandas dask

asked Jan 05 '21 at 00:50

hm8

1 answer

How to execute a prefect Flow on a docker image?

My goal: I have a built docker image and want to run all my Flows on that image. Currently: I have the following task which is running on a Local Dask Executor. The server on which the agent is running is a different python environment from the one…

docker etl dask docker-image prefect

asked Oct 07 '20 at 15:17

Newskooler

3 answers

Using Dask's NEW to_sql for improved efficiency (memory/speed) or alternative to get data from dask dataframe into SQL Server Table

My ultimate goal is to use SQL/Python together for a project with too much data for pandas to handle (at least on my machine). So, I have gone with dask to: read in data from multiple sources (mostly SQL Server Tables/Views); manipulate/merge the…

sql-server pandas sqlalchemy dask dask-to-sql

asked Jun 16 '20 at 08:44

David Erickson

2 answers

Force dask to_parquet to write single file

When using dask.to_parquet(df, filename) a subfolder filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use dask's to_parquet (without using compute() to create a…

python pandas dask parquet

asked Apr 08 '20 at 19:19

Christian

6 answers

DASK: TypeError: Column assignment doesn't support type numpy.ndarray whereas Pandas works fine

I'm using Dask to read in a 10M+ row CSV and perform some calculations. So far it's proving to be 10x faster than Pandas. I have a piece of code, below, that works fine with pandas but throws a type error with dask. I am unsure of how to…

python pandas numpy dask

asked Oct 06 '19 at 04:33

anakaine
