Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components (a minimal sketch follows the list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
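
A minimal sketch showing the two pieces together: a dataframe collection builds a task graph, and the scheduler executes it in parallel. The file path and column names below are placeholders.

    import dask.dataframe as dd

    # a larger-than-memory CSV set becomes many lazy pandas partitions
    df = dd.read_csv("data/*.csv")              # hypothetical files; nothing is read yet

    # operations only extend the task graph
    result = df.groupby("key")["value"].mean()  # placeholder column names

    # the scheduler runs the graph in parallel only when asked
    print(result.compute())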

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 0 answers

How to groupby or resample by a specific number of rows -- using Dask (Python)

I'm trying to downsample Dask dataframes by an arbitrary number of rows. For instance, if I were using datetimes as an index, I could just use: df = df.resample('1h').ohlc() But I don't want to resample by datetimes, I want to resample by a fixed number of…
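
One workaround, sketched under the assumption of a numeric 'price' column and a bucket size x: number the rows globally with a cumulative sum, integer-divide by x, and group on the result.

    x = 100                                  # desired rows per bucket (assumption)
    df["_row"] = 1
    df["_row"] = df["_row"].cumsum() - 1     # global row number across partitions
    df["_bucket"] = df["_row"] // x          # fixed-size group id
    # OHLC per fixed-size group, assuming a 'price' column
    ohlc = df.groupby("_bucket")["price"].agg(["first", "max", "min", "last"])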
3 votes, 3 answers

ModuleNotFoundError: No module named 'distributed'

I'm trying to run Dask on a cluster that uses SLURM. The client is successfully created and scaled; however, at the line "with joblib.parallel_backend('dask'):" the operation hits the worker timeout and I get the following error from the…
asked by rgswope
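
This usually means the distributed package is not importable in the environment the SLURM jobs launch into; the workers need the same environment as the client. A sketch of the typical setup, assuming dask-jobqueue and illustrative resources:

    # pip install dask distributed dask-jobqueue joblib
    # the same environment must be active on the SLURM compute nodes
    import joblib
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(cores=4, memory="8GB")  # illustrative resources
    cluster.scale(jobs=2)
    client = Client(cluster)

    with joblib.parallel_backend("dask"):
        pass  # scikit-learn / joblib work runs on the cluster here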
3 votes, 1 answer

Pandas dataframes too large to append to dask dataframe?

I'm not sure what I'm missing here; I thought Dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe but keep running into memory issues. I've already…
asked by jb4earth
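
One way to avoid loading everything at once, sketched with a hypothetical directory: wrap each pd.read_pickle in dask.delayed so partitions are only materialized one at a time.

    import dask
    import dask.dataframe as dd
    import pandas as pd
    from pathlib import Path

    @dask.delayed
    def load(path):
        return pd.read_pickle(path)

    paths = sorted(Path("pickles").glob("*.pickle"))  # hypothetical location
    ddf = dd.from_delayed([load(p) for p in paths])   # still lazy
    ddf.to_parquet("combined/")                       # written one partition at a time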
3 votes, 1 answer

Dask: Update published dataset periodically and pull data from other clients

I would like to append data to a published dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations. Would that be possible? Which append…
asked by gies0r
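
Publishing supports exactly this kind of sharing: every client connected to the same scheduler can fetch a dataset by name. A minimal sketch, with a placeholder scheduler address:

    from dask.distributed import Client

    # producer process: republish the latest data under a fixed name
    client = Client("tcp://scheduler:8786")   # hypothetical address
    client.datasets["latest"] = ddf           # shorthand for publish_dataset

    # consumer process (separate program, same scheduler)
    client = Client("tcp://scheduler:8786")
    latest = client.get_dataset("latest")     # poll e.g. once per second/minute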
3 votes, 1 answer

Dask equivalent to pandas.DataFrame.update

I have a few functions that use the pandas.DataFrame.update method, and I'm trying to move the datasets to Dask instead, but Dask's dataframe API doesn't implement the update method. Is there an alternative way to get the same…
asked by mlenthusiast
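
For frames that share the same index and columns, one near-equivalent is combine_first with the arguments flipped: keep the new frame's non-NaN values and fall back to the old ones. A sketch, not a drop-in replacement for every update edge case:

    # pandas: df.update(other) overwrites df with other's non-NA values.
    # with aligned dask dataframes, an out-of-place near-equivalent is:
    updated = other.combine_first(df)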
3 votes, 2 answers

Why does Dask's read_sql_table require an index_col parameter?

I'm trying to use read_sql_table from Dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column and I don't know what to give the index_col parameter. I read in the documentation that if…
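
index_col is what Dask uses to split the table into partitions, so it needs to be an indexed, sortable column; it does not have to be numeric if explicit boundaries are given. A sketch with a hypothetical table, URI, and divisions:

    import dask.dataframe as dd

    ddf = dd.read_sql_table(
        "my_table",                            # hypothetical table
        "postgresql://user:pass@host/db",      # hypothetical connection URI
        index_col="id",                        # any sorted, indexed column
        divisions=["a", "g", "n", "t", "z"],   # boundary values for a text key
    )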
3 votes, 1 answer

Use Dask to Drop Highly Correlated Pairwise Features in Dataframe?

Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise-correlated columns whose correlation exceeds a 0.99 threshold. I CAN'T use pandas' correlation function, as my dataset is too large and it eats up my…
asked by wildcat89
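
One way to keep the heavy part out of pandas, sketched with the question's 0.99 threshold: let dask.array build the correlation matrix, which is only n_features by n_features once computed.

    import numpy as np
    import dask.array as da

    X = ddf.to_dask_array(lengths=True)   # rows = samples, columns = features
    corr = da.corrcoef(X, rowvar=False)   # feature-by-feature correlations
    corr = np.abs(corr.compute())         # small: n_features x n_features
    upper = np.triu(corr, k=1)            # consider each pair only once
    to_drop = [c for j, c in enumerate(ddf.columns) if (upper[:, j] > 0.99).any()]
    reduced = ddf.drop(columns=to_drop)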
3 votes, 2 answers

Why is the compute() method slow for Dask dataframes but the head() method is fast?

So I'm a newbie when it comes to working with big data. I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right…
asked by Faisal
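
The short answer: head() reads only the first partition, while compute() materializes every partition into a single in-memory pandas dataframe. A sketch:

    import dask.dataframe as dd

    ddf = dd.read_csv("big.csv", blocksize="64MB")  # many lazy partitions
    ddf.head()       # parses roughly one partition: fast
    # ddf.compute()  # parses ALL partitions and concatenates them into one
    #                # pandas dataframe: slow, and it may not fit in RAM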
3 votes, 0 answers

dask-ml LinearRegression on multidimensional dask arrays

I am trying to perform multivariate linear regression on array data that is larger than memory. I am wondering how I should iterate a dask_ml linear regression function on a multidimensional dask array. On small enough data, I can use…
asked by TomNorway
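
dask_ml.linear_model.LinearRegression expects a 2-D (n_samples, n_features) array chunked only along rows, so a multidimensional array is usually flattened first. A rough sketch, assuming the last axis holds the features and the reshape is compatible with the chunking:

    from dask_ml.linear_model import LinearRegression

    X2 = X.reshape((-1, X.shape[-1])).rechunk({0: "auto", 1: -1})
    y2 = y.reshape((-1,))

    model = LinearRegression()
    model.fit(X2, y2)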
3 votes, 1 answer

Pandas between_time equivalent for Dask DataFrame

I have a Dask dataframe created with dd.read_csv("./*/file.csv") where the * glob is a folder for each date. In the concatenated dataframe I want to filter out subsets of time like how I would with a pd.between_time("09:30", "16:00"), say. Because…
asked by Sargera
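
Since every partition is a plain pandas dataframe, the usual workaround is map_partitions, which delegates to pandas' own between_time (this assumes the index is a DatetimeIndex):

    # filter each partition with pandas' between_time
    filtered = ddf.map_partitions(lambda part: part.between_time("09:30", "16:00"))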
3 votes, 1 answer

Parallelizing a Dask aggregation

Building on this post, I implemented the custom mode formula but have found performance issues with this function. Essentially, when I enter this aggregation, my cluster only uses one of my threads, which is not great for performance. I…
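
For context, the parallel pieces of a custom aggregation live in dd.Aggregation: chunk runs on every partition and agg combines the partial results, so parallelism comes from having multiple partitions. A trivial sketch of the structure (a custom sum, not the mode itself):

    import dask.dataframe as dd

    custom_sum = dd.Aggregation(
        name="custom_sum",
        chunk=lambda grouped: grouped.sum(),  # runs per partition, in parallel
        agg=lambda partials: partials.sum(),  # combines the partial results
    )
    out = ddf.groupby("key").agg(custom_sum)  # 'key' is a placeholder column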
3 votes, 2 answers

Large csv to parquet using Dask - OOM

I have 7 CSV files of 8 GB each and need to convert them to Parquet. Memory usage goes to 100 GB and I had to kill the process. I tried with distributed Dask as well; the memory is limited to 12 GB but no output is produced for a long time. FYI, I used to…
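
A sketch of the low-memory path, with placeholder paths and blocksize: keep the partitions small and write them out as they are produced, so no single worker holds much at once.

    import dask.dataframe as dd

    ddf = dd.read_csv("data/*.csv", blocksize="128MB")  # modest partitions
    ddf.to_parquet("out/")  # each partition is converted and written independently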
3 votes, 1 answer

How to select all rows from a Dask dataframe with value equal to minimal value of group

So I have the following dask dataframe grouped by the Problem column.

| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|---------|-------|---------------|---------------|------|
| A       | 7     | 2             | 15            | 23 … |
asked by Pieterism
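
A common pattern, sketched with the question's column names and assuming the minimum is taken over Cost: compute the per-group minimum, merge it back, and filter.

    mins = ddf.groupby("Problem")["Cost"].min().reset_index()
    mins = mins.rename(columns={"Cost": "min_cost"})
    out = ddf.merge(mins, on="Problem")
    out = out[out["Cost"] == out["min_cost"]].drop(columns="min_cost")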
3 votes, 0 answers

How to perform a set_index on a large dask dataframe and avoid workers being killed?

I understand that performing a set_index is really compute-intensive. The documentation says it's the kind of operation to avoid, or to perform right after ingesting the data if needed. I currently have parquet…
asked by DavidK
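
Two options that keep a set_index from overwhelming the workers, sketched with a placeholder column name: tell Dask the data is already sorted, or hand it precomputed divisions so the sampling pass is skipped.

    # if the parquet data is already sorted on this column, no shuffle is needed
    ddf = ddf.set_index("timestamp", sorted=True)

    # otherwise, explicit divisions (npartitions + 1 sorted boundary values,
    # precomputed elsewhere) avoid an extra pass over the data
    ddf = ddf.set_index("timestamp", divisions=divisions)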
3 votes, 1 answer

Computing multiple dask.dataframe.from_delayed() from one source

How can I compute .from_delayed() in parallel from one sequence of delayed?

    def foo():
        df1, df2 = ...  # prepare two pd.DataFrame() in one foo() call
        return df1, df2

    dds = [dask.delayed(foo)() for _ in range(5)]  # 5 delayed pairs (df1,…
asked by Ilya
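
One way, sketched here: declare that foo returns two outputs with nout=2, build two from_delayed frames that share the same underlying tasks, and compute them together so each foo() runs only once.

    import dask
    import dask.dataframe as dd

    pairs = [dask.delayed(foo, nout=2)() for _ in range(5)]
    ddf1 = dd.from_delayed([p[0] for p in pairs])
    ddf2 = dd.from_delayed([p[1] for p in pairs])

    # one compute call shares the graph, so each foo() executes exactly once
    df1, df2 = dask.compute(ddf1, ddf2)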