Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components (a minimal sketch follows the list):

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
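
A minimal sketch showing the two pieces together: a dataframe collection builds a task graph, and the scheduler executes it in parallel. The file path and column names below are placeholders.

    import dask.dataframe as dd

    # a larger-than-memory CSV set becomes many lazy pandas partitions
    df = dd.read_csv("data/*.csv")              # hypothetical files; nothing is read yet

    # operations only extend the task graph
    result = df.groupby("key")["value"].mean()  # placeholder column names

    # the scheduler runs the graph in parallel only when asked
    print(result.compute())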

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

GitHub (tutorial): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
3 votes, 0 answers

How to groupby or resample by a specific number of rows -- using Dask (Python)

I'm trying to downsample Dask dataframes by an arbitrary number of rows. For instance, if I were using datetimes as an index, I could just use: df = df.resample('1h').ohlc() But I don't want to resample by datetimes, I want to resample by a fixed number of…
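
One workaround, sketched under the assumption of a numeric 'price' column and a bucket size x: number the rows globally with a cumulative sum, integer-divide by x, and group on the result.

    x = 100                                  # desired rows per bucket (assumption)
    df["_row"] = 1
    df["_row"] = df["_row"].cumsum() - 1     # global row number across partitions
    df["_bucket"] = df["_row"] // x          # fixed-size group id
    # OHLC per fixed-size group, assuming a 'price' column
    ohlc = df.groupby("_bucket")["price"].agg(["first", "max", "min", "last"])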
3 votes, 3 answers

ModuleNotFoundError: No module named 'distributed'

I'm trying to run Dask on a cluster that uses SLURM. The client is successfully created and scaled; however, at the line "with joblib.parallel_backend('dask'):" the operation hits the worker timeout and I get the following error from the…
asked by rgswope
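
This usually means the distributed package is not importable in the environment the SLURM jobs launch into; the workers need the same environment as the client. A sketch of the typical setup, assuming dask-jobqueue and illustrative resources:

    # pip install dask distributed dask-jobqueue joblib
    # the same environment must be active on the SLURM compute nodes
    import joblib
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(cores=4, memory="8GB")  # illustrative resources
    cluster.scale(jobs=2)
    client = Client(cluster)

    with joblib.parallel_backend("dask"):
        pass  # scikit-learn / joblib work runs on the cluster here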
3 votes, 1 answer

Pandas dataframes too large to append to dask dataframe?

I'm not sure what I'm missing here; I thought Dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe but keep running into memory issues. I've already…
asked by jb4earth
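
One way to avoid loading everything at once, sketched with a hypothetical directory: wrap each pd.read_pickle in dask.delayed so partitions are only materialized one at a time.

    import dask
    import dask.dataframe as dd
    import pandas as pd
    from pathlib import Path

    @dask.delayed
    def load(path):
        return pd.read_pickle(path)

    paths = sorted(Path("pickles").glob("*.pickle"))  # hypothetical location
    ddf = dd.from_delayed([load(p) for p in paths])   # still lazy
    ddf.to_parquet("combined/")                       # written one partition at a time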
3 votes, 1 answer

Dask: Update published dataset periodically and pull data from other clients

I would like to append data to a published dask dataset from a queue (like Redis). Other Python programs would then be able to fetch the latest data (e.g. once per second/minute) and do some further operations. Would that be possible? Which append…
asked by gies0r
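
Publishing supports exactly this kind of sharing: every client connected to the same scheduler can fetch a dataset by name. A minimal sketch, with a placeholder scheduler address:

    from dask.distributed import Client

    # producer process: republish the latest data under a fixed name
    client = Client("tcp://scheduler:8786")   # hypothetical address
    client.datasets["latest"] = ddf           # shorthand for publish_dataset

    # consumer process (separate program, same scheduler)
    client = Client("tcp://scheduler:8786")
    latest = client.get_dataset("latest")     # poll e.g. once per second/minute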
3 votes, 1 answer

Dask equivalent to pandas.DataFrame.update

I have a few functions that use the pandas.DataFrame.update method, and I'm trying to move the datasets to Dask instead, but Dask's dataframe API doesn't implement the update method. Is there an alternative way to get the same…
asked by mlenthusiast
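
For frames that share the same index and columns, one near-equivalent is combine_first with the arguments flipped: keep the new frame's non-NaN values and fall back to the old ones. A sketch, not a drop-in replacement for every update edge case:

    # pandas: df.update(other) overwrites df with other's non-NA values.
    # with aligned dask dataframes, an out-of-place near-equivalent is:
    updated = other.combine_first(df)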
3 votes, 2 answers

Why does Dask's read_sql_table require an index_col parameter?

I'm trying to use read_sql_table from Dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column and I don't know what to give the index_col parameter. I read in the documentation that if…
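
index_col is what Dask uses to split the table into partitions, so it needs to be an indexed, sortable column; it does not have to be numeric if explicit boundaries are given. A sketch with a hypothetical table, URI, and divisions:

    import dask.dataframe as dd

    ddf = dd.read_sql_table(
        "my_table",                            # hypothetical table
        "postgresql://user:pass@host/db",      # hypothetical connection URI
        index_col="id",                        # any sorted, indexed column
        divisions=["a", "g", "n", "t", "z"],   # boundary values for a text key
    )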
3 votes, 1 answer

Use Dask to Drop Highly Correlated Pairwise Features in Dataframe?

Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise-correlated columns whose correlation exceeds a 0.99 threshold. I CAN'T use pandas' correlation function, as my dataset is too large and it eats up my…
asked by wildcat89
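
One way to keep the heavy part out of pandas, sketched with the question's 0.99 threshold: let dask.array build the correlation matrix, which is only n_features by n_features once computed.

    import numpy as np
    import dask.array as da

    X = ddf.to_dask_array(lengths=True)   # rows = samples, columns = features
    corr = da.corrcoef(X, rowvar=False)   # feature-by-feature correlations
    corr = np.abs(corr.compute())         # small: n_features x n_features
    upper = np.triu(corr, k=1)            # consider each pair only once
    to_drop = [c for j, c in enumerate(ddf.columns) if (upper[:, j] > 0.99).any()]
    reduced = ddf.drop(columns=to_drop)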
3 votes, 2 answers

Why is the compute() method slow for Dask dataframes but the head() method is fast?

So I'm a newbie when it comes to working with big data. I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right…
asked by Faisal
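
The short answer: head() reads only the first partition, while compute() materializes every partition into a single in-memory pandas dataframe. A sketch:

    import dask.dataframe as dd

    ddf = dd.read_csv("big.csv", blocksize="64MB")  # many lazy partitions
    ddf.head()       # parses roughly one partition: fast
    # ddf.compute()  # parses ALL partitions and concatenates them into one
    #                # pandas dataframe: slow, and it may not fit in RAM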
3 votes, 0 answers

dask-ml LinearRegression on multidimensional dask arrays

I am trying to perform multivariate linear regression on array data that is larger than memory. I am wondering how I should iterate a dask_ml linear regression function on a multidimensional dask array. On small enough data, I can use…
asked by TomNorway
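
dask_ml.linear_model.LinearRegression expects a 2-D (n_samples, n_features) array chunked only along rows, so a multidimensional array is usually flattened first. A rough sketch, assuming the last axis holds the features and the reshape is compatible with the chunking:

    from dask_ml.linear_model import LinearRegression

    X2 = X.reshape((-1, X.shape[-1])).rechunk({0: "auto", 1: -1})
    y2 = y.reshape((-1,))

    model = LinearRegression()
    model.fit(X2, y2)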
3 votes, 1 answer

Pandas between_time equivalent for Dask DataFrame

I have a Dask dataframe created with dd.read_csv("./*/file.csv") where the * glob is a folder for each date. In the concatenated dataframe I want to filter out subsets of time like how I would with a pd.between_time("09:30", "16:00"), say. Because…
asked by Sargera
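
Since every partition is a plain pandas dataframe, the usual workaround is map_partitions, which delegates to pandas' own between_time (this assumes the index is a DatetimeIndex):

    # filter each partition with pandas' between_time
    filtered = ddf.map_partitions(lambda part: part.between_time("09:30", "16:00"))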
3 votes, 1 answer

Parallelizing a Dask aggregation

Building on this post, I implemented the custom mode formula but have found performance issues with this function. Essentially, when I enter this aggregation, my cluster only uses one of my threads, which is not great for performance. I…
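
For context, the parallel pieces of a custom aggregation live in dd.Aggregation: chunk runs on every partition and agg combines the partial results, so parallelism comes from having multiple partitions. A trivial sketch of the structure (a custom sum, not the mode itself):

    import dask.dataframe as dd

    custom_sum = dd.Aggregation(
        name="custom_sum",
        chunk=lambda grouped: grouped.sum(),  # runs per partition, in parallel
        agg=lambda partials: partials.sum(),  # combines the partial results
    )
    out = ddf.groupby("key").agg(custom_sum)  # 'key' is a placeholder column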
3 votes, 2 answers

Large csv to parquet using Dask - OOM

I have 7 CSV files of 8 GB each and need to convert them to Parquet. Memory usage goes to 100 GB and I had to kill the process. I tried with distributed Dask as well; the memory is limited to 12 GB but no output is produced for a long time. FYI, I used to…
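
A sketch of the low-memory path, with placeholder paths and blocksize: keep the partitions small and write them out as they are produced, so no single worker holds much at once.

    import dask.dataframe as dd

    ddf = dd.read_csv("data/*.csv", blocksize="128MB")  # modest partitions
    ddf.to_parquet("out/")  # each partition is converted and written independently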
3 votes, 1 answer

How to select all rows from a Dask dataframe with value equal to minimal value of group

So I have the following dask dataframe grouped by the Problem column.

| Problem | Items | Min_Dimension | Max_Dimension | Cost |
|---------|-------|---------------|---------------|------|
| A       | 7     | 2             | 15            | 23 … |
asked by Pieterism
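
A common pattern, sketched with the question's column names and assuming the minimum is taken over Cost: compute the per-group minimum, merge it back, and filter.

    mins = ddf.groupby("Problem")["Cost"].min().reset_index()
    mins = mins.rename(columns={"Cost": "min_cost"})
    out = ddf.merge(mins, on="Problem")
    out = out[out["Cost"] == out["min_cost"]].drop(columns="min_cost")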
3 votes, 0 answers

How to perform a set_index on a large dask dataframe and avoid workers being killed?

I understand that performing a set_index is really compute-intensive. The documentation says it's the kind of operation to avoid, or to perform right after ingesting the data if needed. I currently have parquet…
asked by DavidK
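
Two options that keep a set_index from overwhelming the workers, sketched with a placeholder column name: tell Dask the data is already sorted, or hand it precomputed divisions so the sampling pass is skipped.

    # if the parquet data is already sorted on this column, no shuffle is needed
    ddf = ddf.set_index("timestamp", sorted=True)

    # otherwise, explicit divisions (npartitions + 1 sorted boundary values,
    # precomputed elsewhere) avoid an extra pass over the data
    ddf = ddf.set_index("timestamp", divisions=divisions)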
3 votes, 1 answer

Computing multiple dask.dataframe.from_delayed() from one source

How can I compute .from_delayed() in parallel from one sequence of delayed?

    def foo():
        df1, df2 = ...  # prepare two pd.DataFrame() in one foo() call
        return df1, df2

    dds = [dask.delayed(foo)() for _ in range(5)]  # 5 delayed pairs (df1,…
asked by Ilya
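
One way, sketched here: declare that foo returns two outputs with nout=2, build two from_delayed frames that share the same underlying tasks, and compute them together so each foo() runs only once.

    import dask
    import dask.dataframe as dd

    pairs = [dask.delayed(foo, nout=2)() for _ in range(5)]
    ddf1 = dd.from_delayed([p[0] for p in pairs])
    ddf2 = dd.from_delayed([p[1] for p in pairs])

    # one compute call shares the graph, so each foo() executes exactly once
    df1, df2 = dask.compute(ddf1, ddf2)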