Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch below).
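
A minimal sketch of both pieces, assuming only dask and pandas are installed and using the default local scheduler:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # task scheduling: dask.delayed turns plain Python calls into a lazy graph
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())  # 15

    # collections: a dask DataFrame partitions a pandas DataFrame and runs
    # the familiar API on top of the same scheduler
    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
    print(ddf.x.sum().compute())  # 45
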

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6 votes · 2 answers

How to create unique index in Dask DataFrame?

Imagine I have a Dask DataFrame from read_csv or created another way. How can I make a unique index for the dask dataframe? Note: reset_index builds a monotonically ascending index in each partition. That means (0,1,2,3,4,5,... ) for Partition…
Spar • 463
6 votes · 4 answers

Effective-Date-Range One-Hot-Encode groupby

Starting with this sample data... import pandas as pd start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1], "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"], "end": ["Feb 1…
Chris Farr • 3,580
6 votes · 4 answers

How to speed up nested cross validation in python?

From what I've found there is one other question like this (Speed-up nested cross-validation); however, installing MPI does not work for me after trying several fixes also suggested on this site and by Microsoft, so I am hoping there is another package or…
DN1 • 234
6 votes · 5 answers

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution. I have a very large table in BigQuery (assume 20 million rows inserted per day). I want to have around 20 million rows of data with…
MT467 • 668
6 votes · 1 answer

module 'dask' has no attribute 'read_fwf'

I want to use dask.read_fwf(file), but I get the error AttributeError: module 'dask' has no attribute 'read_fwf'. The same problem occurs for read_csv and read_table. I have uninstalled and reinstalled dask, as well as trying to rename my 'csv.py'…
Phil • 129
6 votes · 1 answer

Dask Memory Error when running df.to_csv()

I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is: cluster = LocalCluster(n_workers=6,…
D.Griffiths • 2,248
6 votes · 1 answer

Parallel Sklearn Model Building with Dask or Joblib

I have a large set of sklearn pipelines that I'd like to build in parallel with Dask. Here's a simple but naive sequential approach: from sklearn.naive_bayes import MultinomialNB from sklearn.linear_model import LogisticRegression from…
slaw • 6,591
6 votes · 1 answer

Updating the values of a column in a dask dataframe based on some condition on some other columns

We have a very large CSV file which has been imported as a dask dataframe. I'll make a small example to explain the question. import dask.dataframe as dd df = dd.read_csv("name and path of the file.csv") df.head() output: col1 | col2 | col3 | col4 22 …
Monirrad • 465
6 votes · 1 answer

How can a dask worker access the total number of workers currently in the cluster?

My dask workers need to run init code that depends on the number of workers in the cluster. Can workers access such cluster metadata?
Randy Gelhausen • 125
6 votes · 1 answer

How to view Dask dashboard when running on a virtual machine?

Here is what I'm doing now: From my Windows laptop, I SSH into the Linux server via PuTTY (IP address 11.11.11.111). I start up a Jupyter notebook: nohup jupyter notebook --ip=0.0.0.0 --no-browser & The terminal output shows the Jupyter notebook is running…
Korean_Of_the_Mountain • 1,428
6 votes · 1 answer

Excessive memory usage when using dask dataframe created from parquet file

I have a parquet file that is 800K rows x 8.7K columns. I loaded it into a dask dataframe: import dask.dataframe as dd dask_train_df = dd.read_parquet('train.parquet') dask_train_df.info() This yields:
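
One hedged mitigation for such a wide file (column names hypothetical): select only the columns you need at read time, so no partition ever materializes all 8.7K columns.

    import dask.dataframe as dd

    cols = ["feature_1", "feature_2", "target"]  # hypothetical subset
    dask_train_df = dd.read_parquet("train.parquet", columns=cols)
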
6 votes · 1 answer

How to set index on categorical type?

Given this Dask DataFrame: Dask DataFrame Structure: date value symbol npartitions=2 object int64 category[known] ... ... ... ... Dask Name:…
Ghislain Viguier • 342
6 votes · 1 answer

Dask how to pivot DataFrame

I am using the code below but get an error after pivoting the DataFrame: dataframe: name day value time 0 MAC000002 2012-12-16 0.147 09:30:00 1 MAC000002 2012-12-16 0.110 10:00:00 2 MAC000002 2012-12-16 0.736 …
proximacentauri • 1,749
6 votes · 1 answer

ValueError: The columns in the computed data do not match the columns in the provided metadata

I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in Pandas, which is where dask comes in. Dask is fast but raises many errors. This is a snippet of the code, #drop some columns df =…
acacia • 1,375
6 votes · 1 answer

Multiple aggregation user defined functions in Dask dataframe

I'm processing a data set using Dask (considering it doesn't fit in memory) and I want to group the instances with a different aggregating function depending on the column and its type. Dask has a set of default aggregation functions for numerical…
GRoutar • 1,311