Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers (see the sketch below).
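
A minimal sketch of both pieces, assuming only dask and pandas are installed and using the default local scheduler:

    import dask
    import dask.dataframe as dd
    import pandas as pd

    # task scheduling: dask.delayed turns plain Python calls into a lazy graph
    @dask.delayed
    def inc(x):
        return x + 1

    total = dask.delayed(sum)([inc(i) for i in range(5)])
    print(total.compute())  # 15

    # collections: a dask DataFrame partitions a pandas DataFrame and runs
    # the familiar API on top of the same scheduler
    ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
    print(ddf.x.sum().compute())  # 45
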

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
6 votes · 2 answers

How to create unique index in Dask DataFrame?

Imagine I have a Dask DataFrame from read_csv or created another way. How can I make a unique index for the dask dataframe? Note: reset_index builds a monotonically ascending index in each partition. That means (0,1,2,3,4,5,... ) for Partition…
Spar • 463
6 votes · 4 answers

Effective-Date-Range One-Hot-Encode groupby

Starting with this sample data... import pandas as pd start_data = {"person_id": [1, 1, 1, 1, 2], "nid": [1, 2, 3, 4, 1], "beg": ["Jan 1 2018", "Jan 5 2018", "Jan 10 2018", "Feb 5 2018", "Jan 25 2018"], "end": ["Feb 1…
Chris Farr • 3,580
6 votes · 4 answers

How to speed up nested cross validation in python?

From what I've found there is one other question like this (Speed-up nested cross-validation); however, installing MPI does not work for me after trying several fixes also suggested on this site and by Microsoft, so I am hoping there is another package or…
DN1 • 234
6 votes · 5 answers

Load a huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution. I have a very large table in BigQuery (assume 20 million rows inserted per day). I want to have around 20 million rows of data with…
MT467 • 668
6 votes · 1 answer

module 'dask' has no attribute 'read_fwf'

I want to use dask.read_fwf(file), but I get the error AttributeError: module 'dask' has no attribute 'read_fwf'. The same problem occurs for read_csv and read_table. I have uninstalled and reinstalled dask, as well as trying to rename my 'csv.py'…
Phil • 129
6 votes · 1 answer

Dask Memory Error when running df.to_csv()

I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is: cluster = LocalCluster(n_workers=6,…
D.Griffiths • 2,248
6 votes · 1 answer

Parallel Sklearn Model Building with Dask or Joblib

I have a large set of sklearn pipelines that I'd like to build in parallel with Dask. Here's a simple but naive sequential approach: from sklearn.naive_bayes import MultinomialNB from sklearn.linear_model import LogisticRegression from…
slaw • 6,591
6 votes · 1 answer

Updating the values of a column in a dask dataframe based on some condition on some other columns

We have a very large CSV file which has been imported as a dask dataframe. I'll make a small example to explain the question. import dask.dataframe as dd df = dd.read_csv("name and path of the file.csv") df.head() output: col1 | col2 | col3 | col4 22 …
Monirrad • 465
6 votes · 1 answer

How can a dask worker access the total number of workers currently in the cluster?

My dask workers need to run init code that depends on the number of workers in the cluster. Can workers access such cluster metadata?
Randy Gelhausen • 125
6 votes · 1 answer

How to view Dask dashboard when running on a virtual machine?

Here is what I'm doing now: From my Windows laptop, I SSH into the Linux server via PuTTY (IP address 11.11.11.111). I start up a Jupyter notebook: nohup jupyter notebook --ip=0.0.0.0 --no-browser & The terminal output shows the Jupyter notebook is running…
Korean_Of_the_Mountain • 1,428
6 votes · 1 answer

Excessive memory usage when using dask dataframe created from parquet file

I have a parquet file that is 800K rows x 8.7K columns. I loaded it into a dask dataframe: import dask.dataframe as dd dask_train_df = dd.read_parquet('train.parquet') dask_train_df.info() This yields:
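
One hedged mitigation for such a wide file (column names hypothetical): select only the columns you need at read time, so no partition ever materializes all 8.7K columns.

    import dask.dataframe as dd

    cols = ["feature_1", "feature_2", "target"]  # hypothetical subset
    dask_train_df = dd.read_parquet("train.parquet", columns=cols)
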
6 votes · 1 answer

How to set index on categorical type?

Given this Dask DataFrame: Dask DataFrame Structure: date value symbol npartitions=2 object int64 category[known] ... ... ... ... Dask Name:…
Ghislain Viguier • 342
6 votes · 1 answer

Dask how to pivot DataFrame

I am using the code below but get an error after pivoting the DataFrame: dataframe: name day value time 0 MAC000002 2012-12-16 0.147 09:30:00 1 MAC000002 2012-12-16 0.110 10:00:00 2 MAC000002 2012-12-16 0.736 …
proximacentauri • 1,749
6 votes · 1 answer

ValueError: The columns in the computed data do not match the columns in the provided metadata

I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in Pandas, which is where dask comes in. Dask is fast but raises many errors. This is a snippet of the code, #drop some columns df =…
acacia • 1,375
6 votes · 1 answer

Multiple aggregation user defined functions in Dask dataframe

I'm processing a data set using Dask (considering it doesn't fit in memory) and I want to group the instances with a different aggregating function depending on the column and its type. Dask has a set of default aggregation functions for numerical…
GRoutar • 1,311