Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
16 votes · 2 answers

Comparison between Modin | Dask | Data.table | Pandas for parallel processing and out of memory csv files

What are the fundamental differences and primary use cases for Dask, Modin, and data.table? I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations
Shubham Samant · 171 · 1 · 5
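A minimal sketch of what these libraries have in common, using Dask as the example: the pandas-style API is kept, but evaluation is lazy, so a CSV larger than RAM can be aggregated chunk by chunk. File and column names here are hypothetical.

    import dask.dataframe as dd

    # lazily scan many CSV files; nothing is loaded yet
    df = dd.read_csv("huge-*.csv")

    # groupby builds a task graph; compute() runs it chunk by chunk
    result = df.groupby("key").value.mean().compute()
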
16 votes · 2 answers

Dask Dataframe: Get row count?

Simple question: I have a dataframe in dask containing about 300 million records. I need to know the exact number of rows that the dataframe contains. Is there an easy way to do this? When I try to run dataframe.x.count().compute() it looks like it…
usbToaster · 475 · 1 · 4 · 13
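For the row-count question above, two common idioms (a sketch, assuming the frame was loaded with dd.read_csv): len() on a dask dataframe triggers a computation, and map_partitions(len) counts rows per partition and sums them.

    import dask.dataframe as dd

    ddf = dd.read_csv("data-*.csv")  # hypothetical source

    # simplest: len() computes eagerly and returns the exact row count
    n = len(ddf)

    # equivalent: count each partition's rows, then sum lazily
    n = ddf.map_partitions(len).compute().sum()
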
16 votes · 1 answer

Slow len function on dask distributed dataframe

I have been testing how to use dask (a cluster with 20 cores) and I am surprised by the speed I get when calling a len function vs slicing through loc. import dask.dataframe as dd from dask.distributed import Client client =…
JuanPabloMF · 397 · 1 · 3 · 14
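The usual explanation for the question above is that len() on a distributed dataframe re-executes the whole loading graph unless the data has been materialized first. A sketch with persist(), assuming CSV input:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()                # connect to the scheduler
    ddf = dd.read_csv("data-*.csv")  # hypothetical source

    ddf = client.persist(ddf)        # load once into cluster memory
    n = len(ddf)                     # now only sums cached partition lengths
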
16 votes · 3 answers

How to read a compressed (gz) CSV file into a dask Dataframe?

Is there a way to read a .csv file that is compressed via gz into a dask dataframe? I've tried it directly with import dask.dataframe as dd df = dd.read_csv("Data.gz") but get a Unicode error (probably because it is interpreting the compressed…
Magellan88 · 2,543 · 3 · 24 · 36
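A sketch of the standard approach for the gzip question above: gzip is not a splittable format, so Dask must be told not to chunk the file; each .gz file then becomes exactly one partition.

    import dask.dataframe as dd

    # compression='gzip' decompresses on the fly;
    # blocksize=None is required because a gzip stream cannot be split
    df = dd.read_csv("Data.gz", compression="gzip", blocksize=None)
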
16 votes · 3 answers

Can I use functions imported from .py files in Dask/Distributed?

I have a question about serialization and imports. Should functions have their own imports, like I've seen done with PySpark? Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared…
Albert DeFusco · 163 · 1 · 4
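For the question above, dask.distributed can ship a local module to every worker with Client.upload_file, so mod.py does not need to be packaged. A sketch with a hypothetical scheduler address and function name:

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")       # hypothetical address
    client.upload_file("mod.py")                  # send the module to all workers

    import mod                                    # import locally as usual
    futures = client.map(mod.my_func, range(10))  # hypothetical function
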
15 votes · 2 answers

Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

I'm trying to use dask to read a CSV file, and it gives me an error like the one below. The thing is, I want my ARTICLE_ID to be object (string). Can anyone help me read the data successfully? The traceback is below: ValueError: Mismatched dtypes found in…
Coffey Liu · 327 · 1 · 3 · 6
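The usual fix for the mismatched-dtypes question above is to pin the ambiguous column's dtype up front, so every block is parsed the same way:

    import dask.dataframe as dd

    # force ARTICLE_ID to be read as a string in every block
    df = dd.read_csv("data.csv", dtype={"ARTICLE_ID": "object"})
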
15 votes · 4 answers

R equivalent of Python's dask

Is there an equivalent package in R to Python's dask? Specifically for running Machine Learning algorithms on larger-than-memory data sets on a single machine. Link to Python's Dask page: https://dask.pydata.org/en/latest/ From the Dask…
Adam Bricknell · 161 · 1 · 6
15 votes · 2 answers

Dask dataframe split partitions based on a column or function

I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this: customerKey productKey transactionKey grossSales netSales unitVolume …
Roger Thomas · 822 · 1 · 7 · 17
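For the partitioning question above, the standard tool is set_index, which shuffles the data so that all rows for a given key land in the same partition; per-key work can then run partition-locally. Column names follow the question; the file name is hypothetical.

    import dask.dataframe as dd

    ddf = dd.read_csv("sales-*.csv")    # hypothetical source
    ddf = ddf.set_index("customerKey")  # shuffle: a customer's rows share one partition

    # each partition is now a plain pandas frame for one key range
    result = ddf.map_partitions(
        lambda pdf: pdf.groupby("productKey").netSales.sum()
    )
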
15 votes · 1 answer

Why is dask read_csv from s3 keeping so much memory?

I'm reading in some gzipped data from s3, using dask (as a replacement for a SQL query). However, it looks like there is some caching of the data file, or of the unzipped file, somewhere that keeps it in system memory. NB this should be runnable; the test data…
jeremycg · 24,657 · 5 · 63 · 74
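Relevant background for the memory question above: gzipped objects are not splittable, so Dask decompresses each file whole into a single partition, and memory usage scales with the decompressed file size. A sketch contrasting the two cases (bucket paths hypothetical):

    import dask.dataframe as dd

    # gzip: blocksize must be None; each file is one in-memory partition
    df = dd.read_csv("s3://bucket/data-*.csv.gz",
                     compression="gzip", blocksize=None)

    # uncompressed: blocksize keeps every partition small
    df = dd.read_csv("s3://bucket/data-*.csv", blocksize="64MB")
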
15 votes · 2 answers

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

I would like to add a new column to an existing dask dataframe based on the values of two existing columns; this involves a conditional statement for checking nulls: DataFrame definition import pandas as pd import dask.dataframe as dd df =…
ML_Passion · 1,031 · 3 · 15 · 33
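A minimal sketch for the conditional-column question above, using Series.where so the null check stays column-wise (no row-by-row apply). The column names x, y, z are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": [1, 2, None], "y": [10, None, 30]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # keep y where it is non-null, otherwise fall back to x
    ddf["z"] = ddf.y.where(~ddf.y.isnull(), ddf.x)
    print(ddf.compute())
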
14 votes · 2 answers

How to speed up import of large xlsx files?

I want to process a large 200 MB Excel (xlsx) file with 15 sheets (1 million rows with 5 columns each) and create a pandas dataframe from the data. The import of the Excel file is extremely slow (up to 10 minutes). Unfortunately, the Excel import…
pythoneer · 403 · 2 · 4 · 15
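One hedged approach to the xlsx question above: parse the 15 sheets in parallel with dask.delayed, since each sheet read is an independent pandas call. Parsing xlsx is CPU-bound, so a one-time conversion to CSV or Parquet is usually the bigger win; the file name is hypothetical.

    import pandas as pd
    import dask
    from dask import delayed

    # one delayed pandas read per sheet
    parts = [delayed(pd.read_excel)("big.xlsx", sheet_name=i)
             for i in range(15)]

    # run the reads in separate processes, then stitch together
    dfs = dask.compute(*parts, scheduler="processes")
    df = pd.concat(dfs, ignore_index=True)
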
14 votes · 3 answers

duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Following on from this question, when I try to create a postgresql table from a dask.dataframe with more than one partition I get the following error: IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint…
Ludo · 2,307 · 2 · 27 · 58
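A common cause of the error above is that each partition carries its own 0-based index, so index values repeat across partitions and collide when the index column is the table's primary key. A sketch that appends each partition without its index (connection string and table name hypothetical; ddf is the question's existing dask dataframe):

    import dask
    from dask import delayed
    from sqlalchemy import create_engine

    URI = "postgresql://user:pass@host/db"   # hypothetical

    def write_partition(pdf):
        engine = create_engine(URI)          # engines do not serialize; build per task
        pdf.to_sql("sales", engine, if_exists="append", index=False)

    writes = [delayed(write_partition)(p) for p in ddf.to_delayed()]
    dask.compute(*writes)
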
14 votes · 3 answers

Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Update: The pandas df was created like this: df = pd.read_sql(query, engine) encoded = pd.get_dummies(df, columns=['account']) Creating a dask df from this df looks like this: df = dd.from_pandas(encoded, 50) Performing the operation with dask…
OverflowingTheGlass · 2,324 · 1 · 27 · 75
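For the groupby question above, the memory pressure usually comes from building the full pandas frame first; keeping the pipeline in Dask end to end lets the aggregation run out of core. A sketch assuming the data can be loaded directly (path and column names hypothetical):

    import dask.dataframe as dd

    ddf = dd.read_parquet("transactions/")     # hypothetical source
    out = ddf.groupby("account").amount.sum()  # tree-reduced aggregation
    result = out.compute()                     # bounded memory per worker
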
14 votes · 3 answers

Parallelizing loading data from MongoDB into python

All documents in my collection in MongoDB have the same fields. My goal is to load them into Python into a pandas.DataFrame or dask.DataFrame. I'd like to speed up the loading procedure by parallelizing it. My plan is to spawn several processes or…
wl2776 · 4,099 · 4 · 35 · 77
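A sketch for the MongoDB question above: page through the collection with skip/limit in dask.delayed tasks, each of which opens its own client, then assemble a dask dataframe with from_delayed. URI, database, and collection names are hypothetical; skip-based paging is simple but slows down on very large collections.

    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed
    from pymongo import MongoClient

    def load_chunk(skip, limit):
        # each task opens its own connection; clients don't share across processes
        coll = MongoClient("mongodb://localhost:27017").mydb.mycoll
        return pd.DataFrame(list(coll.find().skip(skip).limit(limit)))

    N, CHUNK = 1_000_000, 100_000  # e.g. from coll.estimated_document_count()
    parts = [delayed(load_chunk)(i, CHUNK) for i in range(0, N, CHUNK)]
    ddf = dd.from_delayed(parts)
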
14 votes · 3 answers

dask DataFrame equivalent of pandas DataFrame sort_values

What would be the equivalent of sort_values in pandas for a dask DataFrame? I am trying to scale some pandas code which has memory issues to use a dask DataFrame instead. Would the equivalent be: ddf.set_index([col1, col2], sorted=True)?
femibyte · 3,317 · 7 · 34 · 59
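For the sort_values question above: the classic answer is set_index, which produces a frame globally sorted by one column; recent Dask releases also ship DataFrame.sort_values directly. A sketch with toy data:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"col1": [3, 1, 2]}), npartitions=2)

    # classic: set_index yields a frame globally sorted by col1
    ddf_sorted = ddf.set_index("col1")

    # on newer Dask releases, sort_values exists as well
    ddf_sorted = ddf.sort_values("col1")
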