Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, pandas, and scikit-learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial: https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions
16 votes · 2 answers

Comparison between Modin | Dask | Data.table | Pandas for parallel processing and out of memory csv files

What are the fundamental differences and primary use cases for Dask, Modin, and data.table? I checked the documentation of each library; all of them seem to offer a 'similar' solution to pandas' limitations
Shubham Samant · 171 · 1 · 5
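A minimal sketch of what these libraries have in common, using Dask as the example: the pandas-style API is kept, but evaluation is lazy, so a CSV larger than RAM can be aggregated chunk by chunk. File and column names here are hypothetical.

    import dask.dataframe as dd

    # lazily scan many CSV files; nothing is loaded yet
    df = dd.read_csv("huge-*.csv")

    # groupby builds a task graph; compute() runs it chunk by chunk
    result = df.groupby("key").value.mean().compute()
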
16 votes · 2 answers

Dask Dataframe: Get row count?

Simple question: I have a dataframe in dask containing about 300 million records. I need to know the exact number of rows that the dataframe contains. Is there an easy way to do this? When I try to run dataframe.x.count().compute() it looks like it…
usbToaster · 475 · 1 · 4 · 13
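For the row-count question above, two common idioms (a sketch, assuming the frame was loaded with dd.read_csv): len() on a dask dataframe triggers a computation, and map_partitions(len) counts rows per partition and sums them.

    import dask.dataframe as dd

    ddf = dd.read_csv("data-*.csv")  # hypothetical source

    # simplest: len() computes eagerly and returns the exact row count
    n = len(ddf)

    # equivalent: count each partition's rows, then sum lazily
    n = ddf.map_partitions(len).compute().sum()
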
16 votes · 1 answer

Slow len function on dask distributed dataframe

I have been testing how to use dask (a cluster with 20 cores) and I am surprised by the speed I get when calling a len function vs slicing through loc. import dask.dataframe as dd from dask.distributed import Client client =…
JuanPabloMF · 397 · 1 · 3 · 14
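The usual explanation for the question above is that len() on a distributed dataframe re-executes the whole loading graph unless the data has been materialized first. A sketch with persist(), assuming CSV input:

    import dask.dataframe as dd
    from dask.distributed import Client

    client = Client()                # connect to the scheduler
    ddf = dd.read_csv("data-*.csv")  # hypothetical source

    ddf = client.persist(ddf)        # load once into cluster memory
    n = len(ddf)                     # now only sums cached partition lengths
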
16 votes · 3 answers

How to read a compressed (gz) CSV file into a dask Dataframe?

Is there a way to read a .csv file that is compressed via gz into a dask dataframe? I've tried it directly with import dask.dataframe as dd df = dd.read_csv("Data.gz") but get a Unicode error (probably because it is interpreting the compressed…
Magellan88 · 2,543 · 3 · 24 · 36
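A sketch of the standard approach for the gzip question above: gzip is not a splittable format, so Dask must be told not to chunk the file; each .gz file then becomes exactly one partition.

    import dask.dataframe as dd

    # compression='gzip' decompresses on the fly;
    # blocksize=None is required because a gzip stream cannot be split
    df = dd.read_csv("Data.gz", compression="gzip", blocksize=None)
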
16 votes · 3 answers

Can I use functions imported from .py files in Dask/Distributed?

I have a question about serialization and imports. Should functions have their own imports, like I've seen done with PySpark? Is the following just plain wrong? Does mod.py need to be a conda/pip package? mod.py was written to a shared…
Albert DeFusco · 163 · 1 · 4
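For the question above, dask.distributed can ship a local module to every worker with Client.upload_file, so mod.py does not need to be packaged. A sketch with a hypothetical scheduler address and function name:

    from dask.distributed import Client

    client = Client("tcp://scheduler:8786")       # hypothetical address
    client.upload_file("mod.py")                  # send the module to all workers

    import mod                                    # import locally as usual
    futures = client.map(mod.my_func, range(10))  # hypothetical function
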
15 votes · 2 answers

Dask read_csv-- Mismatched dtypes found in `pd.read_csv`/`pd.read_table`

I'm trying to use dask to read a CSV file, and it gives me an error like the one below. The thing is, I want my ARTICLE_ID to be object (string). Can anyone help me read the data successfully? The traceback is below: ValueError: Mismatched dtypes found in…
Coffey Liu · 327 · 1 · 3 · 6
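The usual fix for the mismatched-dtypes question above is to pin the ambiguous column's dtype up front, so every block is parsed the same way:

    import dask.dataframe as dd

    # force ARTICLE_ID to be read as a string in every block
    df = dd.read_csv("data.csv", dtype={"ARTICLE_ID": "object"})
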
15 votes · 4 answers

R equivalent of Python's dask

Is there an equivalent package in R to Python's dask? Specifically for running Machine Learning algorithms on larger-than-memory data sets on a single machine. Link to Python's Dask page: https://dask.pydata.org/en/latest/ From the Dask…
Adam Bricknell · 161 · 1 · 6
15 votes · 2 answers

Dask dataframe split partitions based on a column or function

I have recently begun looking at Dask for big data. I have a question on efficiently applying operations in parallel. Say I have some sales data like this: customerKey productKey transactionKey grossSales netSales unitVolume …
Roger Thomas · 822 · 1 · 7 · 17
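For the partitioning question above, the standard tool is set_index, which shuffles the data so that all rows for a given key land in the same partition; per-key work can then run partition-locally. Column names follow the question; the file name is hypothetical.

    import dask.dataframe as dd

    ddf = dd.read_csv("sales-*.csv")    # hypothetical source
    ddf = ddf.set_index("customerKey")  # shuffle: a customer's rows share one partition

    # each partition is now a plain pandas frame for one key range
    result = ddf.map_partitions(
        lambda pdf: pdf.groupby("productKey").netSales.sum()
    )
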
15 votes · 1 answer

Why is dask read_csv from s3 keeping so much memory?

I'm reading in some gzipped data from s3, using dask (as a replacement for a SQL query). However, it looks like there is some caching of the data file, or of the unzipped file, somewhere that keeps it in system memory. NB this should be runnable; the test data…
jeremycg · 24,657 · 5 · 63 · 74
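Relevant background for the memory question above: gzipped objects are not splittable, so Dask decompresses each file whole into a single partition, and memory usage scales with the decompressed file size. A sketch contrasting the two cases (bucket paths hypothetical):

    import dask.dataframe as dd

    # gzip: blocksize must be None; each file is one in-memory partition
    df = dd.read_csv("s3://bucket/data-*.csv.gz",
                     compression="gzip", blocksize=None)

    # uncompressed: blocksize keeps every partition small
    df = dd.read_csv("s3://bucket/data-*.csv", blocksize="64MB")
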
15 votes · 2 answers

Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement

I would like to add a new column to an existing dask dataframe based on the values of two existing columns; this involves a conditional statement for checking nulls: DataFrame definition import pandas as pd import dask.dataframe as dd df =…
ML_Passion · 1,031 · 3 · 15 · 33
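A minimal sketch for the conditional-column question above, using Series.where so the null check stays column-wise (no row-by-row apply). The column names x, y, z are placeholders:

    import pandas as pd
    import dask.dataframe as dd

    pdf = pd.DataFrame({"x": [1, 2, None], "y": [10, None, 30]})
    ddf = dd.from_pandas(pdf, npartitions=2)

    # keep y where it is non-null, otherwise fall back to x
    ddf["z"] = ddf.y.where(~ddf.y.isnull(), ddf.x)
    print(ddf.compute())
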
14 votes · 2 answers

How to speed up import of large xlsx files?

I want to process a large 200 MB Excel (xlsx) file with 15 sheets (1 million rows with 5 columns each) and create a pandas dataframe from the data. The import of the Excel file is extremely slow (up to 10 minutes). Unfortunately, the Excel import…
pythoneer · 403 · 2 · 4 · 15
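One hedged approach to the xlsx question above: parse the 15 sheets in parallel with dask.delayed, since each sheet read is an independent pandas call. Parsing xlsx is CPU-bound, so a one-time conversion to CSV or Parquet is usually the bigger win; the file name is hypothetical.

    import pandas as pd
    import dask
    from dask import delayed

    # one delayed pandas read per sheet
    parts = [delayed(pd.read_excel)("big.xlsx", sheet_name=i)
             for i in range(15)]

    # run the reads in separate processes, then stitch together
    dfs = dask.compute(*parts, scheduler="processes")
    df = pd.concat(dfs, ignore_index=True)
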
14 votes · 3 answers

duplicate key value violates unique constraint - postgres error when trying to create sql table from dask dataframe

Following on from this question, when I try to create a postgresql table from a dask.dataframe with more than one partition I get the following error: IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint…
Ludo · 2,307 · 2 · 27 · 58
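A common cause of the error above is that each partition carries its own 0-based index, so index values repeat across partitions and collide when the index column is the table's primary key. A sketch that appends each partition without its index (connection string and table name hypothetical; ddf is the question's existing dask dataframe):

    import dask
    from dask import delayed
    from sqlalchemy import create_engine

    URI = "postgresql://user:pass@host/db"   # hypothetical

    def write_partition(pdf):
        engine = create_engine(URI)          # engines do not serialize; build per task
        pdf.to_sql("sales", engine, if_exists="append", index=False)

    writes = [delayed(write_partition)(p) for p in ddf.to_delayed()]
    dask.compute(*writes)
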
14 votes · 3 answers

Avoiding Memory Issues For GroupBy on Large Pandas DataFrame

Update: The pandas df was created like this: df = pd.read_sql(query, engine) encoded = pd.get_dummies(df, columns=['account']) Creating a dask df from this df looks like this: df = dd.from_pandas(encoded, 50) Performing the operation with dask…
OverflowingTheGlass · 2,324 · 1 · 27 · 75
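For the groupby question above, the memory pressure usually comes from building the full pandas frame first; keeping the pipeline in Dask end to end lets the aggregation run out of core. A sketch assuming the data can be loaded directly (path and column names hypothetical):

    import dask.dataframe as dd

    ddf = dd.read_parquet("transactions/")     # hypothetical source
    out = ddf.groupby("account").amount.sum()  # tree-reduced aggregation
    result = out.compute()                     # bounded memory per worker
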
14 votes · 3 answers

Parallelizing loading data from MongoDB into python

All documents in my collection in MongoDB have the same fields. My goal is to load them into Python into a pandas.DataFrame or dask.DataFrame. I'd like to speed up the loading procedure by parallelizing it. My plan is to spawn several processes or…
wl2776 · 4,099 · 4 · 35 · 77
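A sketch for the MongoDB question above: page through the collection with skip/limit in dask.delayed tasks, each of which opens its own client, then assemble a dask dataframe with from_delayed. URI, database, and collection names are hypothetical; skip-based paging is simple but slows down on very large collections.

    import pandas as pd
    import dask.dataframe as dd
    from dask import delayed
    from pymongo import MongoClient

    def load_chunk(skip, limit):
        # each task opens its own connection; clients don't share across processes
        coll = MongoClient("mongodb://localhost:27017").mydb.mycoll
        return pd.DataFrame(list(coll.find().skip(skip).limit(limit)))

    N, CHUNK = 1_000_000, 100_000  # e.g. from coll.estimated_document_count()
    parts = [delayed(load_chunk)(i, CHUNK) for i in range(0, N, CHUNK)]
    ddf = dd.from_delayed(parts)
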
14 votes · 3 answers

dask DataFrame equivalent of pandas DataFrame sort_values

What would be the equivalent of sort_values in pandas for a dask DataFrame? I am trying to scale some pandas code which has memory issues to use a dask DataFrame instead. Would the equivalent be: ddf.set_index([col1, col2], sorted=True)?
femibyte · 3,317 · 7 · 34 · 59
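For the sort_values question above: the classic answer is set_index, which produces a frame globally sorted by one column; recent Dask releases also ship DataFrame.sort_values directly. A sketch with toy data:

    import pandas as pd
    import dask.dataframe as dd

    ddf = dd.from_pandas(pd.DataFrame({"col1": [3, 1, 2]}), npartitions=2)

    # classic: set_index yields a frame globally sorted by col1
    ddf_sorted = ddf.set_index("col1")

    # on newer Dask releases, sort_values exists as well
    ddf_sorted = ddf.sort_values("col1")
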