Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
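
For a concrete feel of how the two pieces fit together, here is a minimal, illustrative sketch (not taken from the dask docs verbatim) that builds a lazy task graph with dask.delayed and a chunked parallel array with dask.array:

```python
import dask
import dask.array as da

# Dynamic task scheduling: delayed builds a task graph; nothing runs
# until compute() is called.
@dask.delayed
def inc(x):
    return x + 1

values = [inc(i) for i in range(4)]   # a list of lazy Delayed objects
total = dask.delayed(sum)(values)     # still lazy: just a bigger graph
print(total.compute())                # -> 10

# "Big Data" collections: a chunked array with a NumPy-like interface.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.mean().compute())             # -> 1.0, computed chunk by chunk
```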

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

3 votes · 1 answer

Apply json.loads for a column of dataframe with dask

I have a dataframe fulldb_accrep_united of the following kind: SparkID ... Period 0 913955 ... {"@PeriodName": "2000", "@DateBegin": "2000-01... 1 913955 ... {"@PeriodName": "1999", "@DateBegin":…

Gimbo • 41 • 6
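
A minimal sketch of one common answer to this kind of question, using Series.map with an explicit meta (the frame and values below are illustrative stand-ins for the question's data):

```python
import json
import pandas as pd
import dask.dataframe as dd

# Illustrative data: a string column holding JSON objects.
pdf = pd.DataFrame({
    "SparkID": [913955, 913955],
    "Period": ['{"@PeriodName": "2000"}', '{"@PeriodName": "1999"}'],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Series.map applies json.loads element-wise per partition; meta tells
# dask the name and dtype of the lazy result.
parsed = ddf["Period"].map(json.loads, meta=("Period", "object"))
print(parsed.compute().iloc[0]["@PeriodName"])  # -> '2000'
```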

3 votes · 1 answer

Is concat in dask dataframe a lazy operation?

I'm reading a list of files with dask read_parquet, concatenating the resulting data frames, and writing the result to a file. During the concatenation, does dask read all the data into memory, or does it load only the schemas and concatenate (I'm…

Learnis • 526 • 5 • 25
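
For reference, a small self-contained sketch showing that both read_parquet and concat only build a task graph; data moves only when something forces execution:

```python
import pandas as pd
import dask.dataframe as dd

# Write two small parquet files so the example is self-contained.
pd.DataFrame({"x": [1, 2]}).to_parquet("a.parquet")
pd.DataFrame({"x": [3, 4]}).to_parquet("b.parquet")

# read_parquet only touches metadata (schemas, row-group stats) here.
parts = [dd.read_parquet(p) for p in ["a.parquet", "b.parquet"]]

# concat is lazy as well: it extends the task graph without loading data.
combined = dd.concat(parts)

# Data streams partition by partition only when a write or compute()
# finally triggers execution.
combined.to_parquet("combined/")
```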

3 votes · 1 answer

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns, each with millions of distinct values. So I am using dask and pd.get_dummies to convert these categorical columns into bit vectors. Like this: import pandas as pd import numpy as…

learner • 857 • 1 • 14 • 28
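
One hedged sketch of the usual pattern: dask's own get_dummies works once the categories are known, which categorize() provides (the column name below is invented):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# get_dummies needs known categories so every partition produces the
# same columns; categorize() scans the data once to learn them.
ddf = ddf.categorize(columns=["color"])
dummies = dd.get_dummies(ddf, columns=["color"])
print(dummies.compute())
```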

3 votes · 1 answer

Merge large datasets with dask

I have two datasets: one is around 45GB and contains daily transactions for one year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds…

Dr.Fykos • 90 • 1 • 8
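
A sketch of the common large-joins-small pattern this question usually resolves to; file names and columns are placeholders, and the tiny stand-in data only illustrates the shapes:

```python
import pandas as pd
import dask.dataframe as dd

# Tiny stand-ins: in practice the transactions side is the 45GB
# out-of-core frame, while the customer table fits in plain pandas.
pd.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 5.0, 7.5]}
             ).to_csv("transactions-0.csv", index=False)
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Oslo", "Bergen"]})

transactions = dd.read_csv("transactions-*.csv")

# Merging a dask frame with a plain pandas frame broadcasts the small
# table to every partition, avoiding a full shuffle of the big side.
merged = transactions.merge(customers, on="customer_id", how="left")
print(merged.compute())
```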

3 votes · 1 answer

Dask dataframe containing json format column

I have a dask dataframe containing a column in JSON format, and I want to parse the column into dataframe format. The column in JSON format looks like: {"Name": {"id": 1000, "address": "ABC", ....}},,, So I want to extract only the value of "Name", and…

SayZ • 101 • 1 • 5
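
One possible approach, sketched with map_partitions and pandas' json_normalize (the column names and the meta frame are assumptions, not the asker's schema):

```python
import json
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"payload": ['{"Name": {"id": 1000, "address": "ABC"}}',
                                '{"Name": {"id": 1001, "address": "DEF"}}']})
ddf = dd.from_pandas(pdf, npartitions=2)

def expand(part: pd.DataFrame) -> pd.DataFrame:
    # Parse each JSON string, then flatten the nested "Name" object
    # into flat columns ("Name.id", "Name.address").
    return pd.json_normalize(part["payload"].map(json.loads).tolist())

meta = pd.DataFrame({"Name.id": pd.Series(dtype="int64"),
                     "Name.address": pd.Series(dtype="object")})
result = ddf.map_partitions(expand, meta=meta)
print(result.compute())
```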

3 votes · 3 answers

Python multiprocessing throws Killed: 9

I am trying to use multiprocessing to speed up a function where I tile 2000 arrays of shape (76, 76) into 3D arrays and apply a scaling factor. It works fine when the number of tiles is less than about 200, but I get Killed: 9 when it's greater…

Joe Flip • 1,076 • 4 • 21 • 37
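
Not the asker's code, but a sketch of how the same tiling job can be expressed with dask.array so that memory stays bounded per chunk (the scaling factor and the final reduction are placeholders):

```python
import dask.array as da

# Each (76, 76) tile is one small lazy chunk, so dask holds only a
# handful of tiles in memory at a time instead of all 2000 at once.
tiles = [da.ones((76, 76), chunks=(76, 76)) for _ in range(2000)]
stack = da.stack(tiles)          # lazy 3D array of shape (2000, 76, 76)

scaled = stack * 1.5             # scaling factor, applied chunk by chunk
print(scaled.mean().compute())   # streams chunks instead of blowing up
```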

3 votes · 2 answers

Would it make sense to use Snakemake and Dask together?

I have a Snakemake workflow that I've been using to train DL TensorFlow models. At a high level there are a few longish-running jobs (model training) that can be run in parallel. I would like to run these on the cloud, and dask-cloudprovider seems…

j sad • 1,055 • 9 • 16
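
A rough sketch of one division of labor, assuming dask-cloudprovider's FargateCluster (any of its other cluster classes fits the same shape); the train function and its arguments are stand-ins:

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster  # pip install dask-cloudprovider

# One workable split: keep Snakemake as the workflow orchestrator and,
# inside a single long-running rule, fan work out to a cloud cluster.
cluster = FargateCluster(n_workers=4)
client = Client(cluster)

def train(config):
    # Placeholder for one model-training job.
    return {"config": config, "loss": 0.0}

futures = client.map(train, [{"lr": 1e-3}, {"lr": 1e-4}])
print(client.gather(futures))
```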

3 votes · 0 answers

Dask dataframe index_col

I am dealing with large (20-100GB) tab-delimited text files which I am able to import correctly into pandas with the index_col=False option. The Dask dataframe does not support the index_col parameter. I am able to work around it, but I'm curious if there…

user2234151 • 131 • 1 • 6
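
For context, a short sketch of the usual workaround; the file pattern and the "id" column name are placeholders:

```python
import dask.dataframe as dd

# Like index_col=False in pandas, dask's read_csv never builds an index
# from a column; every partition just gets a default integer index.
ddf = dd.read_csv("big-*.tsv", sep="\t")   # placeholder file pattern

# If a real index is wanted later, set it explicitly -- but note that
# set_index shuffles the data, so do it once and keep the result.
ddf = ddf.set_index("id")                  # "id" is a placeholder column
```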

3 votes · 2 answers

How to load and process zarr files using dask and xarray

I have monthly zarr files in S3 that have gridded temperature data. I would like to pull down multiple months of data for one lat/lon and create a dataframe of that time series. Some pseudo code: datasets=[] for file in files: s3 =…

David • 181 • 13
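
A hedged sketch of one way this is often wired up with xarray (the store paths, the variable name temperature, and the coordinates are all invented; reading from s3:// URLs assumes s3fs is installed):

```python
import xarray as xr

# Hypothetical monthly zarr stores in S3.
paths = ["s3://bucket/temps-2021-01.zarr", "s3://bucket/temps-2021-02.zarr"]

# open_zarr is lazy (dask-backed); concatenating along time keeps it lazy.
ds = xr.concat([xr.open_zarr(p) for p in paths], dim="time")

# Select one grid point, then materialize just that series as a dataframe.
point = ds["temperature"].sel(lat=40.0, lon=-105.0, method="nearest")
df = point.to_dataframe()
```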

3 votes · 1 answer

dask.delayed KeyError with distributed scheduler

I have a function interpolate_to_particles written in C and wrapped with ctypes. I want to use dask.delayed to make a series of calls to this function. The code runs successfully without dask: # Interpolate w/o dask result =…

elltrain • 82 • 4
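
A runnable sketch of the usual fix: create the ctypes handle inside the task, since CDLL objects are not picklable and cannot be shipped to workers. libm stands in here for the asker's wrapped C function (the library name is Linux-specific):

```python
import ctypes

import dask
from dask import delayed
from dask.distributed import Client

def call_c(x: float) -> float:
    # Build the ctypes handle on the worker, not in the driver process,
    # because CDLL objects cannot be serialized and sent over the wire.
    libm = ctypes.CDLL("libm.so.6")            # Linux; stand-in library
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    return libm.sqrt(x)

if __name__ == "__main__":
    client = Client()                          # local distributed scheduler
    tasks = [delayed(call_c)(float(i)) for i in range(10)]
    print(dask.compute(*tasks))
```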

3 votes · 1 answer

Easy way to print out dask series/dataframe?

In pandas, there are lots of methods like head, tail, loc, iloc that can be used to see the data inside, but whenever I call one of these methods on dask, all I get is: Dask DataFrame Structure: Close npartitions=1 bool …

Biarys • 1,065 • 1 • 10 • 22
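
A small self-contained sketch of the distinction: printing a dask frame shows only its structure, while head() and compute() actually materialize data:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"Close": [True, False, True]}),
                     npartitions=1)

print(ddf)            # structure only: columns, dtypes, npartitions
print(ddf.head(2))    # computes just enough partitions to show 2 rows
print(ddf.compute())  # materializes the whole thing as a pandas frame
```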

3 votes · 1 answer

Dask's parallel for loop slower than single core

What I've tried: I have an embarrassingly parallel for loop in which I iterate over 90x360 values in two nested for loops and do some computation. I tried dask.delayed to parallelize the for loops as per this tutorial, although it is demonstrated for…

Light_B • 1,660 • 1 • 14 • 28
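
A sketch of the standard remedy: per-task overhead dwarfs tiny tasks, so batch one row of the 90x360 grid per delayed call rather than one call per cell (the per-cell computation is a placeholder):

```python
import dask

def cell(i, j):
    # Placeholder for the real per-cell computation.
    return i * j

# Each delayed task costs on the order of a millisecond of overhead,
# which is the usual reason a delayed loop over tiny units runs slower
# than a single core. Batching 360 cells per task amortizes that cost.
@dask.delayed
def row(i):
    return [cell(i, j) for j in range(360)]

results = dask.compute(*[row(i) for i in range(90)])
```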

3 votes · 0 answers

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID". However, while reading from…

Kalendil • 31 • 1
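
For reference, a sketch of the usual suspect with categorical columns and indexes: parquet round-trips tend to leave dask with "unknown" categoricals, and either categorize() or a cast sidesteps the failure (the data below is invented):

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(
    pd.DataFrame({"ID": pd.Categorical(["a", "b", "a"]), "x": [1, 2, 3]}),
    npartitions=2)
right = pd.DataFrame({"ID": pd.Categorical(["a", "b"]), "y": [10, 20]})

# Make the categories known before merging/indexing...
left = left.categorize(columns=["ID"])
merged = left.merge(right, on="ID")
print(merged.compute())
# ...or sidestep categoricals entirely:
# left["ID"] = left["ID"].astype(str)
```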

3 votes · 1 answer

Count no. of rows from large parquet files using dask without memory errors

I have 20 parquet files, each about 5GB in size. I want to count the number of records in the whole dataset. I currently have this code: from dask.distributed import Client, LocalCluster cluster = LocalCluster(n_workers=8, threads_per_worker=1) client =…

Nihhaar • 141 • 11
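
A hedged sketch of one way to keep the count cheap: read a single narrow column so no partition ever holds full rows (the path pattern and the "id" column name are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(cluster)

    # Reading one small column keeps per-worker memory tiny while still
    # visiting every row group in every file.
    ddf = dd.read_parquet("data/*.parquet", columns=["id"])
    print(len(ddf))  # computes per-partition lengths and sums them
```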

3 votes · 1 answer

Can I use dask.delayed on a function wrapped with ctypes?

The goal is to use dask.delayed to parallelize some 'embarrassingly parallel' sections of my code. The code involves calling a Python function which wraps a C function using ctypes. To understand the errors I was getting, I wrote a very basic…

elltrain • 82 • 4
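
Complementing the sketch under the distributed-scheduler question above: a module-level wrapper usually pickles cleanly, because dask serializes functions by reference (module plus name), letting each worker rebuild the un-picklable CDLL handle when it imports the module. libm again stands in for the real C code, and mylib.py is a hypothetical module name:

```python
# mylib.py -- define the ctypes wrapper at module top level so workers
# can re-import it; only the reference to c_fabs is ever pickled.
import ctypes

_libm = ctypes.CDLL("libm.so.6")        # Linux; stand-in for the C code
_libm.fabs.restype = ctypes.c_double
_libm.fabs.argtypes = [ctypes.c_double]

def c_fabs(x: float) -> float:
    return _libm.fabs(x)

# Elsewhere, the delayed calls then work as usual:
# import dask
# from mylib import c_fabs
# tasks = [dask.delayed(c_fabs)(float(-i)) for i in range(5)]
# print(dask.compute(*tasks))
```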