Questions tagged [dask]

Dask is a parallel computing and data analytics library for Python. It supports dynamic task scheduling optimized for computation as well as big data collections.

Dask is open source and freely available. It is developed in coordination with other community projects like NumPy, Pandas, and Scikit-Learn.

Dask is composed of two components:

  • Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
  • “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.
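
For a concrete feel of how the two pieces fit together, here is a minimal, illustrative sketch (not taken from the dask docs verbatim) that builds a lazy task graph with dask.delayed and a chunked parallel array with dask.array:

```python
import dask
import dask.array as da

# Dynamic task scheduling: delayed builds a task graph; nothing runs
# until compute() is called.
@dask.delayed
def inc(x):
    return x + 1

values = [inc(i) for i in range(4)]   # a list of lazy Delayed objects
total = dask.delayed(sum)(values)     # still lazy: just a bigger graph
print(total.compute())                # -> 10

# "Big Data" collections: a chunked array with a NumPy-like interface.
x = da.ones((1000, 1000), chunks=(250, 250))
print(x.mean().compute())             # -> 1.0, computed chunk by chunk
```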

Install: https://docs.dask.org/en/latest/install.html

Docs: https://docs.dask.org/

Tutorial (GitHub): https://github.com/dask/dask-tutorial

Main Page: https://dask.org/

4440 questions

3 votes · 1 answer

Apply json.loads for a column of dataframe with dask

I have a dataframe fulldb_accrep_united of the following kind: SparkID ... Period 0 913955 ... {"@PeriodName": "2000", "@DateBegin": "2000-01... 1 913955 ... {"@PeriodName": "1999", "@DateBegin":…

Gimbo • 41 • 6
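
A minimal sketch of one common answer to this kind of question, using Series.map with an explicit meta (the frame and values below are illustrative stand-ins for the question's data):

```python
import json
import pandas as pd
import dask.dataframe as dd

# Illustrative data: a string column holding JSON objects.
pdf = pd.DataFrame({
    "SparkID": [913955, 913955],
    "Period": ['{"@PeriodName": "2000"}', '{"@PeriodName": "1999"}'],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Series.map applies json.loads element-wise per partition; meta tells
# dask the name and dtype of the lazy result.
parsed = ddf["Period"].map(json.loads, meta=("Period", "object"))
print(parsed.compute().iloc[0]["@PeriodName"])  # -> '2000'
```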

3 votes · 1 answer

Is concat in dask dataframe a lazy operation?

I'm reading a list of files with dask read_parquet, concatenating the resulting data frames, and writing the result to a file. During the concatenation, does dask read all the data into memory, or does it load only the schemas and concatenate (I'm…

Learnis • 526 • 5 • 25
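
For reference, a small self-contained sketch showing that both read_parquet and concat only build a task graph; data moves only when something forces execution:

```python
import pandas as pd
import dask.dataframe as dd

# Write two small parquet files so the example is self-contained.
pd.DataFrame({"x": [1, 2]}).to_parquet("a.parquet")
pd.DataFrame({"x": [3, 4]}).to_parquet("b.parquet")

# read_parquet only touches metadata (schemas, row-group stats) here.
parts = [dd.read_parquet(p) for p in ["a.parquet", "b.parquet"]]

# concat is lazy as well: it extends the task graph without loading data.
combined = dd.concat(parts)

# Data streams partition by partition only when a write or compute()
# finally triggers execution.
combined.to_parquet("combined/")
```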

3 votes · 1 answer

Running df.apply, dask and pd.get_dummies together

I have multiple categorical columns, each with millions of distinct values. So I am using dask and pd.get_dummies to convert these categorical columns into bit vectors. Like this: import pandas as pd import numpy as…

learner • 857 • 1 • 14 • 28
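
One hedged sketch of the usual pattern: dask's own get_dummies works once the categories are known, which categorize() provides (the column name below is invented):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"color": ["red", "green", "red", "blue"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# get_dummies needs known categories so every partition produces the
# same columns; categorize() scans the data once to learn them.
ddf = ddf.categorize(columns=["color"])
dummies = dd.get_dummies(ddf, columns=["color"])
print(dummies.compute())
```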

3 votes · 1 answer

Merge large datasets with dask

I have two datasets: one is around 45GB and contains daily transactions for one year, and the second is 3.6GB and contains customer IDs and details. I want to merge the two on a common column to create a single dataset, which exceeds…

Dr.Fykos • 90 • 1 • 8
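
A sketch of the common large-joins-small pattern this question usually resolves to; file names and columns are placeholders, and the tiny stand-in data only illustrates the shapes:

```python
import pandas as pd
import dask.dataframe as dd

# Tiny stand-ins: in practice the transactions side is the 45GB
# out-of-core frame, while the customer table fits in plain pandas.
pd.DataFrame({"customer_id": [1, 2, 1], "amount": [10.0, 5.0, 7.5]}
             ).to_csv("transactions-0.csv", index=False)
customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Oslo", "Bergen"]})

transactions = dd.read_csv("transactions-*.csv")

# Merging a dask frame with a plain pandas frame broadcasts the small
# table to every partition, avoiding a full shuffle of the big side.
merged = transactions.merge(customers, on="customer_id", how="left")
print(merged.compute())
```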

3 votes · 1 answer

Dask dataframe containing json format column

I have a dask dataframe containing a column in JSON format, and I want to parse the column into dataframe format. The column in JSON format looks like: {"Name": {"id": 1000, "address": "ABC", ....}},,, So I want to extract only the value of "Name", and…

SayZ • 101 • 1 • 5
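
One possible approach, sketched with map_partitions and pandas' json_normalize (the column names and the meta frame are assumptions, not the asker's schema):

```python
import json
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"payload": ['{"Name": {"id": 1000, "address": "ABC"}}',
                                '{"Name": {"id": 1001, "address": "DEF"}}']})
ddf = dd.from_pandas(pdf, npartitions=2)

def expand(part: pd.DataFrame) -> pd.DataFrame:
    # Parse each JSON string, then flatten the nested "Name" object
    # into flat columns ("Name.id", "Name.address").
    return pd.json_normalize(part["payload"].map(json.loads).tolist())

meta = pd.DataFrame({"Name.id": pd.Series(dtype="int64"),
                     "Name.address": pd.Series(dtype="object")})
result = ddf.map_partitions(expand, meta=meta)
print(result.compute())
```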

3 votes · 3 answers

Python multiprocessing throws Killed: 9

I am trying to use multiprocessing to speed up a function where I tile 2000 arrays of shape (76, 76) into 3D arrays and apply a scaling factor. It works fine when the number of tiles is less than about 200, but I get Killed: 9 when it's greater…

Joe Flip • 1,076 • 4 • 21 • 37
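
Not the asker's code, but a sketch of how the same tiling job can be expressed with dask.array so that memory stays bounded per chunk (the scaling factor and the final reduction are placeholders):

```python
import dask.array as da

# Each (76, 76) tile is one small lazy chunk, so dask holds only a
# handful of tiles in memory at a time instead of all 2000 at once.
tiles = [da.ones((76, 76), chunks=(76, 76)) for _ in range(2000)]
stack = da.stack(tiles)          # lazy 3D array of shape (2000, 76, 76)

scaled = stack * 1.5             # scaling factor, applied chunk by chunk
print(scaled.mean().compute())   # streams chunks instead of blowing up
```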

3 votes · 2 answers

Would it make sense to use Snakemake and Dask together?

I have a Snakemake workflow that I've been using to train DL TensorFlow models. At a high level there are a few longish-running jobs (model training) that can be run in parallel. I would like to run these on the cloud, and dask-cloudprovider seems…

j sad • 1,055 • 9 • 16
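
A rough sketch of one division of labor, assuming dask-cloudprovider's FargateCluster (any of its other cluster classes fits the same shape); the train function and its arguments are stand-ins:

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster  # pip install dask-cloudprovider

# One workable split: keep Snakemake as the workflow orchestrator and,
# inside a single long-running rule, fan work out to a cloud cluster.
cluster = FargateCluster(n_workers=4)
client = Client(cluster)

def train(config):
    # Placeholder for one model-training job.
    return {"config": config, "loss": 0.0}

futures = client.map(train, [{"lr": 1e-3}, {"lr": 1e-4}])
print(client.gather(futures))
```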

3 votes · 0 answers

Dask dataframe index_col

I am dealing with large (20-100GB) tab-delimited text files which I am able to import correctly into pandas with the index_col=False option. The Dask dataframe does not support the index_col parameter. I am able to work around it, but I'm curious if there…

user2234151 • 131 • 1 • 6
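
For context, a short sketch of the usual workaround; the file pattern and the "id" column name are placeholders:

```python
import dask.dataframe as dd

# Like index_col=False in pandas, dask's read_csv never builds an index
# from a column; every partition just gets a default integer index.
ddf = dd.read_csv("big-*.tsv", sep="\t")   # placeholder file pattern

# If a real index is wanted later, set it explicitly -- but note that
# set_index shuffles the data, so do it once and keep the result.
ddf = ddf.set_index("id")                  # "id" is a placeholder column
```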

3 votes · 2 answers

How to load and process zarr files using dask and xarray

I have monthly zarr files in S3 that have gridded temperature data. I would like to pull down multiple months of data for one lat/lon and create a dataframe of that time series. Some pseudo code: datasets=[] for file in files: s3 =…

David • 181 • 13
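
A hedged sketch of one way this is often wired up with xarray (the store paths, the variable name temperature, and the coordinates are all invented; reading from s3:// URLs assumes s3fs is installed):

```python
import xarray as xr

# Hypothetical monthly zarr stores in S3.
paths = ["s3://bucket/temps-2021-01.zarr", "s3://bucket/temps-2021-02.zarr"]

# open_zarr is lazy (dask-backed); concatenating along time keeps it lazy.
ds = xr.concat([xr.open_zarr(p) for p in paths], dim="time")

# Select one grid point, then materialize just that series as a dataframe.
point = ds["temperature"].sel(lat=40.0, lon=-105.0, method="nearest")
df = point.to_dataframe()
```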

3 votes · 1 answer

dask.delayed KeyError with distributed scheduler

I have a function interpolate_to_particles written in C and wrapped with ctypes. I want to use dask.delayed to make a series of calls to this function. The code runs successfully without dask: # Interpolate w/o dask result =…

elltrain • 82 • 4
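
A runnable sketch of the usual fix: create the ctypes handle inside the task, since CDLL objects are not picklable and cannot be shipped to workers. libm stands in here for the asker's wrapped C function (the library name is Linux-specific):

```python
import ctypes

import dask
from dask import delayed
from dask.distributed import Client

def call_c(x: float) -> float:
    # Build the ctypes handle on the worker, not in the driver process,
    # because CDLL objects cannot be serialized and sent over the wire.
    libm = ctypes.CDLL("libm.so.6")            # Linux; stand-in library
    libm.sqrt.restype = ctypes.c_double
    libm.sqrt.argtypes = [ctypes.c_double]
    return libm.sqrt(x)

if __name__ == "__main__":
    client = Client()                          # local distributed scheduler
    tasks = [delayed(call_c)(float(i)) for i in range(10)]
    print(dask.compute(*tasks))
```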

3 votes · 1 answer

Easy way to print out dask series/dataframe?

In pandas, there are lots of methods like head, tail, loc, iloc that can be used to see the data inside, but whenever I call one of these methods on dask, all I get is: Dask DataFrame Structure: Close npartitions=1 bool …

Biarys • 1,065 • 1 • 10 • 22
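
A small self-contained sketch of the distinction: printing a dask frame shows only its structure, while head() and compute() actually materialize data:

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"Close": [True, False, True]}),
                     npartitions=1)

print(ddf)            # structure only: columns, dtypes, npartitions
print(ddf.head(2))    # computes just enough partitions to show 2 rows
print(ddf.compute())  # materializes the whole thing as a pandas frame
```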

3 votes · 1 answer

Dask's parallel for loop slower than single core

What I've tried: I have an embarrassingly parallel for loop in which I iterate over 90x360 values in two nested for loops and do some computation. I tried dask.delayed to parallelize the for loops as per this tutorial, although it is demonstrated for…

Light_B • 1,660 • 1 • 14 • 28
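
A sketch of the standard remedy: per-task overhead dwarfs tiny tasks, so batch one row of the 90x360 grid per delayed call rather than one call per cell (the per-cell computation is a placeholder):

```python
import dask

def cell(i, j):
    # Placeholder for the real per-cell computation.
    return i * j

# Each delayed task costs on the order of a millisecond of overhead,
# which is the usual reason a delayed loop over tiny units runs slower
# than a single core. Batching 360 cells per task amortizes that cost.
@dask.delayed
def row(i):
    return [cell(i, j) for j in range(360)]

results = dask.compute(*[row(i) for i in range(90)])
```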

3 votes · 0 answers

Issue when computing/merging dask dataframe(s) when index is categorical

I'm trying to use dask to process a dataset which does not fit into memory. It's time series data for various "IDs". After reading dask documentation, I chose to use the "parquet" file format and partitioning by "ID". However, while reading from…

Kalendil • 31 • 1
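
For reference, a sketch of the usual suspect with categorical columns and indexes: parquet round-trips tend to leave dask with "unknown" categoricals, and either categorize() or a cast sidesteps the failure (the data below is invented):

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(
    pd.DataFrame({"ID": pd.Categorical(["a", "b", "a"]), "x": [1, 2, 3]}),
    npartitions=2)
right = pd.DataFrame({"ID": pd.Categorical(["a", "b"]), "y": [10, 20]})

# Make the categories known before merging/indexing...
left = left.categorize(columns=["ID"])
merged = left.merge(right, on="ID")
print(merged.compute())
# ...or sidestep categoricals entirely:
# left["ID"] = left["ID"].astype(str)
```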

3 votes · 1 answer

Count no. of rows from large parquet files using dask without memory errors

I have 20 parquet files, each about 5GB in size. I want to count the number of records in the whole dataset. I currently have this code: from dask.distributed import Client, LocalCluster cluster = LocalCluster(n_workers=8, threads_per_worker=1) client =…

Nihhaar • 141 • 11
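
A hedged sketch of one way to keep the count cheap: read a single narrow column so no partition ever holds full rows (the path pattern and the "id" column name are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=8, threads_per_worker=1)
    client = Client(cluster)

    # Reading one small column keeps per-worker memory tiny while still
    # visiting every row group in every file.
    ddf = dd.read_parquet("data/*.parquet", columns=["id"])
    print(len(ddf))  # computes per-partition lengths and sums them
```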

3 votes · 1 answer

Can I use dask.delayed on a function wrapped with ctypes?

The goal is to use dask.delayed to parallelize some 'embarrassingly parallel' sections of my code. The code involves calling a Python function which wraps a C function using ctypes. To understand the errors I was getting, I wrote a very basic…

elltrain • 82 • 4
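
Complementing the sketch under the distributed-scheduler question above: a module-level wrapper usually pickles cleanly, because dask serializes functions by reference (module plus name), letting each worker rebuild the un-picklable CDLL handle when it imports the module. libm again stands in for the real C code, and mylib.py is a hypothetical module name:

```python
# mylib.py -- define the ctypes wrapper at module top level so workers
# can re-import it; only the reference to c_fabs is ever pickled.
import ctypes

_libm = ctypes.CDLL("libm.so.6")        # Linux; stand-in for the C code
_libm.fabs.restype = ctypes.c_double
_libm.fabs.argtypes = [ctypes.c_double]

def c_fabs(x: float) -> float:
    return _libm.fabs(x)

# Elsewhere, the delayed calls then work as usual:
# import dask
# from mylib import c_fabs
# tasks = [dask.delayed(c_fabs)(float(-i)) for i in range(5)]
# print(dask.compute(*tasks))
```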