Parallel excel sheet read from dask

Question

Hello All the examples that I came across for using dask thus far has been multiple csv files in a folder being read using dask read_csv call.

if I am provided an xlsx file with multiple tabs, can I use anything in dask to read them parallely?

P.S. I am using pandas 0.19.2 with python 2.7

You would be best to write a function to read one tab (taking the tab ID as input), and look into dask's `delayed` function. Are you wanting to process all the tabs as a single data-frame? — mdurant, Jun 20 '17 at 14:50
This notebook may be of interest: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca — MRocklin, Jun 21 '17 at 04:50

score 5 · Answer 1 · answered Nov 23 '18 at 11:25

5

For those using Python 3.6:

#reading the file using dask
import dask
import dask.dataframe as dd
from dask.delayed import delayed

parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols = [1, 2, 7])
df = dd.from_delayed(parts)

print(df.head())

I'm seeing a 50% speed increase on load on a i7, 16GB 5th Gen machine.

answered Nov 23 '18 at 11:25

zorze

198
1
6

1

Unfortunately, I dont see any improvement. – Dinh Quang Tuan May 23 '21 at 10:36

mdurant · Answer 2 · 2019-02-05T04:41:56.933

3

A simple example

fn = 'my_file.xlsx'
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options) 
         for i in range(number_of_sheets)]
df = dd.from_delayed(parts, meta=parts[0].compute())

Assuming you provide the "other options" to extract the data (which is uniform across sheets) and you want to make a single master data-frame out of the set.

Note that I don't know the internals of the excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory would definitely be.

edited Feb 05 '19 at 04:41

answered Jun 21 '17 at 14:55

mdurant

27,272
5
45
74

Hi @mdurant, thanks for this answer! Did you mean to include brackets: `[parts = dask.delayed(pd.read_excel)(fn, i, **other_options) for i in range(number_of_sheets)]`. – edesz Feb 05 '19 at 03:25
Aye, I suppose so – mdurant Feb 05 '19 at 04:42
And I meant to use `parts = [dask....`, not my typo `[parts = dask...`. – edesz Feb 05 '19 at 06:01

Parallel excel sheet read from dask

2 Answers2

Linked