5

Hello All the examples that I came across for using dask thus far has been multiple csv files in a folder being read using dask read_csv call.

if I am provided an xlsx file with multiple tabs, can I use anything in dask to read them parallely?

P.S. I am using pandas 0.19.2 with python 2.7

schuler
  • 175
  • 2
  • 4
  • 12
  • 1
    You would be best to write a function to read one tab (taking the tab ID as input), and look into dask's `delayed` function. Are you wanting to process all the tabs as a single data-frame? – mdurant Jun 20 '17 at 14:50
  • 1
    This notebook may be of interest: https://gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca – MRocklin Jun 21 '17 at 04:50

2 Answers2

5

For those using Python 3.6:

#reading the file using dask
import dask
import dask.dataframe as dd
from dask.delayed import delayed

parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols = [1, 2, 7])
df = dd.from_delayed(parts)

print(df.head())

I'm seeing a 50% speed increase on load on a i7, 16GB 5th Gen machine.

zorze
  • 198
  • 1
  • 6
3

A simple example

fn = 'my_file.xlsx'
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options) 
         for i in range(number_of_sheets)]
df = dd.from_delayed(parts, meta=parts[0].compute())

Assuming you provide the "other options" to extract the data (which is uniform across sheets) and you want to make a single master data-frame out of the set.

Note that I don't know the internals of the excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory would definitely be.

mdurant
  • 27,272
  • 5
  • 45
  • 74