How to read sheet names of excel sheet from S3 in AWS Wrangler?

Question

I have an excel sheet which is placed in S3 and I want to read sheet names of excel sheet. I have read excel sheet with aws wrangler using awswrangler.s3.read_excel(path) How can I read sheetnames using AWS Wrangler using Python?

score 0 · Answer 1 · answered Sep 09 '22 at 20:16

I'm not sure you can in Wrangler, or at least I haven't been able to figure it out. You can use Wrangler to download the sheet to a temporary file, then use pyxlsb/openpyxl (using both to cover all formats):

from openpyxl import load_workbook
from pyxlsb import open_workbook
import awswrangler as wr
import os
import pandas as pd

s3_src = 's3://bucket/folder/workbook.xlsb'
filename = os.path.basename(s3_src)

wr.s3.download(path=s3_src, local_file=filename)

if filename.endswith('.xlsb'):
    workbook = open_workbook(filename)
    sheets = workbook.sheets
else:
    workbook = load_workbook(filename)
    sheets = workbook.sheetnames

# Load all sheets into an array of dataframes
dfs = [pd.read_excel(filename, sheet_name=s) for s in sheets]

# Or now that you have the sheet names, load using Wrangler
dfs = [wr.s3.read_excel(s3_src, sheet_name=s) for s in sheets]

You could extract the names of the sheets & pass them as inputs to another process that does the extraction.

score 0 · Answer 2 · answered Jun 21 '23 at 17:39

0

Wrangler does not support this. Pandas does, but pandas does not play nice with aws_lambda.

import io
import boto3
import openpyxl

session = boto3.Session()
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=my_bucket, Key=my_key)  
body = obj['Body'].read()
wb = openpyxl.load_workbook(io.BytesIO(body))
sheet_names = wb.get_sheet_names()

answered Jun 21 '23 at 17:39

DaveP

259
4
16

What does this mean? "pandas does not play nice with aws_lambda" If they have the `awswrangler` layer available in their Lambda function, then they also have a working version of Pandas available in their Lambda function. – Mark B Jun 21 '23 at 17:43
What I meant is that lambda layers do not work (or not easily?) for compiled packages like pandas. When sheet_name=None in pandas it returns all Excel sheets and their df. When sheet_name=None in Wrangler it returns the first sheet. So you have to know the sheet names in advance in order to use Wrangler this way. – DaveP Jun 22 '23 at 14:08
So Wrangler works different than plain Pandas? Is that what you are trying to point out? I still don't get how that has anything to do with AWS Lambda layers specifically. If you use the Wrangler package locally in a python script, completely outside of Lambda, wouldn't Wrangler still work the same? You seem to be implying that compiling packages into Lambda layers somehow changes their behavior, or causes them to be buggy, which just isn't the case. – Mark B Jun 22 '23 at 14:59

score -1 · Answer 3 · answered Feb 07 '22 at 13:12

-1

According to the awswrangler docs of the read_excel() function:

This function accepts any Pandas’s read_excel() argument.

And in pandas:

sheet_name : str, int, list, or None, default 0

so you could try something like this:

import awswrangler as wr
wr.s3.read_excel(file_uri,sheet_name=your_sheet)

I am currently facing a similar problem in AWS Glue, but did not manage to get it working yet.

answered Feb 07 '22 at 13:12

Kacper Starościak

1
1

Question is, "How can I read sheetnames using AWS Wrangler?" – DaveP Jun 21 '23 at 16:03

How to read sheet names of excel sheet from S3 in AWS Wrangler?

3 Answers3