
I am trying to import an Excel file with multiple sheets. Based on what I have read, Glue 2.0 can read Excel files. I tried the code below and the job was successful, but I am lost as to how I am supposed to run crawlers for the Data Catalog; I cannot seem to find the destination.

Am I missing anything from this code?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pandas as pd

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)


# Read one sheet of the Excel file from S3 with pandas
# (reading an s3:// path with pandas requires s3fs; .xlsx needs openpyxl)
excel_path = r"s3://input/employee.xlsx"
df_xl_op = pd.read_excel(excel_path, sheet_name="Sheet1")
# Cast every cell to string so Spark can infer a uniform schema
df = df_xl_op.applymap(str)
# Convert the pandas DataFrame to a Spark DataFrame and print the schema
input_df = spark.createDataFrame(df)
input_df.printSchema()

job.commit()
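
Since the file has multiple sheets, here is a minimal sketch of reading every sheet in one pass, assuming the same excel_path as above (passing sheet_name=None makes pandas return a dict of DataFrames keyed by sheet name; the loop variable names are illustrative):

# Read all sheets at once: sheet_name=None returns {sheet_name: DataFrame}
all_sheets = pd.read_excel(excel_path, sheet_name=None)
for sheet, pdf in all_sheets.items():
    # Same string cast as above so Spark infers a uniform schema
    sdf = spark.createDataFrame(pdf.applymap(str))
    print(sheet)
    sdf.printSchema()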
  • You have to write/persist input_df to S3, then create a Glue crawler with that S3 path as input. Once the crawler has run, you should see the table created in the Glue Data Catalog. – Prabhakar Reddy Jul 08 '22 at 04:44
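
A minimal sketch of the commenter's suggestion; the output path s3://output/employee/ is a placeholder, not from the original, and in the job above this write would go before job.commit():

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back to a Glue DynamicFrame
dyf = DynamicFrame.fromDF(input_df, glueContext, "employee_dyf")

# Persist it to S3 as Parquet; a crawler pointed at this prefix can then
# create the table in the Glue Data Catalog
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://output/employee/"},
    format="parquet",
)

After the job has written the files, create a crawler with that S3 prefix as the data source and run it; the table should then appear in the catalog database you chose for the crawler.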

0 Answers