
I've been working with .xls files for almost 2 years and I finally decided to clear up my thoughts on this topic. Let me explain my scenario: I work with many sources (databases, APIs, ...), but I also have to read almost 20 Excel files from the business areas to plot in Power BI, and all of the ETL is done within the AWS cloud.

Should I have a .json file (or a hard-coded mapping) that sets the datatype of each column for each .xls file? I mean, some of the sheets have almost 80 columns, so I was wondering whether people keep the datatype of each column stored somewhere, or just read the file with Spark/Pandas and leave the schema the way the framework inferred it, without touching any column (which is what I've been doing so far).
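For concreteness, this is roughly what I mean by the JSON option: a small schema file per workbook that maps column names to dtypes, applied explicitly at read time instead of trusting inference. Just a minimal sketch, assuming pandas; the file names and columns here are made up:

```python
import json
import pandas as pd

# sales_schema.json (hypothetical) maps column names to pandas dtypes, e.g.
# {"region": "string", "units_sold": "Int64", "revenue": "float64"}
with open("sales_schema.json") as f:
    dtypes = json.load(f)

# Passing dtype= pins the declared types instead of relying on inference,
# so the schema comes out the same on every run regardless of the sheet's contents.
df = pd.read_excel("sales.xlsx", dtype=dtypes)
```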

My second question is how to handle business-user errors, like someone typing text into a number column. I work with AWS Glue and Lambda:

  1. A Lambda reads the .xls from SharePoint and transforms it into a .parquet/.csv file.
  2. A Glue job then reads the file and creates the tables to be read by Athena.

In the meantime, if someone puts a string in a numeric column in one of the sheets, the Lambda will generate a .parquet file in which the previously numeric column is now varchar, and the views created in Athena will fail because of it. Any ideas on these two questions?
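One idea I've been toying with for the second problem is coercing each declared numeric column inside the Lambda before writing the parquet, so a stray string surfaces as a null (or fails immediately) instead of silently flipping the whole column to varchar. Again just a sketch, assuming pandas; the column names are hypothetical:

```python
import pandas as pd

def coerce_numeric(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Force declared numeric columns to numbers; bad cells become NaN
    instead of silently turning the whole column into strings."""
    for col in numeric_cols:
        # errors="coerce" maps unparseable cells to NaN;
        # use errors="raise" instead to fail fast and alert the sheet's owner.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df

df = pd.read_excel("sales.xlsx")                    # raw read, inferred types
df = coerce_numeric(df, ["units_sold", "revenue"])  # hypothetical columns
df.to_parquet("sales.parquet", index=False)         # stable schema for Glue/Athena
```

That way the parquet schema stays stable and the Athena views keep working; whether to null out the bad cells or fail the whole load is the judgment call I'm unsure about.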
