I have several different datasets in a data lake (in JSON format). This is landing data from an ingestion process.
I am using a PySpark notebook to load the data from Landing to Staging, where it will be stored as Parquet files. Part of this process is to ensure the datatypes are correct.
I want to load a predetermined schema for each dataset in PySpark, so that I can use the notebook for more than one dataset (parameterised).
I want to be able to create a "schema file" on the lake, load it into a schema object in PySpark, and then load the dataframe from the files on the lake using that schema object.
# dataSchema = LoadFromFile(varSchema)  # pseudocode: this is the step I'm missing
df = spark.read.load(varLanding, format='json', schema=dataSchema)
display(df.limit(5))
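
To make the intent concrete, here is a minimal sketch of the round trip I'm after. It assumes the schema file holds the JSON string produced by df.schema.json(); load_schema_from_file and varStaging are hypothetical names standing in for the LoadFromFile placeholder and the Staging output path:

import json
from pyspark.sql.types import StructType

def load_schema_from_file(schema_path):
    # Hypothetical helper: read the schema file from the lake as one string.
    # wholeTextFiles yields (path, content) pairs; take the body of the file.
    schema_json = spark.sparkContext.wholeTextFiles(schema_path).first()[1]
    # Rebuild the StructType from its JSON representation.
    return StructType.fromJson(json.loads(schema_json))

dataSchema = load_schema_from_file(varSchema)
df = spark.read.load(varLanding, format='json', schema=dataSchema)
display(df.limit(5))

# Write the typed frame to Staging as Parquet (varStaging is a hypothetical parameter).
df.write.mode('overwrite').parquet(varStaging)

The schema file itself could be generated once per dataset, e.g. by inferring from a sample with spark.read.json(...) and persisting df.schema.json() to the lake. A DDL string such as 'id INT, name STRING' would also work, since spark.read.schema(...) accepts DDL text directly.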