I have managed to read in all the files and create the tables manually based on the JSON schema. However, I am unsure how to do this dynamically, i.e. if the JSON files change, the values should be read in automatically based on the JSON schema.
Manually:
df = spark.read.json("path/*millionsofjson.json")

Reviewed the nested schema:

df.printSchema()
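For example, instead of eyeballing printSchema, I assume the inferred schema can also be walked programmatically via df.schema. A sketch of what I mean (the field names are just whatever Spark infers from my files):

from pyspark.sql.types import StructType

# df.schema is a StructType; its fields are the top-level JSON keys
for field in df.schema.fields:
    print(field.name, field.dataType)
    if isinstance(field.dataType, StructType):
        # list the nested column names inside each struct
        print([f.name for f in field.dataType.fields])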
Reading in the Metadata table:

df_table1 = df.select("metadata")
df_table1_select = df_table1.select("metadata.column1", "metadata.column2", ..., "metadata.column20")
df_table1_select.show()
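For a single table, I'm guessing the dynamic equivalent is to expand the struct with a star select, so every nested column comes through without being named by hand (a sketch, assuming metadata is a struct column):

# "metadata.*" expands all nested fields of the struct into top-level columns
df_table1_dynamic = df.select("metadata.*")
df_table1_dynamic.show()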
Reading in the Orders table:

df_table2 = df.select("orders")
df_table2_select = df_table2.select("orders.column1", "orders.column2", ..., "orders.column50")
df_table2_select.show()
Reading in the Sales table:

df_table3 = df.select("sales")
df_table3_select = df_table3.select("sales.column1", "sales.column2", ..., "sales.column35")
df_table3_select.show()
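What I'm hoping for is one generic loop that builds every table from the inferred schema, so new columns (or even new tables) in the JSON are picked up automatically. A sketch of what I imagine, assuming each table corresponds to a top-level struct field (metadata, orders, sales):

from pyspark.sql.types import StructType

tables = {}
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        # expand every nested column of this struct into its own DataFrame
        tables[field.name] = df.select(f"{field.name}.*")

tables["metadata"].show()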
Hopefully, this explains what I am after...