
I have managed to read in all the files and create the tables manually based on the JSON schema. However, I am unsure how to do this dynamically, i.e. so that if the JSON files change, the values are read in automatically based on the JSON schema.

Manually:

df = spark.read.json("path/*millionsofjson.json")

Reviewed nested schema

df.printSchema()

Reading in Metadata Table

df_table1 = df.select("metadata.*")

df_table1_select = df_table1.select("column1", "column2", ..., "column20")

df_table1_select.show()

Reading in Orders Table

df_table2 = df.select("orders.*")

df_table2_select = df_table2.select("column1", "column2", ..., "column50")

df_table2_select.show()

Reading in Sales Table

df_table3 = df.select("sales.*")

df_table3_select = df_table3.select("column1", "column2", ..., "column35")

df_table3_select.show()

Hopefully, this explains what I am after...
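For illustration, here is a minimal sketch of the kind of schema-driven selection being asked about; the top-level struct names (metadata, orders, sales), the loop, and the tables dictionary are assumptions rather than anything from the post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Read every JSON file matching the pattern; Spark infers the nested schema.
df = spark.read.json("path/*millionsofjson.json")

# Walk the inferred schema and build one flattened DataFrame per top-level
# struct (e.g. metadata, orders, sales) without hard-coding the column lists.
tables = {}
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        # "name.*" expands every nested column of that struct.
        tables[field.name] = df.select(field.name + ".*")

# Hypothetical usage, assuming a struct called "metadata" exists in the files:
# tables["metadata"].show()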

  • You have all files of various tables, mixed in one place, you create a single dataframe and then you create from it multiple tables, by picking the relevant fields for each table? – David דודו Markovitz Mar 14 '22 at 05:21
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Mar 15 '22 at 13:52

1 Answer


The following workaround may help you achieve the above requirement.

If all of your JSON files are stored in the same folder, you can read them all at once with the command below:

df = spark.read.json('<folder_path_where_all_of_your_files_are_stored>')

For more information, please refer to this SO thread (as suggested by @Camilo Soto) and this blog.
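As a rough sketch of that suggestion (the folder path is a placeholder, and the recursiveFileLookup option is an assumption that requires Spark 3.0 or later):

# Read every JSON file under the folder; recursiveFileLookup (Spark 3.0+)
# also picks up files sitting in subfolders.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .json("/path/to/folder_with_all_json_files"))

# Check the inferred schema before splitting the data into per-table DataFrames.
df.printSchema()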

AjayKumarGhose