
I have managed to read in all the files and create the tables manually based on the JSON schema. However, I am unsure how to do this dynamically, i.e. so that if the JSON files change, the values are read in automatically based on the JSON schema.

Manually:

df = spark.read.json("path/*millionsofjson.json")

Reviewed nested schema

df.printSchema()

Reading in Metadata Table

df_table1 = df.select("metadata.*")

df_table1_select = df_table1.select("column1", "column2", ..., "column20")

df_table1_select.show()

Reading in Orders Table

df_table2 = df.select("orders.*")

df_table2_select = df_table2.select("column1", "column2", ..., "column50")

df_table2_select.show()

Reading in Sales Table

df_table3 = df.select("sales.*")

df_table3_select = df_table3.select("column1", "column2", ..., "column35")

df_table3_select.show()

Hopefully, this explains what I am after...
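For illustration, here is a minimal sketch of the kind of schema-driven selection being asked about; the top-level struct names (metadata, orders, sales), the loop, and the tables dictionary are assumptions rather than anything from the post:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Read every JSON file matching the pattern; Spark infers the nested schema.
df = spark.read.json("path/*millionsofjson.json")

# Walk the inferred schema and build one flattened DataFrame per top-level
# struct (e.g. metadata, orders, sales) without hard-coding the column lists.
tables = {}
for field in df.schema.fields:
    if isinstance(field.dataType, StructType):
        # "name.*" expands every nested column of that struct.
        tables[field.name] = df.select(field.name + ".*")

# Hypothetical usage, assuming a struct called "metadata" exists in the files:
# tables["metadata"].show()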

  • You have all files of various tables, mixed in one place, you create a single dataframe and then you create from it multiple tables, by picking the relevant fields for each table? – David דודו Markovitz Mar 14 '22 at 05:21
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Mar 15 '22 at 13:52

1 Answer


The following workaround may help you achieve the above requirement.

If all of your JSON files are stored in the same folder, you can read them all at once with the command below:

df = spark.read.json('<folder_path_where_all_of_your_files_are_stored>')

For more information, please refer to this SO thread (as suggested by @Camilo Soto) and this blog.
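As a rough sketch of that suggestion (the folder path is a placeholder, and the recursiveFileLookup option is an assumption that requires Spark 3.0 or later):

# Read every JSON file under the folder; recursiveFileLookup (Spark 3.0+)
# also picks up files sitting in subfolders.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .json("/path/to/folder_with_all_json_files"))

# Check the inferred schema before splitting the data into per-table DataFrames.
df.printSchema()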

AjayKumarGhose