There are two sources of data, and there is no way to connect them to each other:
- AWS, under a different subscription account (one S3 bucket containing two different folders, X and Y)
- Databricks, under a different subscription ID (one table lives here)
I downloaded 2000 JSON files (more than 80 GB of data) from the two S3 folders with the AWS CLI, then uploaded them into two different tables in DBFS.
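For context, this is roughly how I pulled and read the data back; the bucket and table names below are placeholders, not the real ones:

# Files were pulled locally with the AWS CLI, along the lines of:
#   aws s3 cp s3://<bucket>/X ./X --recursive
# then uploaded to DBFS and registered as tables.
# Reading one table back (placeholder name):
df = spark.table('json_x')
df.printSchema()  # prints the nested schema shown below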
With the data loaded into a Spark DataFrame, I cannot do EDA on or run queries against the nested fields directly. The schema of the table I created from the S3 objects looks like this:
Field 1: string
Field 2: array
    element: struct
        field2.1: string
        field2.2: string
Field 3: array
    element: struct
        field3.1: string
        field3.2: string
Field 4: array
    element: struct
        field4.1: string
        field4.2: array
            element: string
I want to do EDA on some of the subfields. This is what I have tried so far:
import pyspark.sql.functions as F

# explode one array column, then select the struct's subfields
df.select(F.explode('Field 2').alias('x')).select('x.*')
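That works for one level, but Field 4 nests a second array inside the struct. A minimal sketch of the flattening I am after, using the placeholder names from the schema above (backticks are needed wherever a name literally contains a dot):

import pyspark.sql.functions as F

# Field 2 / Field 3: a single explode turns array<struct> into one row per element
flat2 = (
    df.select(F.explode('Field 2').alias('f2'))
      .select('f2.*')  # promotes field2.1 and field2.2 to top-level columns
)
# example EDA: value counts of one subfield
flat2.groupBy(F.col('`field2.1`')).count().orderBy(F.desc('count')).show()

# Field 4 nests an array<string> inside the struct, so it takes two explodes
flat4 = (
    df.select(F.explode('Field 4').alias('f4'))
      .select(F.col('f4.`field4.1`').alias('field4_1'),
              F.explode('f4.`field4.2`').alias('field4_2'))
)
flat4.show(5, truncate=False)

If some of the arrays can be empty or null, F.explode_outer keeps those rows instead of dropping them.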