
There are two data sources, and there is no way to connect them directly:

  1. An AWS account under a different subscription (one S3 bucket containing two folders, X and Y)
  2. A Databricks workspace under a different subscription ID (containing one table)

I downloaded 2000 JSON files (more than 80 GB of data, from the two folders in S3) with the AWS CLI and then uploaded them into two different tables in DBFS.

Now, when I load them into a Spark DataFrame, I cannot do EDA or run queries against it. The schema of the table I created from the S3 objects looks like this:

Field 1:string
Field 2:array
   element:struct
     field2.1:string
     field2.2:string
Field 3:array
   element:struct
     field3.1:string
     field3.2:string
Field 4:array
   element:struct
     field4.1:string
     field4.2:array
         element:string

I want to do EDA on some of the subfields. This is what I have tried so far:

import pyspark.sql.functions as F

# Explode one of the array columns, then select a subfield of the struct:
df.select(F.explode('field').alias('x')).select('x.field')