There are two sources of data, and there is no way to connect them to each other:
- AWS, under a different subscription account (one S3 bucket containing two different folders, X and Y)
- Databricks, under a different subscription ID (one table lives here)
I downloaded 2000 JSON files (more than 80 GB of data) from the two S3 folders with the AWS CLI, then uploaded them into two different tables in DBFS.
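For context, this is roughly how I pulled and read the data back; the bucket and table names below are placeholders, not the real ones:

# Files were pulled locally with the AWS CLI, along the lines of:
#   aws s3 cp s3://<bucket>/X ./X --recursive
# then uploaded to DBFS and registered as tables.
# Reading one table back (placeholder name):
df = spark.table('json_x')
df.printSchema()  # prints the nested schema shown below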
With the data loaded into a Spark DataFrame, I cannot do EDA on or run queries against the nested fields directly. The schema of the table I created from the S3 objects looks like this:
Field 1: string
Field 2: array
    element: struct
        field2.1: string
        field2.2: string
Field 3: array
    element: struct
        field3.1: string
        field3.2: string
Field 4: array
    element: struct
        field4.1: string
        field4.2: array
            element: string
I want to do EDA on some of the subfields. This is what I have tried so far:
import pyspark.sql.functions as F

# explode one array column, then select the struct's subfields
df.select(F.explode('Field 2').alias('x')).select('x.*')
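That works for one level, but Field 4 nests a second array inside the struct. A minimal sketch of the flattening I am after, using the placeholder names from the schema above (backticks are needed wherever a name literally contains a dot):

import pyspark.sql.functions as F

# Field 2 / Field 3: a single explode turns array<struct> into one row per element
flat2 = (
    df.select(F.explode('Field 2').alias('f2'))
      .select('f2.*')  # promotes field2.1 and field2.2 to top-level columns
)
# example EDA: value counts of one subfield
flat2.groupBy(F.col('`field2.1`')).count().orderBy(F.desc('count')).show()

# Field 4 nests an array<string> inside the struct, so it takes two explodes
flat4 = (
    df.select(F.explode('Field 4').alias('f4'))
      .select(F.col('f4.`field4.1`').alias('field4_1'),
              F.explode('f4.`field4.2`').alias('field4_2'))
)
flat4.show(5, truncate=False)

If some of the arrays can be empty or null, F.explode_outer keeps those rows instead of dropping them.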