
I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:

%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'

So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:

%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON

This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:

%python
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'

%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON

This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.) Is there a way to tell COPY INTO to ignore the initial json files? There's the modifiedAfter option, but that would be cumbersome and doesn't lend itself to idempotent re-runs. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but this seems inefficient, since it reads the initial json files all over again. Edit: This method also duplicates the initial load.
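
For concreteness, the append-based variant I considered looks roughly like this, using the same paths as above (it re-reads all the json files and, as per the edit, it also duplicates the initial load):

%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').mode('append').save('/mnt/bucket-name/delta')

%sql
REFRESH TABLE default.table_name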

Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
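
To illustrate what I have in mind, something along the lines of a CTAS directly over the json path would do it in pure SQL, but as far as I can tell this still reads all the files and leaves the duplication problem with COPY INTO in place, so treat it only as a sketch of the direction I mean:

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
AS SELECT * FROM json.`/mnt/bucket-name/jsons`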

One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.

  • do you really need to read all JSON files to infer schema? Isn't it enough to read one file? – Alex Ott Apr 06 '22 at 13:06
  • You are correct. I would have to pick out one particular json file though, which would be fine for just one delta table, but the real example I'm working on involves numerous Delta tables, so I would have to dynamically pick one json file for each Delta table. – dwolfeu Apr 06 '22 at 13:25
  • it could be like `dbutils.fs.ls(path)[0].path` to find the path of that sample file. Even if the schema of the file is incomplete, you can then use the `mergeSchema` option to add new columns if there are more – Alex Ott Apr 06 '22 at 13:37
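
To make the suggestion from the comments concrete, a minimal sketch of the single-file schema inference (assuming dbutils.fs.ls returns at least one json file under the prefix and that its schema is representative; the mergeSchema copy option is meant only as a safety net in case the sample file is missing columns):

%python
# Infer the schema from one sample file instead of scanning the whole prefix
sample_path = dbutils.fs.ls('/mnt/bucket-name/jsons')[0].path
df_schema = spark.read.json(sample_path).schema

# Create an empty Delta table with that schema at the target location
spark.createDataFrame([], df_schema).write.format('delta').save('/mnt/bucket-name/delta')

%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'

%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
COPY_OPTIONS ('mergeSchema' = 'true')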
