
I have a DataFrame like:

input_df = self.spark.createDataFrame(
    data=[
        ("01", "file_name_1"),
        ("02", "file_name_2"),
        ("05", "file_name_5"),
    ],
    schema="RECORD_ID: string, FILE_NAME: string",
)

I have a folder /mnt/data/project/integration_test/ with the following files:

file_name_1.json
file_name_2.json
file_name_3.json
file_name_4.json

I want to update the JSON files whose names appear in input_df.

I thought the process would be:

  • Delete each JSON file whose name appears in input_df
  • Save each row of input_df as an individual JSON file (I have already solved this part; a rough sketch follows this list)
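For reference, step 2 can look something like this (a minimal sketch: it collects the rows to the driver and writes one JSON object per file; on Databricks the local file API may need a /dbfs prefix for /mnt paths):

import json
import os

OUTPUT_DIR = "/mnt/data/project/integration_test/"

# Collect the small DataFrame to the driver and write each row
# as its own JSON file, named after the FILE_NAME column.
for row in input_df.collect():
    out_path = os.path.join(OUTPUT_DIR, row["FILE_NAME"] + ".json")
    with open(out_path, "w") as f:
        json.dump(row.asDict(), f)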

The final files in /mnt/data/project/integration_test/ would then be:

file_name_1.json (updated)
file_name_2.json (updated)
file_name_3.json 
file_name_4.json
file_name_5.json (created new)

1 Answer


I think we can convert to Pandas, then iterate over the file names and check whether each file exists.

import pandas as pd
import os

BASE_DIR = "/mnt/data/project/integration_test/"

input_df = spark.createDataFrame(
    data=[
        ("01", "test"),
        ("02", "test1"),
        ("05", "test2"),
    ],
    schema="RECORD_ID: string, FILE_NAME: string",
)
display(input_df)

pandas_df = input_df.toPandas()
for index, row in pandas_df.iterrows():
    path = os.path.join(BASE_DIR, row['FILE_NAME'] + ".json")
    print(row['FILE_NAME'])
    if os.path.isfile(path):
        print("File exists")
        # Update the existing file
    else:
        print("File does not exist")
        # Create a new file
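
To fill in those placeholders, one option is to overwrite the file in both branches with the standard json module, since opening a file in write mode both replaces an existing file and creates a missing one. A sketch, assuming the folder from the question and a one-object-per-row layout:

import json
import os

BASE_DIR = "/mnt/data/project/integration_test/"

for index, row in pandas_df.iterrows():
    path = os.path.join(BASE_DIR, row['FILE_NAME'] + ".json")
    # open(..., "w") truncates an existing file and creates a new one,
    # so the same call handles both the exists and not-exists cases
    with open(path, "w") as f:
        json.dump(row.to_dict(), f)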