
I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a Spark job.

def write_json(output_path, json_data):
    dbutils.fs.put(output_path, json_data)

What I currently do instead is collect the data to the driver and then call write_json for each record.

records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])

This approach works, but takes forever to finish. Is there a faster way?

Jane Wayne

1 Answer


You can use foreach to perform this operation in parallel on the workers. (map alone won't do it: it is a lazy transformation, so nothing actually runs until an action is called.)

def write_json(output_path, json_data):
    with open(output_path, "w") as f:
        f.write(json_data)

# foreach is an action, so the writes execute on the executors
my_rdd.foreach(lambda r: write_json(r['path'], r['json']))
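
One caveat: plain open() on a worker only sees the local filesystem, so the mounted storage has to be reached through Databricks' local file API, which typically exposes DBFS mounts under /dbfs (e.g. /dbfs/mnt/myblob/mydata). Below is a minimal sketch along those lines; it assumes each record is a dict with a 'path' key holding a /mnt/... path and a 'json' key holding the serialized data, and it uses foreachPartition to cut per-record overhead. The write_partition name and the "/dbfs" prefixing are illustrative, not from the original post.

import os

def write_partition(records):
    # Runs on each executor; writes through the /dbfs FUSE mount,
    # so no dbutils is needed inside the Spark job.
    for r in records:
        local_path = "/dbfs" + r['path']  # e.g. /dbfs/mnt/myblob/mydata/...
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "w") as f:
            f.write(r['json'])

my_rdd.foreachPartition(write_partition)

If your paths already start with /dbfs, drop the prefixing. The key points are that an action (foreach or foreachPartition) forces the writes and that each task writes directly to the mounted storage instead of funneling all 50,000 files through the driver.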
D3V