
I have an RDD of 50,000 JSON files that I need to write to a mounted directory in Spark (Databricks). The mounted path looks something like /mnt/myblob/mydata (using Azure). I tried the following, but it turns out that I can't use dbutils inside a Spark job.

def write_json(output_path, json_data):
    dbutils.fs.put(output_path, json_data)

What I currently do instead is collect the data to the driver and then call write_json for each record.

records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])

This approach works, but takes forever to finish. Is there a faster way?

Jane Wayne

1 Answer


You can use foreach to perform this operation in parallel on the workers. (map alone won't do it: it is a lazy transformation, so nothing actually runs until an action is called.)

def write_json(output_path, json_data):
    with open(output_path, "w") as f:
        f.write(json_data)

# foreach is an action, so the writes execute on the executors
my_rdd.foreach(lambda r: write_json(r['path'], r['json']))
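
One caveat: plain open() on a worker only sees the local filesystem, so the mounted storage has to be reached through Databricks' local file API, which typically exposes DBFS mounts under /dbfs (e.g. /dbfs/mnt/myblob/mydata). Below is a minimal sketch along those lines; it assumes each record is a dict with a 'path' key holding a /mnt/... path and a 'json' key holding the serialized data, and it uses foreachPartition to cut per-record overhead. The write_partition name and the "/dbfs" prefixing are illustrative, not from the original post.

import os

def write_partition(records):
    # Runs on each executor; writes through the /dbfs FUSE mount,
    # so no dbutils is needed inside the Spark job.
    for r in records:
        local_path = "/dbfs" + r['path']  # e.g. /dbfs/mnt/myblob/mydata/...
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "w") as f:
            f.write(r['json'])

my_rdd.foreachPartition(write_partition)

If your paths already start with /dbfs, drop the prefixing. The key points are that an action (foreach or foreachPartition) forces the writes and that each task writes directly to the mounted storage instead of funneling all 50,000 files through the driver.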
D3V