I have an RDD containing 50,000 JSON records, each of which I need to write out as its own file to a mounted directory in Spark (Databricks). The mount path looks something like /mnt/myblob/mydata (using Azure). I tried the following helper, but it turns out that I can't use dbutils inside a Spark job, i.e. in code that runs on the executors.
def write_json(output_path, json_data):
    # dbutils is only available on the driver, so this fails when called from executor code
    dbutils.fs.put(output_path, json_data)
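For context, this is roughly how I tried to call it from inside the job (a sketch; the records are dicts with 'path' and 'json' keys, as in the loop further down), and it fails because dbutils isn't available on the executors:

# Fails: dbutils cannot be used inside a Spark action running on the workers
my_rdd.foreach(lambda r: write_json(r['path'], r['json']))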
What I currently have to do instead is collect the data to the driver and then call the write_json method for each record.
# Bring everything to the driver and write the files one by one
records = my_rdd.collect()
for r in records:
    write_json(r['path'], r['json'])
This approach works, but writing 50,000 files sequentially from the driver takes forever to finish. Is there a faster way to do this?
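For example, I was wondering whether something like the following would work (a sketch I have not tested; it assumes the mount is also visible to the workers through the local /dbfs FUSE path and that each record's path starts with /mnt/...):

# Untested sketch: write from the executors with plain Python file I/O,
# going through the /dbfs FUSE mount instead of dbutils.
def write_partition(records):
    for r in records:
        local_path = '/dbfs' + r['path']  # e.g. /dbfs/mnt/myblob/mydata/...
        with open(local_path, 'w') as f:
            f.write(r['json'])

my_rdd.foreachPartition(write_partition)

Is this the right direction, or is there a better-supported way to write many small files in parallel from Databricks?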