17

I'm currently taking the Introduction to Spark course on edX. Is there a way to save DataFrames from Databricks to my local computer?

I'm asking because the course provides Databricks notebooks which probably won't work after the course ends.

In the notebook, the data is imported with:

import os
log_file_path = 'dbfs:/' + os.path.join('databricks-datasets', 'cs100', 'lab2', 'data-001', 'apache.access.log.PROJECT')

I found this solution, but it doesn't work:

df.select('year','model').write.format('com.databricks.spark.csv').save('newcars.csv')

Josh Rosen
Tom Becker

3 Answers

46

Databricks runs on a cloud VM and has no idea where your local machine is located. If you want to save the CSV results of a DataFrame, you can run display(df), and the rendered results include an option to download them.

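As a minimal sketch (assuming df is the DataFrame from the question, e.g. its year/model selection), a notebook cell like this renders the table with that download option:

# Render the DataFrame in the notebook output; the rendered table in Databricks
# includes a download button for exporting the displayed rows as CSV.
display(df.select('year', 'model'))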

MrChristine
  • Thanks for sharing this, MrChristine. I tried so many coding solutions to get my df downloaded, and this is the only thing that actually worked for me. But it seems you can see, and download, only 1000 rows. How can I download ALL rows? – ASH Oct 13 '19 at 13:28
  • @ASH Click on "Download full results"; the command will re-run, and once execution completes you can download the full output. – MathGeek Aug 24 '20 at 11:49
  • I'm getting an error doing that on Databricks Community, but I can download the preview (1000 rows). – Jose Macedo Oct 27 '20 at 13:29
  • @MrChristine Is there a way to automate this download? – noswear Nov 03 '22 at 05:41
18

You can also save it to the FileStore and download it via its handle, e.g.

df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save("dbfs:/FileStore/df/df.csv")

You can find the handle in the Databricks GUI by going to Data > Add Data > DBFS > FileStore > your_subdirectory > part-00000-...

The download URL in this case (for a Databricks West Europe instance) is:

https://westeurope.azuredatabricks.net/files/df/df.csv/part-00000-tid-437462250085757671-965891ca-ac1f-4789-85b0-akq7bc6a8780-3597-1-c000.csv

I haven't tested it, but I would assume the one-million-row limit you hit when downloading via the approach in @MrChristine's answer does not apply here.
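If you'd rather not click through the GUI to find the part file, a rough sketch along these lines can build the download link (dbutils is available in Databricks notebooks; the workspace host below is just the placeholder from the example above, substitute your own instance):

# List the FileStore directory written above and pick out the part file Spark produced.
part_name = [f.name for f in dbutils.fs.ls("dbfs:/FileStore/df/df.csv")
             if f.name.startswith("part-")][0]

# Files under /FileStore are served at <workspace-url>/files/..., so the
# download link can be assembled like this.
print("https://westeurope.azuredatabricks.net/files/df/df.csv/" + part_name)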

Triamus
2

Try this.

df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")

This will save the file to the local Unix filesystem of the server Spark is running on.

If you give only /home/yphani/datacsv, Spark looks for the path on HDFS.
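To make the distinction concrete, here is a sketch using the answer's example path (note that on Databricks, file:// writes to the cluster node's local disk, not to your own machine):

# With the file:// scheme, the CSV is written to the local filesystem of the server.
df.write.format("com.databricks.spark.csv").save("file:///home/yphani/datacsv")

# Without a scheme, the same path is resolved against the default filesystem (e.g. HDFS).
df.write.format("com.databricks.spark.csv").save("/home/yphani/datacsv")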

yoga