How would I save a doc/docx/docm file into a directory or S3 bucket using PySpark?

Question

I am trying to save a DataFrame as a document, but the write fails with the error below:

java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html

My code is below:

       #f_data is my dataframe with data
       f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
       display(f_data)

Note that I can save files in CSV, text and JSON format, but is there any way to save a docx file using PySpark?

My question: is there any support for saving data in doc/docx format?

If not, is there any way to store the file, for example by writing a file stream object into a particular folder or S3 bucket?

score 0 · Answer 1 · answered Dec 03 '22 at 13:51

In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. as a pandas DataFrame) and work from there.

Long answer: A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale, and it does not ship a DOCX data source.

If you want to write DOCX files programmatically, you can:

Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Use a Python package to create the DOCX document and save it into a stream. See question: Writing a Python Pandas DataFrame to Word document
Upload the stream to an S3 object using, for example, boto: Can you upload to S3 using a stream rather than a local file?
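The steps above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe for your environment: it assumes the python-docx and boto3 packages are installed on the driver, that AWS credentials are already configured, and that the data is small enough to collect. The helper names build_docx_bytes and upload_bytes, the bucket "my-bucket" and the key "reports/test.docx" are hypothetical placeholders.

```python
import io


def build_docx_bytes(columns, rows):
    """Render a header row plus data rows as a one-table DOCX, returned as bytes."""
    # Imported here so the rest of the sketch loads without python-docx installed.
    from docx import Document  # pip install python-docx

    doc = Document()
    table = doc.add_table(rows=1, cols=len(columns))
    # Fill the header row with the column names.
    for cell, name in zip(table.rows[0].cells, columns):
        cell.text = str(name)
    # Append one table row per data row.
    for row in rows:
        for cell, value in zip(table.add_row().cells, row):
            cell.text = str(value)

    buffer = io.BytesIO()
    doc.save(buffer)  # python-docx accepts a file-like object here
    return buffer.getvalue()


def upload_bytes(s3_client, data, bucket, key):
    """Upload in-memory bytes to S3 without writing a local file."""
    s3_client.upload_fileobj(io.BytesIO(data), bucket, key)


# Usage on the driver (bucket and key are hypothetical placeholders):
#   import boto3
#   pdf = f_data.toPandas()
#   data = build_docx_bytes(list(pdf.columns),
#                           pdf.itertuples(index=False, name=None))
#   upload_bytes(boto3.client("s3"), data, "my-bucket", "reports/test.docx")
```

Everything here runs on the driver only; Spark is used just to produce the data, so the DOCX is built in one process rather than in parallel.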

Note: if your data has more than one hundred rows, ask the recipients how they are going to use it. Use DOCX for reporting, not as a file transfer format.

Thanks for your response @Emer. Due to some restrictions from the tool, I need to use only PySpark code to upload the file stream into the S3 bucket. I have a file stream object; is there any way to upload it to S3 using PySpark? I was expecting to write the DOCX file stream object to S3 with something like f_data = f_data.writeStream() - Ramesh Bathini, Dec 04 '22 at 01:29
