
I have a small AWS EC2 instance with 2GB of memory. I'm trying to convert a .pptx file to PDF using unoconv and LibreOffice. The code works on my local machine, but when I deploy it to AWS it only converts files smaller than 20MB and dies when the file is larger. I'm trying to find a way to convert in chunks so that the server doesn't have to read the whole file at once. NOTE that this endpoint is being called with Axios, so the response must be sent in a way that Axios can stream and save using fs.createWriteStream.

Here is my code:

import atexit
import os
import secrets
import subprocess

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

chunk_size = 4096  # Adjust this chunk size as needed

@app.route('/convert-pptx', methods=['POST'])
def convert_pptx_file():
    try:
        uploaded_file = request.files['document']
        file_path = f'uploaded_{secrets.token_hex(8)}.pptx'
        new_name = file_path.replace(".pptx", ".pdf")

        uploaded_file.save(file_path)

        # Use subprocess.Popen for streaming conversion
        command = ['unoconv', '--format=pdf', file_path]
        with subprocess.Popen(command, stdout=subprocess.PIPE, bufsize=chunk_size) as process:
            def generate_pdf_from_pptx():
                # Read the converter's stdout in fixed-size chunks and
                # yield them so Flask can stream the response
                while True:
                    chunk = process.stdout.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk

            response = Response(
                generate_pdf_from_pptx(),
                content_type='application/pdf',
                headers={'Content-Disposition': f'attachment; filename={new_name}'}
            )

        # Clean up
        os.remove(file_path)
        atexit.register(lambda: os.remove(new_name))

        return response

    except Exception as e:
        return jsonify({'error': str(e)}), 500

I'm expecting to be able to convert files larger than 20MB, but the program hangs, and when I stop it I get this error:

uno.RuntimeException: Binary URP bridge disposed during call
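
For reference, here is a minimal sketch of the response side I'm aiming for: let unoconv finish writing the PDF to disk (it writes the output next to the input file by default), then stream that file back in small chunks and clean up only after the response has been fully sent. The /convert-pptx-v2 route name is just for illustration, and this does not address the memory use of the conversion step itself:

import os
import secrets
import subprocess

from flask import Flask, Response, jsonify, request

app = Flask(__name__)
CHUNK_SIZE = 4096

@app.route('/convert-pptx-v2', methods=['POST'])
def convert_pptx_file_v2():
    uploaded_file = request.files['document']
    pptx_path = f'uploaded_{secrets.token_hex(8)}.pptx'
    pdf_path = pptx_path.replace('.pptx', '.pdf')
    uploaded_file.save(pptx_path)

    try:
        # Block until the conversion is done; unoconv writes the PDF
        # next to the input file by default
        subprocess.run(['unoconv', '--format=pdf', pptx_path], check=True)
    except subprocess.CalledProcessError as e:
        os.remove(pptx_path)
        return jsonify({'error': str(e)}), 500

    def stream_and_clean():
        # Read the finished PDF from disk in small chunks so Flask can
        # stream it to the client (Axios can pipe the response into
        # fs.createWriteStream on the other end)
        try:
            with open(pdf_path, 'rb') as f:
                while True:
                    chunk = f.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    yield chunk
        finally:
            # Delete the files only after the response is fully streamed
            os.remove(pptx_path)
            os.remove(pdf_path)

    return Response(
        stream_and_clean(),
        content_type='application/pdf',
        headers={'Content-Disposition': f'attachment; filename={pdf_path}'}
    )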
  • It looks like currently, the code converts the entire file in one go using `unoconv`, which would explain the memory problem. Possibly it would work to first split the pptx into a separate file for each slide (if each slide is guaranteed to be less than 20MB) by unzipping the pptx file and carefully modifying the XML files. To avoid memory overhead, an improvement could probably be made by using `soffice --convert-to` directly instead of the `unoconv` wrapper (see the sketch after these comments). Then use a tool that requires less memory to merge the PDF pages. – Jim K Aug 31 '23 at 23:50
  • I guess with a popular tag like `python`, some people feel it's okay to downvote questions without explaining how to correct them. Doesn't make sense to me though. – Jim K Aug 31 '23 at 23:51
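
To make the suggestion above concrete, here is a minimal sketch of calling soffice directly and merging the per-slide PDFs afterwards. The soffice flags are standard LibreOffice CLI options; the function names, the timeout value, and the use of the pypdf package for the merge step are assumptions, not part of the original question:

import subprocess

from pypdf import PdfWriter  # assumption: pypdf is installed for merging

def convert_with_soffice(pptx_path, out_dir='.'):
    # --headless runs LibreOffice without a GUI; --convert-to pdf writes
    # <basename>.pdf into --outdir, bypassing the unoconv/UNO bridge
    subprocess.run(
        ['soffice', '--headless', '--convert-to', 'pdf',
         '--outdir', out_dir, pptx_path],
        check=True,
        timeout=600,  # assumption: fail fast instead of hanging indefinitely
    )

def merge_pdfs(pdf_paths, out_path):
    # Append each per-slide PDF in order and write a single merged file
    writer = PdfWriter()
    for path in pdf_paths:
        writer.append(path)
    with open(out_path, 'wb') as f:
        writer.write(f)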

0 Answers