Writing a Python pdfrw PdfReader object to an array of bytes / filestream

Question

I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.

So, here's the issue. I'm not interested in writing the edited PDF to a file. The idea is that file opening and closing is going to most likely be handled by external code and so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.

Let me specify what I mean by this. I currently have a piece of code like this:

import pdfrw

class FormFiller:

    def __fill_pdf__(self, input_pdf_filestream : bytes, data_dict : dict):
        template_pdf : pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper : FieldMapper):
        value_mapping : dict = mapper.getValues()
        filled_pdf : pdfrw.PdfReader = self.__fill_pdf__(self.filestream, value_mapping)
        #<this point is crucial>

    def __init__(self, filestream : bytes):
        self.filestream : bytes = filestream

So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.

How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?

Please help :-(

What happens if you create an empty `io.BytesIO` object and then `write(new_bytes_object)`? — MattDMo, Aug 30 '21 at 14:04
Would you like me to write an "official" answer that explains the background a little bit? — MattDMo, Aug 30 '21 at 14:17
Yes, please. I'll mark it as the correct one. Maybe it'll help out someone else in the future — Maciej Witkowski, Aug 30 '21 at 14:19

MattDMo · Accepted Answer · 2022-10-26T14:42:09.627

To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:

from io import BytesIO

new_bytes_object = BytesIO()

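Then call write() on a pdfrw.PdfWriter instance, passing the empty BytesIO object where a filename would normally go. write() accepts an already-open file-like object as well as a path:

pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# Note the parentheses after PdfWriter: write() is an instance method.
# I haven't used this lib much, so check the call against your installed pdfrw version.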

This works because io.BytesIO is a file-like object: it exposes the same write(), read(), and seek() methods as a file on disk, but keeps its data in memory. It and related classes such as io.StringIO can stand in for the object f returned by the built-in function open() below:

with open("output.txt", "a") as f:
    f.write(some_data)
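To make the file-like idea concrete, here is a minimal sketch using only the standard library. The helper name write_report and the payload are hypothetical, made up for illustration. The point is that the same function can write to a BytesIO buffer or to a real file without knowing which one it was given.

```python
import io
import os
import tempfile


def write_report(stream, payload: bytes) -> None:
    """Write payload to any binary file-like object (hypothetical helper)."""
    stream.write(payload)


payload = b"%PDF-1.3 pretend content"  # placeholder bytes, not a valid PDF

# Target 1: an in-memory buffer, nothing touches the disk.
buffer = io.BytesIO()
write_report(buffer, payload)
in_memory = buffer.getvalue()

# Target 2: a real file on disk, written through the same function.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.bin")
    with open(path, "wb") as f:
        write_report(f, payload)
    with open(path, "rb") as f:
        on_disk = f.read()
```

Both targets end up holding identical bytes, which is why a writer that only needs a write() method does not care whether it is handed a file or a buffer.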

Before you read from new_bytes_object, don't forget to seek(0) back to the beginning (that is, rewind it). Writing leaves the stream position at the end of the data, so a read() from there returns nothing and the object appears to be empty.

new_bytes_object.seek(0)
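The effect of the stream position can be seen in a short standard-library sketch. The buffer contents are placeholder bytes chosen for illustration. It also shows getvalue(), which returns the whole buffer as bytes regardless of the current position and may be the more convenient call if the caller wants a plain bytes value.

```python
import io

buffer = io.BytesIO()
buffer.write(b"filled form bytes")  # the position is now at the end of the data

at_end = buffer.read()      # reads from the current position: nothing is left
buffer.seek(0)              # rewind to the start
from_start = buffer.read()  # now the full contents come back

whole = buffer.getvalue()   # full contents, independent of the position
```

If the BytesIO object itself is handed to other code that will call read(), rewind it first. If only the raw bytes are needed, getvalue() avoids the seek entirely.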
