PDF data stream with python

Question

Context: My code fetches a set of coordinates from some PNG documents and later redacts certain fields (it uses these coordinates to draw filled shapes over those areas).

I want my final output to be a PDF with each redacted image as a page. I can achieve this with the fpdf package without any problem.

However, I intend to send this PDF file as a (base64-encoded) email attachment. Is there any way to get the base64 string from the fpdf output?

On top of that, can I pass an image as a binary string to the fpdf image method?

See the redacted_pdf method below (I added some comments to make it clearer).

Code:

import io
import os

from fpdf import FPDF
from PIL import Image, ImageDraw


class Redaction:
    def __init__(self, png_image_list, df_coordinates):
        self.png_image_list = png_image_list  # PNG pages as bytes
        self.df_coordinates = df_coordinates  # polygon corners, relative to page size

    def _redact_images(self):
        redacted_images_bin = []
        for page_num, page_data in enumerate(self.png_image_list):
            im_page = Image.open(io.BytesIO(page_data))
            draw = ImageDraw.Draw(im_page)
            # page_number in the DataFrame is 1-based
            df_filtered = self.df_coordinates[self.df_coordinates['page_number'] == page_num + 1]
            for index, row in df_filtered.iterrows():
                # scale the relative coordinates to pixels
                x0 = row['x0'] * im_page.size[0]
                y0 = row['y0'] * im_page.size[1]
                x1 = row['x1'] * im_page.size[0]
                y1 = row['y1'] * im_page.size[1]
                x2 = row['x2'] * im_page.size[0]
                y2 = row['y2'] * im_page.size[1]
                x3 = row['x3'] * im_page.size[0]
                y3 = row['y3'] * im_page.size[1]
                coords = [x0, y0, x1, y1, x2, y2, x3, y3]
                draw.polygon(coords, outline='blue', fill='yellow')
            redacted_images_bin.append(im_page)

        return redacted_images_bin

    def redacted_pdf(self):
        redacted_images = self._redact_images()
        pdf = FPDF()
        pdf.set_auto_page_break(0)
        for index, img_redacted in enumerate(redacted_images):
            img_redacted.save(f"image_{index}.png")
            pdf.add_page()
            pdf.image(f"image_{index}.png", w=210, h=297)
            os.remove(f"image_{index}.png")  # I would like to avoid file handling!
        pdf.output("doc.pdf", "F")  # I would like to avoid file handling!
        # return pdf  # this is what I want, to return the pdf as base64 or binary

If you save to a file, you can read it back as `bytes` and use the standard `base64` module, i.e. `base64.b64encode()`. You can also use the standard `io.BytesIO()` to create the file in memory, so you don't have to save it to disk. In the same way, you can use `io.BytesIO()` to create an image file in memory and use it instead of a file on disk. Many functions can read from or write to an `io.BytesIO()` (a file-like object) instead of a `filename`. — furas, Feb 17 '22 at 20:02
On Stack Overflow you can find questions that show how to use `io.BytesIO` and `base64` to send an image (or other file) from `Flask` to HTML/JavaScript. — furas, Feb 17 '22 at 20:04
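
A minimal sketch of the idea in the comment above, using only the standard library. The bytes written to the buffer are a made-up placeholder for whatever a real writer would produce:

```python
import base64
import io

# In-memory "file": anything that accepts a binary file-like object can write here.
buffer = io.BytesIO()
buffer.write(b"%PDF-1.3 example bytes")  # placeholder data, not a real PDF

raw_bytes = buffer.getvalue()                           # the whole in-memory file
b64_text = base64.b64encode(raw_bytes).decode("ascii")  # base64 as a str

# Decoding the text gives the original bytes back.
round_trip = base64.b64decode(b64_text)
```

`getvalue()` returns the full contents regardless of the current read/write position, which is why no `seek(0)` is needed here.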

furas · Accepted Answer · 2022-02-18T13:28:51.140

In the documentation I found that you can get the PDF as a string using

pdf_string = pdf.output(dest='S')

so you can use the standard base64 module

import fpdf
import base64

pdf = fpdf.FPDF()

# ... add some elements ...

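pdf_string = pdf.output(dest='S')
# the string holds one character per byte, so encode with latin-1;
# utf-8 would turn every byte >= 0x80 into two bytes and corrupt the PDF
pdf_bytes  = pdf_string.encode('latin-1')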

base64_bytes  = base64.b64encode(pdf_bytes)
base64_string = base64_bytes.decode('utf-8')

print(base64_string)

Result:

JVBERi0xLjMKMyAwIG9iago8PC9UeXBlIC9QYWdlCi9QYXJlbnQgMSAwIFIKL1Jlc291cmNlcyAyIDAgUgovQ29udGVudHMgNCAwIFI+PgplbmRvYmoKNCAwIG9iago8PC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9MZW5ndGggMTk+PgpzdHJlYW0KeMKcM1LDsMOiMsOQMzVXKMOnAgALw7wCEgplbmRzdHJlYW0KZW5kb2JqCjEgMCBvYmoKPDwvVHlwZSAvUGFnZXMKL0tpZHMgWzMgMCBSIF0KL0NvdW50IDEKL01lZGlhQm94IFswIDAgNTk1LjI4IDg0MS44OV0KPj4KZW5kb2JqCjIgMCBvYmoKPDwKL1Byb2NTZXQgWy9QREYgL1RleHQgL0ltYWdlQiAvSW1hZ2VDIC9JbWFnZUldCi9Gb250IDw8Cj4+Ci9YT2JqZWN0IDw8Cj4+Cj4+CmVuZG9iago1IDAgb2JqCjw8Ci9Qcm9kdWNlciAoUHlGUERGIDEuNy4yIGh0dHA6Ly9weWZwZGYuZ29vZ2xlY29kZS5jb20vKQovQ3JlYXRpb25EYXRlIChEOjIwMjIwMjE3MjExMDE3KQo+PgplbmRvYmoKNiAwIG9iago8PAovVHlwZSAvQ2F0YWxvZwovUGFnZXMgMSAwIFIKL09wZW5BY3Rpb24gWzMgMCBSIC9GaXRIIG51bGxdCi9QYWdlTGF5b3V0IC9PbmVDb2x1bW4KPj4KZW5kb2JqCnhyZWYKMCA3CjAwMDAwMDAwMDAgNjU1MzUgZiAKMDAwMDAwMDE3NSAwMDAwMCBuIAowMDAwMDAwMjYyIDAwMDAwIG4gCjAwMDAwMDAwMDkgMDAwMDAgbiAKMDAwMDAwMDA4NyAwMDAwMCBuIAowMDAwMDAwMzU2IDAwMDAwIG4gCjAwMDAwMDA0NjUgMDAwMDAgbiAKdHJhaWxlcgo8PAovU2l6ZSA3Ci9Sb290IDYgMCBSCi9JbmZvIDUgMCBSCj4+CnN0YXJ0eHJlZgo1NjgKJSVFT0YK
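
One caveat about the string-to-bytes step, assuming the original fpdf under Python 3 returns a `str` holding one character per byte: only `latin-1` maps that string back to the same bytes. The sample output above shows the symptom of using `utf-8`: right after `stream` it contains `eMKc`, which decodes to the bytes 78 C2 9C rather than the usual zlib header 78 9C. The sketch below reproduces the effect with the standard library only; `zlib.compress` stands in for a compressed PDF stream:

```python
import zlib

# Binary data containing bytes >= 0x80, like a compressed PDF content stream.
original = zlib.compress(b"page content")

# Simulate a str that holds one character per byte.
as_text = original.decode("latin-1")

via_latin1 = as_text.encode("latin-1")  # exact round trip
via_utf8 = as_text.encode("utf-8")      # bytes >= 0x80 become two bytes each

print(via_latin1 == original)
print(via_utf8 == original)
print(len(original), len(via_utf8))
```

A PDF damaged this way may still open, but its compressed page content can no longer be decoded, which would match a page that shows up blank.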

As for image(): in the original fpdf it needs a filename (or URL), and it can't work with a string of image data or with io.BytesIO().

As a last resort, you could take the source code and change it yourself.

There is even a request for this on GitHub: Support for StringIO objects as images
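
If you have to stay with the original fpdf, one way to at least hide the file handling is a temporary directory that cleans up after itself. This is only a sketch: `add_png_bytes` is a hypothetical helper, and it assumes that `pdf.image()` reads the file during the call (the question's code removes the file right after calling `image()`, which suggests it does):

```python
import os
import tempfile


def add_png_bytes(pdf, png_bytes, **image_kwargs):
    """Write png_bytes to a short-lived file and pass its path to pdf.image()."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "page.png")
        with open(path, "wb") as f:
            f.write(png_bytes)
        # Assumption: image() reads the file now, so it can be deleted afterwards.
        pdf.image(path, **image_kwargs)
    # Leaving the with-block removes the directory and the file inside it.
```

It would be called like `add_png_bytes(pdf, png_bytes, w=210, h=297)` after `pdf.add_page()`.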

EDIT:

I found that there is a fork, fpdf2, whose image() accepts a Pillow image (PIL.Image.Image); see fpdf2 Image

In its source code I found that image() can also work with io.BytesIO().

Example code for fpdf2 (here output() returns bytes instead of a string):

import fpdf
import base64
from PIL import Image
import io

#print(fpdf.__version__)

pdf = fpdf.FPDF()

pdf.add_page()

# image from a filename
pdf.image('lenna.png')

# image from a URL
pdf.image('https://upload.wikimedia.org/wikipedia/en/7/7d/Lenna_%28test_image%29.png')

# image from an open binary file
f = open('lenna.png', 'rb')
pdf.image(f)

# image from a Pillow image
f = Image.open('lenna.png')
pdf.image(f)

# image from io.BytesIO
f = open('lenna.png', 'rb')
b = io.BytesIO(f.read())
pdf.image(b)

# save in file
pdf.output('output.pdf')

# get as bytes
pdf_bytes = pdf.output()

#print(pdf_bytes)

base64_bytes  = base64.b64encode(pdf_bytes)
base64_string = base64_bytes.decode('utf-8')

print(base64_string)

Wikipedia: Lenna

Test of writing the PDF to a file with fpdf2:

import fpdf

pdf = fpdf.FPDF()

pdf.add_page()
pdf.image('https://upload.wikimedia.org/wikipedia/en/7/7d/Lenna_%28test_image%29.png')

# --- test 1 ---

pdf.output('output-test-1.pdf')

# --- test 2 ---

pdf_bytes = pdf.output()

with open('output-test-2.pdf', 'wb') as f:  # it will close automatically
    f.write(pdf_bytes)

# --- test 3 ---

pdf_bytes = pdf.output()

f = open('output-test-3.pdf', 'wb')
f.write(pdf_bytes)
f.close()  # don't forget to close when you write
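
Since the goal is an email attachment, note that the standard `email` package can do the base64 step itself when it is given raw bytes. A minimal sketch, with placeholder addresses and a placeholder byte string standing in for the bytes returned by `pdf.output()`:

```python
from email.message import EmailMessage

# Placeholder for the real PDF bytes (hypothetical data).
pdf_bytes = b"%PDF-1.3\n% placeholder for the real document\n"

msg = EmailMessage()
msg["Subject"] = "Redacted document"
msg["From"] = "sender@example.com"     # placeholder address
msg["To"] = "recipient@example.com"    # placeholder address
msg.set_content("The redacted PDF is attached.")

# For bytes, add_attachment() uses base64 transfer encoding by default.
msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                   filename="doc.pdf")

# Inspect the attachment part that was just added.
attachment = next(msg.iter_attachments())
```

The message can then be handed to `smtplib.SMTP.send_message()`. Encoding manually with `base64.b64encode()` is only needed when the mail API you call expects the attachment as a base64 string.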

For some reason, doing `pdf_string = pdf.output(dest='S')` followed by `pdf_bytes = pdf_string.encode('utf-8')` does not return valid PDF binary data (I wrote it to a PDF file and opened it, and it was not the actual PDF but a blank one!) — Andoni, Feb 18 '22 at 12:55
First, you could use `print()` to see what the variables contain. You could also open the PDF in a text viewer/editor to see what ends up in the file. But if it opens without any errors, then it is a valid PDF, just without content. In the first example I don't add any pages/text/images, so it creates an empty PDF. — furas, Feb 18 '22 at 13:16
BTW: after installing `fpdf2` I no longer have access to the original `fpdf` (both use the same module name), so I can't test this problem. Besides, `fpdf2` seems more useful, so I would forget about `fpdf`. — furas, Feb 18 '22 at 13:19
I added code which writes the PDF in three ways, but I tested it only with `fpdf2`. — furas, Feb 18 '22 at 13:29
I will have a look at fpdf2 (I did all my tests with fpdf). Really appreciate your help here. As soon as I test it I will write here what I saw. Thank you! — Andoni, Feb 19 '22 at 10:22
