Python UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c

Question

I am trying to open a file with PyMuPDF, do some edits, and then return it to the frontend.

Here is the code:

@app.post('/return_pdf')
async def return_pdf(uploaded_pdf: UploadFile):
    print("Filetype: ", type(uploaded_pdf)) # <class 'starlette.datastructures.UploadFile'>
    document =  fitz.open(stream=BytesIO(uploaded_pdf.file.read()))
    for page in document:
        for area in page.get_text('blocks'):
            box = fitz.Rect(area[:4])
            if not box.is_empty:
                page.add_rect_annot(box)
    
    return {'file': document.tobytes()}

The error I get is the following: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 702: invalid start byte

How can I solve this problem? Thanks in advance

Regarding reading the file, I tried several methods, but apparently BytesIO(uploaded_pdf.file.read()) was the only one accepted by PyMuPDF.

Regarding returning the file, I tried to return it directly, without converting it to bytes, but I got a similar error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 10: invalid continuation byte

I thought about changing the encoding and tried to pass it to fitz.open(), but it is not a parameter.

Harshavardhan · Accepted Answer · 2023-04-01T16:57:28.627

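The error comes from the return value, not from reading the file. When you return a dict, FastAPI serializes it as JSON, so the raw PDF bytes get decoded as UTF-8 text, and binary PDF data is generally not valid UTF-8. Return the PDF as a binary response instead, using a 'Response' object from the 'starlette.responses' module. ('FileResponse' expects a path to a file on disk, so it cannot take an in-memory buffer.)

from io import BytesIO

import fitz  # PyMuPDF
from starlette.responses import Response

# 'app' and 'UploadFile' are the same objects as in the question
@app.post('/return_pdf')
async def return_pdf(uploaded_pdf: UploadFile):
    # Read the upload into memory and tell PyMuPDF that it is a PDF
    document = fitz.open(stream=BytesIO(uploaded_pdf.file.read()), filetype="pdf")
    for page in document:
        for area in page.get_text('blocks'):
            box = fitz.Rect(area[:4])
            if not box.is_empty:
                page.add_rect_annot(box)

    # Save the edited PDF into an in-memory buffer
    output_pdf = BytesIO()
    document.save(output_pdf)
    document.close()

    # Send the raw bytes with a PDF content type; they are never decoded as text
    return Response(
        content=output_pdf.getvalue(),
        media_type="application/pdf",
        headers={"Content-Disposition": 'attachment; filename="edited.pdf"'},
    )

We create a 'BytesIO' object to hold the PDF, save the edited document into it with 'document.save()', and return the buffer contents with 'output_pdf.getvalue()' as a 'Response' whose media type is 'application/pdf'. The 'Content-Disposition' header suggests 'edited.pdf' as the download name.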
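
To see why putting PDF bytes into a JSON response fails, here is a minimal sketch of what happens when binary data is decoded as UTF-8. The helper name 'decode_error' and the byte strings are made up for illustration; they are not real PDF content, but they trigger the same two error messages quoted in the question.

```python
def decode_error(data: bytes):
    """Return the UnicodeDecodeError reason for data, or None if it is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# Made-up byte strings for illustration, not real PDF content
print(decode_error(b"%PDF-1.7"))  # None: plain ASCII is valid UTF-8
print(decode_error(b"\x9c"))      # invalid start byte
print(decode_error(b"\xc7A"))     # invalid continuation byte
```

A real PDF usually contains compressed streams full of such bytes, which is why the data has to be sent as a binary body rather than as text.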

I hope this might help you.
