Correctly extract PDF within PDF - Python

Question

I have a PDF embedded in another PDF. I've tried multiple ways of extracting it, but when I save the result I get back the same original PDF (with the embedded one still inside). I only want the embedded PDF.

I'm open to doing it in another programming language; the only condition is that it isn't manual (no opening each PDF, clicking on the file, and saving it).

I've tried the answers I could find on Stack Overflow. They mostly run, but I get the same problem when saving.

Note: The original PDF consists of only one page and one embedded attachment. I haven't been able to use any options that involve pages because I get errors.

This one does report an embedded file (same problem when saving):

import fitz  # PyMuPDF

doc = fitz.open(filepath_PDF)  # open the PDF
count = doc.embfile_count()
print("number of embedded files:", count)

if count > 0:
    buff = doc.embfile_get(0)
    with open("extracteddd.pdf", "wb") as fout:
        fout.write(buff)
    print("Extracted PDF saved.")
else:
    print("No embedded files found.")

With this one, `file` seems to be binary data (and I get the same problem when saving):


import PyPDF2 as pf

pdf = pf.PdfReader(filepath_PDF)

# Walk the catalog to the embedded-files name tree
catalog = pdf.trailer['/Root']
fDetail = catalog['/Names']['/EmbeddedFiles']['/Names']
# The array alternates [name, filespec, ...]; index 1 is the first filespec
soup = fDetail[1].get_object()

file = soup['/EF']['/F'].get_data()

with open("testss.pdf", "wb") as fout:  # write the extracted bytes
    fout.write(file)

score 0 · Answer 1 · answered Aug 31 '23 at 01:54


This example works as expected and saves the right embedded file:

In [1]: import fitz
In [2]: import pathlib
In [3]: doc=fitz.open("test.pdf")
In [4]: doc.embfile_names()
Out[4]: ['file1.pdf', 'file2.pdf']
In [5]: buff = doc.embfile_get("file1.pdf")
In [6]: len(buff)
Out[6]: 6751735
In [7]: pathlib.Path("file1.pdf").write_bytes(buff)
Out[7]: 6751735
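If the saved file still looks like the original, it may help to extract every embedded file by name and compare each one against the container's bytes. The following is only a sketch: `save_embedded_files` and `extract_from_path` are hypothetical helper names, not part of PyMuPDF, and the only library calls assumed are `embfile_names()` and `embfile_get()` from the session above. An extracted file whose hash matches the container would suggest that the container itself was embedded, rather than an extraction bug.

```python
import hashlib
from pathlib import Path


def save_embedded_files(doc, container_bytes, out_dir):
    """Save every embedded file of `doc` into `out_dir` (hypothetical helper).

    `doc` is assumed to behave like a PyMuPDF document, i.e. to provide
    embfile_names() and embfile_get(name).
    Returns a list of (name, size, looks_like_pdf, same_as_container) tuples.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    container_digest = hashlib.sha256(container_bytes).hexdigest()

    report = []
    for name in doc.embfile_names():
        data = bytes(doc.embfile_get(name))
        # Use only the last path component so a stored name cannot leave out_dir
        target = out_dir / (Path(name).name or "unnamed.bin")
        target.write_bytes(data)
        report.append((
            name,
            len(data),
            data.lstrip().startswith(b"%PDF-"),  # crude PDF header check
            hashlib.sha256(data).hexdigest() == container_digest,
        ))
    return report


def extract_from_path(pdf_path, out_dir="embedded_out"):
    """Open `pdf_path` with PyMuPDF and save its embedded files."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no hard dependency

    pdf_path = Path(pdf_path)
    doc = fitz.open(str(pdf_path))
    try:
        return save_embedded_files(doc, pdf_path.read_bytes(), out_dir)
    finally:
        doc.close()
```

Calling `extract_from_path("test.pdf")` would return one tuple per embedded file; the last field of each tuple indicates whether the extracted bytes are identical to the container file.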


Jorj McKie


I changed it to `import fitz; import pathlib; doc = fitz.open(filepath_PDF); doc.embfile_names(); buff = doc.embfile_get(0); len(buff); pathlib.Path('embedded.pdf').write_bytes(buff)` to make it work, but it still saves the original PDF with the embedded one :( I took a look at the output of `doc.embfile_get(0)` and, though I don't understand what it says, it looks like it contains both PDFs. I think the mistake happens because getting the embedded file also gets the original PDF. – ilia Aug 31 '23 at 17:10
Hm, @ilia, this seems to have no apparent explanation. Are you able to share the file? – Jorj McKie Sep 01 '23 at 19:31
