I have a PDF embedded in another PDF. I've tried several ways of extracting it, but whatever I save turns out to be the original PDF (still containing the embedded one). I only want the embedded PDF.
I'm open to doing it in another programming language; the only condition is that it isn't done manually (opening each PDF, clicking on the attachment, then saving it).
I've tried all the Stack Overflow answers I could find. They mostly run, but I get the same problem when saving.
Note: the original PDF consists of only one page and one embedded attachment. I haven't been able to use any approaches that work on pages, because I get errors.
This one does report that I have an embedded file (same problem when saving):
import fitz # PyMuPDF
doc = fitz.open(filepath_PDF) # open the PDF
count = doc.embfile_count()
print("number of embedded files:", count)
if count > 0:
    buff = doc.embfile_get(0)  # bytes of the first embedded file
    with open("extracteddd.pdf", "wb") as fout:
        fout.write(buff)
    print("Extracted PDF saved.")
else:
    print("No embedded files found.")
With this one, "file" seems to be binary data (and I get the same problem when saving):
import PyPDF2 as pf
pdf = pf.PdfReader(filepath_PDF)
catalog = pdf.trailer['/Root']
fDetail = catalog['/Names']['/EmbeddedFiles']['/Names']
soup = fDetail[1].get_object()
file = soup['/EF']['/F'].get_data()
with open("testss.pdf", "wb") as fout:  # open output file
    fout.write(file)
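To check whether the saved file really is a byte-for-byte copy of the original, rather than just looking the same in a viewer, a small comparison along these lines could help. This is only a minimal sketch: `describe_extraction` is a hypothetical helper name, not part of PyMuPDF or PyPDF2, and the byte strings in the demo are made-up placeholders standing in for the real file contents.

```python
import hashlib


def describe_extraction(original: bytes, extracted: bytes) -> dict:
    """Compare extracted attachment bytes with the bytes of the container PDF."""
    return {
        # A PDF normally carries the "%PDF-" marker near the very start of the file.
        "looks_like_pdf": b"%PDF-" in extracted[:1024],
        "same_size": len(original) == len(extracted),
        # Identical digests mean the output is an exact copy of the container.
        "identical": hashlib.sha256(original).digest()
        == hashlib.sha256(extracted).digest(),
    }


# Placeholder data for illustration only; in practice `original` would be the
# bytes read from filepath_PDF and `extracted` the bytes returned by the library.
container = b"%PDF-1.7\ncontainer placeholder"
attachment = b"%PDF-1.4\nattachment placeholder"
print(describe_extraction(container, attachment))
```

If "identical" comes back False while the saved file still opens like the original, the extracted bytes are a different file, which would suggest the attachment itself is another container rather than the extraction code returning the wrong data.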