I have some PDFs with data about machine parts, and I am trying to extract sizes from them. I extracted the text from one PDF with pypdfium2:
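import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("myfile.pdf")
page = pdf[1]  # pages are 0-indexed, so this is the second page
textpage = page.get_textpage()
text = textpage.get_text_range()  # full page text (pypdfium2 4.x API)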
Most of the text is readable, but for some reason the important data comes out unreadable. In the extracted string the relevant part looks like this:
Readable text \r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15 readable text
I also tried tika and PyMuPDF. They only give me the question mark character for those parts.
I know the mangled part (\r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15) should be 3,0 8,8 +0,058/0 5,0 4,0 4,5.
My current idea is to build my own encoding table, but I wanted to ask whether there is a better method and whether this looks familiar to anyone.
I have about 52 files with around 200 occurrences each.
While the PDFs are not confidential, I don't want to post links because they are not my intellectual property.
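For reference, here is a minimal sketch of the encoding-table idea. It assumes every mangled character is simply the real character shifted down by 0x20. That offset is inferred from the single sample above, not read from the font, and `decode_shifted`, `SHIFT` and `KEEP` are names made up for illustration.

```python
# Assumption: mangled code + 0x20 == real character code.
# This is inferred from one sample only and may not hold for every font.
SHIFT = 0x20
# Characters treated as genuine whitespace and passed through unchanged.
KEEP = {" ", "\r", "\n"}


def decode_shifted(text: str, shift: int = SHIFT) -> str:
    """Shift control characters (below 0x20) up by `shift`; keep the rest."""
    out = []
    for ch in text:
        if ch in KEEP or ord(ch) >= 0x20:
            out.append(ch)
        else:
            out.append(chr(ord(ch) + shift))
    return "".join(out)


mangled = (
    "\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 "
    "\x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15"
)
print(decode_shifted(mangled))  # 3,0 8,8 +0.058/0 5,0 4,0 4,5
```

On this sample the shift reproduces the expected string except for one character: \x0e becomes "." where I expected ",". So the table would still need checking against the rendered page. Another caveat of this sketch: \r and \n are kept as line breaks, so a real "-" or "*" (which would also land on 0x0d and 0x0a under this shift) could not be told apart from them.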
Update------------------------------
I tried to find out more about the fonts.
from pdfreader import PDFDocument

# List every font on the first page with its subtype, base font, and encoding
fd = open("myfile", "rb")
doc = PDFDocument(fd)
page = next(doc.pages())
font_keys = sorted(page.Resources.Font.keys())
for font_key in font_keys:
    font = page.Resources.Font[font_key]
    print(f"{font_key}: {font.Subtype}, {font.BaseFont}, {font.Encoding}")
gives:
R13: Type0, UHIIUQ+MetaPlusBold-Roman-Identity-H, Identity-H
R17: Type0, EWGLNL+MetaPlusBold-Caps-Identity-H, Identity-H
R20: Type1, NRVKIY+Meta-LightLF, {'Type': 'Encoding', 'BaseEncoding': 'WinAnsiEncoding', 'Differences': [33, 'agrave', 'degree', 39, 'quoteright', 177, 'endash']}
R24: Type0, IKRCND+MetaPlusBold-Italic-Identity-H, Identity-H
-Edit------ I am not interested in help translating it manually; I can do that myself. I am interested in a solution that works by script, for example a script that extracts the fonts with their code maps from the PDF and then uses those to translate the unreadable parts.
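To make clear what kind of script I mean, here is a rough sketch that turns the text of a ToUnicode CMap into a lookup table and applies it. Several things are assumptions: that the affected font has a ToUnicode entry at all (it may not, which could be why three extractors fail), that the characters in the extracted string equal the font's character codes, and that the CMap uses only the simple `bfchar` and `<lo> <hi> <start>` `bfrange` forms. `SAMPLE_CMAP` is invented for illustration and is not taken from my files. Reading the real stream out of the PDF is left out.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"

# Invented sample, NOT from the real PDFs: maps 0x0B-0x0C and 0x10-0x19.
SAMPLE_CMAP = """
begincmap
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfchar
<000B> <002B>
<000C> <002C>
endbfchar
1 beginbfrange
<0010> <0019> <0030>
endbfrange
endcmap
"""


def parse_tounicode(cmap_text: str) -> dict:
    """Build {character code: text} from bfchar and simple bfrange sections."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: codes lo..hi map to consecutive code points from start.
    # Simplification: the array form and multi-unit destinations are ignored.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


def decode_codes(text: str, mapping: dict) -> str:
    """Replace each character whose code is in the table; keep the others."""
    return "".join(mapping.get(ord(ch), ch) for ch in text)


table = parse_tounicode(SAMPLE_CMAP)
print(decode_codes("\x13\x0c\x10 \x18\x0c\x18", table))  # 3,0 8,8
```

If the fonts turn out to have no usable ToUnicode data, a table like this would have to be built some other way, for example from the glyph names in the embedded font program.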