How is the text from this pdf encoded?

Question

I have some PDFs with data about machine parts and I am trying to extract sizes. I extracted the text from a PDF via pypdfium2.

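import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("myfile.pdf")
page = pdf[1]  # index 1 is the second page
textpage = page.get_textpage()
text = textpage.get_text_range()  # extract the text of the whole page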

Most of the text is readable, but for some reason the important data is not readable when extracted. In the extracted string, the relevant part looks like this:

Readable text \r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15 readable text

I also tried tika and PyMuPDF. They only give me a question mark character for those parts.

I know the mangled part (\r\n\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 \x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15) should be 3,0 8,8 +0,058/0 5,0 4,0 4,5. My current idea is to make my own encoding table, but I wanted to ask if there is a better method and if this looks familiar to someone. I have about 52 files with around 200 occurrences each. While the PDFs are not confidential, I don't want to post links because they are not my intellectual property.

Update------------------------------

I tried to find out more about the fonts.

from pdfreader import PDFDocument

fd = open("myfile", "rb")
doc = PDFDocument(fd)
page = next(doc.pages())
font_keys = sorted(page.Resources.Font.keys())

# print subtype, base font name and encoding of every font on the page
for font_key in font_keys:
    font = page.Resources.Font[font_key]
    print(f"{font_key}: {font.Subtype}, {font.BaseFont}, {font.Encoding}")

gives:

R13: Type0, UHIIUQ+MetaPlusBold-Roman-Identity-H, Identity-H
R17: Type0, EWGLNL+MetaPlusBold-Caps-Identity-H, Identity-H
R20: Type1, NRVKIY+Meta-LightLF, {'Type': 'Encoding', 'BaseEncoding': 'WinAnsiEncoding', 'Differences': [33, 'agrave', 'degree', 39, 'quoteright', 177, 'endash']}
R24: Type0, IKRCND+MetaPlusBold-Italic-Identity-H, Identity-H

-Edit------ I am not interested in help translating it manually; I can do that myself. I am interested in a solution that works by script, for example a script that extracts the fonts with their code maps from the PDF and then uses those to translate the unreadable parts.

That does not look like a sane and well-defined encoding, no. Most contemporary encodings avoid using the character positions 0x00-0x1f which are control codes in ASCII. — tripleee, Nov 22 '22 at 15:45
To prevent copies (or just to make it more difficult), the document could use different characters, and define own fonts. So you will see words, but the encoded value doesn't make sense. — Giacomo Catenazzi, Nov 22 '22 at 16:25

K J · Answer 1 · 2022-11-22T16:47:43.523

This is a not uncommon CID CMap substitution, shown here in Python notation. It is usually specific to a single font with a random 6-letter ID prefix, e.g. UHIIUQ+Font name,
and is often found when subsetting fonts that have a limited range of characters.

should be 3,0 8,8 +0,058/0 5,0 4,0 4,5

\r\n = CR LF (Windows line ending, \x0d\x0a)
\x13 has been mapped to 3
\x0c has been mapped to ,
\x10 has been mapped to 0
  (literal space)
\x18 = 8
\x0c = ,
\x18 = 8
  (literal space)
\x0b has been mapped to +
\x10 = 0
\x0e has been mapped to , (very odd, see \x0c)
\x10 = 0
\x15 = 5
\x18 = 8
\x0f has been mapped to /
\x10 = 0
  (literal space)
\x15 = 5
\x0c = ,
\x10 = 0
  (literal space)
\x14 has been mapped to 4
\x0c = ,
\x10 = 0
  (literal space)
\x14 = 4
\x0c = ,
\x15 = 5

So the \x0# codes (low-order control codes) are used for punctuation
and the \x1# codes are used for digits.

It is unknown whether the \x2# codes are used for letters; the CMap table should be queried for the full details.

\x0e has been mapped to , (very odd see \x0c)
I suspect that, since it differs from \x0c, it should possibly be the decimal separator dot.
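
The mapping read off above can be turned into a small translation table. This is only a minimal sketch: the table covers just the codes seen in this one sample, and the names `OBSERVED` and `decode_mangled` are made up for illustration. As a side observation, every pair except \x0e matches chr(code + 0x20), and \x0e + 0x20 would be ".", which fits the suspicion about the decimal dot. That is an inference from this sample only; the font's CMap remains the authoritative source.

```python
# Codes observed in the sample, paired with the characters the asker expects.
# Incomplete on purpose: it covers only what appears in the question.
OBSERVED = {
    0x0B: "+",
    0x0C: ",",
    0x0E: ",",  # per the expected output; possibly "." instead (unconfirmed)
    0x0F: "/",
    0x10: "0",
    0x13: "3",
    0x14: "4",
    0x15: "5",
    0x18: "8",
}


def decode_mangled(text, table=OBSERVED):
    """Replace the known control codes and leave everything else untouched."""
    return text.translate(table)


sample = ("\x13\x0c\x10 \x18\x0c\x18 \x0b\x10\x0e\x10\x15\x18\x0f\x10 "
          "\x15\x0c\x10 \x14\x0c\x10 \x14\x0c\x15")
print(decode_mangled(sample))  # 3,0 8,8 +0,058/0 5,0 4,0 4,5
```

Codes that are not in the table, such as \r and \n, pass through unchanged, so readable text around the mangled part is not affected.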

Fonts in PDFs may or may not have CMAPs, and if a CMAP exists, it may be incomplete - either by some error or on purpose. Extract the CMAP via PyMuPDF's low-level code: determine font xref, from that determine CMAP xref (PDF key "/ToUnicode"), then extract the CMAP's decompressed stream. Post me for details. — Jorj McKie, Nov 26 '22 at 11:35
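
A rough sketch of the steps named in this comment, assuming `doc` is an already opened PyMuPDF document. The function name `tounicode_sources` is hypothetical. It also assumes that the xref reported by `get_page_fonts` is the font dictionary that carries the /ToUnicode key, which may not hold for every PDF.

```python
def tounicode_sources(doc, page_number=0):
    """Collect (resource name, base font, CMap source) for fonts with /ToUnicode."""
    found = []
    # Each entry starts with (xref, ext, type, basefont, name, encoding).
    for font in doc.get_page_fonts(page_number):
        xref, basefont, name = font[0], font[3], font[4]
        kind, value = doc.xref_get_key(xref, "ToUnicode")
        if kind != "xref":  # the key is missing or is not an indirect reference
            continue
        cmap_xref = int(value.split()[0])  # value looks like 'nnn 0 R'
        raw = doc.xref_stream(cmap_xref)  # decompressed stream bytes, or None
        if raw is None:
            continue
        # CMap sources are normally plain ASCII; latin-1 never fails to decode.
        found.append((name, basefont, raw.decode("latin-1")))
    return found


# Usage, assuming PyMuPDF is installed and "some.pdf" exists:
#   import fitz
#   doc = fitz.open("some.pdf")
#   for name, basefont, source in tounicode_sources(doc, 0):
#       print(name, basefont, len(source))
```

Fonts that do not show up in the result have no ToUnicode CMap on that dictionary, in which case a hand-made table may be the only option.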

score 1 · Answer 2 · answered Nov 26 '22 at 11:47

Here is example code to get the source of a font's CMAP with PyMuPDF:

import fitz
doc = fitz.open("some.pdf")
# assume that we know a font's xref already
# extract the xref of its CMAP:
cmap_xref = doc.xref_get_key(xref, "ToUnicode")[1]  # second string is 'nnn 0 R'
if cmap_xref.endswith("0 R"):  # check if a CMAP exists at all
    cxref = int(cmap_xref.split()[0])
else:
    raise ValueError("no CMAP found")
print(doc.xref_stream(cxref).decode())  # convert bytes to string

Example output:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CMapType 2 def
/CMapName/R63 def
1 begincodespacerange
<00><ff>
endcodespacerange
12 beginbfrange
<20><20><0020>
<2e><2e><002e>
<30><31><0030>
<43><46><0043>
<49><49><0049>
<4c><4d><004c>
<4f><50><004f>
<61><61><0061>
<63><69><0063>
<6b><70><006b>
<72><76><0072>
<78><79><0078>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end end
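
Once the CMap source is available as a string, the bfrange entries can be read into a lookup table. The following is a minimal sketch, not a full CMap parser: the name `parse_tounicode` is made up, it handles only the simple `<first><last><destination>` range form and `<code> <destination>` bfchar pairs, and it assumes that destinations are plain UTF-16BE hex strings. The array form of bfrange is ignored.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_source):
    """Build {character code: Unicode string} from a ToUnicode CMap source."""
    mapping = {}
    # bfchar entries: <code> <destination as UTF-16BE hex>
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_source, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entries: <first code> <last code> <destination of first code>
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_source, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


# Shortened excerpt of the output shown above.
sample = """
1 begincodespacerange
<00><ff>
endcodespacerange
3 beginbfrange
<20><20><0020>
<30><31><0030>
<43><46><0043>
endbfrange
"""
table = parse_tounicode(sample)
print(table)
```

Because the keys are integers, the result can be passed to `str.translate`. Whether that is meaningful for text already returned by an extractor depends on the extractor having passed the raw character codes through unchanged, which is an assumption here.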
