Unfamiliar format in pdf difference array

Question

I'm trying to decode a PDF to extract its text, but I'm having trouble with the Differences arrays. The Differences array I extract from the document I'm working with comes in this format:

'BaseEncoding': 'WinAnsiEncoding', 'Differences': [1, 'g39', 'g38', 'g51', ';#23#23#23', ';#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23#23', 'g40', 'g79', 'g72', 'g70', 'g87', 'g85', 'g82', 'g81', 'g76', 'g54'...]

I've found explanations for how to use the other formats of differences tables such as:

/Differences [ 24 /breve/caron/circumflex/dotaccent/hungarumlaut/ogonek/ring/tilde 39 /quotesingle 96 /grave 128 /bullet/dagger/daggerdbl/ellipsis... ]

In that format the number tells you which character code the glyph names that follow are assigned to, but I can't find an explanation of how to use the first type of Differences array.

Edit: Here's the file

Please share the pdf in which you found the first syntax. Because it is clearly not pdf syntax. — mkl, Mar 16 '19 at 10:36
Technically you have to use the ToUnicode cmap to extract the text in the first situation. If it doesn't exist you can simply cut the 'g' in front and use the number as character code but I can't guarantee that the results are valid. This is not standard, it is just a hack. — Mihai Iancu, Mar 17 '19 at 06:16
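The fallback described in the comment above can be sketched roughly as follows. This is only an illustration of that hack, not a standard technique: the function name `glyph_name_to_number` is made up here, and the number it returns is not guaranteed to be a usable character code. When a ToUnicode CMap exists, use that instead.

```python
import re

def glyph_name_to_number(name):
    """Hypothetical helper: strip a leading 'g' and return the number.

    Returns None for names that are not of the form g<digits>,
    such as ';#23#23#23' in the array from the question.
    """
    match = re.fullmatch(r"g(\d+)", name)
    if match is None:
        return None
    return int(match.group(1))

print(glyph_name_to_number("g39"))         # 39
print(glyph_name_to_number(";#23#23#23"))  # None
```

Whether the resulting number corresponds to a meaningful character depends entirely on how the producing tool named its glyphs, so treat the output as a guess to be checked against the rendered page.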

score 0 · Answer 1 · answered Mar 15 '19 at 21:52

Section 9.6.6, "Character Encoding", of ISO 32000-1:2008 (the PDF 1.7 specification) describes the Differences key of an /Encoding dictionary as:

An array describing the differences from the encoding specified by BaseEncoding or, if BaseEncoding is absent, from an implicit base encoding. The Differences array is described in subsequent sub-clauses.

In this case it's specifying the differences from WinAnsiEncoding.
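Both arrays in the question follow the same layout defined by the specification: each integer is the character code for the next name, and every further name takes the next consecutive code until another integer appears. A minimal sketch of that expansion is below. It assumes the array has already been parsed into Python ints and strings, as in the question's output, and the function name `expand_differences` is made up for illustration.

```python
def expand_differences(differences):
    """Expand a parsed /Differences array into {character code: glyph name}."""
    mapping = {}
    code = None
    for item in differences:
        if isinstance(item, int):
            # An integer sets the code for the next glyph name.
            code = item
        else:
            if code is None:
                raise ValueError("glyph name appears before any character code")
            mapping[code] = item
            code += 1  # following names take consecutive codes
    return mapping

print(expand_differences([1, "g39", "g38", "g51"]))
# {1: 'g39', 2: 'g38', 3: 'g51'}
print(expand_differences([39, "quotesingle", 96, "grave"]))
# {39: 'quotesingle', 96: 'grave'}
```

Codes that do not appear in the resulting mapping keep their meaning from the base encoding, here WinAnsiEncoding. The mapping only gives glyph names; turning names such as g39 into text is a separate step.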

JosephA


The doc you linked gives examples that match the second array and explains how to interpret it. I may have missed it, but I couldn't find any examples or explanations that match the format of the first array in my post. Could you explain how I would go about interpreting the first array? – GriffithN Mar 15 '19 at 22:53
