How to decode output of LTTextLine.get_text()?

Question

I'm new to PDFMiner. I noticed that some symbols and punctuation marks are not output in literal form when I use PDFMiner's get_text() method. For example, "-" came out as "\xe2\x80\x93" and a single quote ' came out as "\xe2\x80\x99". Here's the command I used:

print(LTTextLine.get_text().encode('UTF-8'))

Can somebody help me understand how to read these and transform them back to their literal form?

Thanks.

I suspect that they weren't really hyphens `-` (ASCII 45) and single quotes `'` (ASCII 39) but instead en dashes `–` (Unicode 2013) and Unicode right single quotation marks `’` (Unicode 2019). — BoarGules, Jan 31 '19 at 15:48
looks like the \x?? are Python escape characters as in https://stackoverflow.com/questions/2672326/what-does-a-leading-x-mean-in-a-python-string-xaa but I still don't have a clue how I can properly encode these symbols. Tried all the utf* supported encodings in https://docs.python.org/3/library/codecs.html#standard-encodings to no avail. — muon3, Feb 01 '19 at 10:53

score 0 · Accepted Answer · answered Feb 01 '19 at 12:27

The hex escapes that you don't like aren't hyphens (ASCII 45) and single quotes (ASCII 39). They are en dashes – (Unicode 2013) and right single (“smart”) quotation marks ’ (Unicode 2019), encoded as UTF-8. If you want to decode them, treat the data that contains them as bytes, not a string (note the b prefix):

>>> mystring = b"This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
>>> mystring.decode('UTF8')
'This is an en–dash and this - isn’t'
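To connect this to the command in the question, here is a minimal sketch of the round trip. It assumes, as is the case in Python 3, that get_text() returns a str; the variable text is only a stand-in for that return value, so no PDF is needed to run it. Calling .encode('UTF-8') turns the str into bytes, and printing bytes shows every non-ASCII byte as a \x escape.

```python
# Stand-in for the str returned by get_text() (assumed sample text, not real PDF output).
text = "This is an en\u2013dash and this - isn\u2019t"

# str -> bytes: this is what .encode('UTF-8') in the question does.
encoded = text.encode("utf-8")
print(encoded)   # bytes repr: non-ASCII bytes appear as \xe2\x80\x93 and \xe2\x80\x99

# bytes -> str: decoding with the same codec gives the original text back.
decoded = encoded.decode("utf-8")
print(decoded)

# Printing the str directly, without .encode(), shows the literal characters.
print(text)
```

Under that assumption, dropping the .encode('UTF-8') call from the print statement should be enough to see the literal characters.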

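If Python thinks the data is already a string, as below, with no b prefix, then you need to convince it that it is really bytes, and decode the result. Encoding as Latin-1 does this because it maps each code point from 0 to 255 to the single byte with the same value: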

>>> mystring = "This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
>>> mystring.encode("latin-1").decode("UTF-8")
'This is an en–dash and this - isn’t'
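The Latin-1 round trip fails on text that is already correctly decoded, so it can help to wrap it. The following is only a sketch: fix_mojibake is a hypothetical helper name, not part of PDFMiner or the standard library, and it assumes the only damage is UTF-8 bytes that were read as Latin-1.

```python
def fix_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were mis-read as Latin-1.

    Returns the input unchanged if the round trip is not possible.
    """
    try:
        # Latin-1 maps code points 0-255 to the same byte values,
        # so this recovers the original UTF-8 byte sequence.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters above U+00FF (already decoded),
        # or the bytes are not valid UTF-8; leave it alone.
        return text


garbled = "This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
repaired = fix_mojibake(garbled)

# Text that is already correct passes through untouched.
already_fine = fix_mojibake("en\u2013dash")
```

Returning the input on failure is a design choice for this sketch; raising the error instead may be preferable when silent pass-through could hide a problem.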

In the font that SO uses there is not much obvious difference between ASCII 45 and Unicode 2013, but in general the en dash is somewhat longer than the ASCII hyphen. The distinction between the other two is fairly clear in isn’t. It's common to find the Unicode variants in .pdf files because they are intended to be printed. The ASCII variants are really only appropriate in program code and emulations of old typewriters, not printed books and magazines.
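When the glyphs look alike, the standard unicodedata module can report which character is actually present. The sketch below lists the look-alikes discussed here, then shows one possible way to map the typographic variants back to ASCII if the literal ASCII form is really what you want. ASCII_MAP and the sample string are illustrative assumptions, not anything PDFMiner provides.

```python
import unicodedata

# Print the code point and official Unicode name of each look-alike.
for ch in "-\u2010\u2013'\u2019":
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Optional: replace typographic variants with their ASCII counterparts.
# This mapping is an illustrative assumption; extend it as needed.
ASCII_MAP = str.maketrans({
    "\u2010": "-",   # HYPHEN
    "\u2013": "-",   # EN DASH
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
})

plain = "en\u2013dash isn\u2019t".translate(ASCII_MAP)
print(plain)
```

Note that this replacement is lossy: once the en dash has become a hyphen, the original typography cannot be recovered from the output.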
