The hex escapes that you don't like aren't hyphens (ASCII 45) and single quotes (ASCII 39). They are en dashes – (Unicode 2013) and right single (“smart”) quotation marks ’ (Unicode 2019), encoded as UTF-8. If you want to decode them, treat the string that contains them as bytes, not a string (note the b prefix):
>>> mystring = b"This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
>>> mystring.decode('UTF8')
'This is an en–dash and this - isn’t'
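If Python thinks the data is already a string, as below, with no b prefix, then you need to convince it that it is really bytes, and decode the result. Encoding as Latin-1 does that, because it maps each character from U+0000 to U+00FF to the single byte with the same value: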
>>> mystring = "This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
>>> mystring.encode("latin-1").decode("UTF-8")
'This is an en–dash and this - isn’t'
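If the same repair has to run over text that may or may not be damaged, the round trip above can be wrapped so that it fails safely. This is only a sketch: the name repair_mojibake is made up for illustration, and it assumes the damage is exactly "UTF-8 bytes read as Latin-1" and nothing else.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-read as Latin-1 characters.

    Hypothetical helper: returns the text unchanged if it does not
    fit that pattern.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (so it was never
        # raw bytes in disguise), or the bytes are not valid UTF-8.
        return text


broken = "This is an en\xe2\x80\x93dash and this - isn\xe2\x80\x99t"
print(repair_mojibake(broken))
print(repair_mojibake("already fine: \u2013"))
```

Text that is already correct and contains characters beyond U+00FF comes back untouched, because the Latin-1 encode step fails. A genuine Latin-1 string whose bytes happen to form valid UTF-8 would still be altered, so this is best applied only to data you know went through the wrong decoder.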
In the font that SO uses there is not much obvious difference between ASCII 45 and Unicode 2013, but in general the en dash is wider than the ASCII hyphen. The distinction between the other two is fairly clear in isn’t. It's common to find the Unicode variants in .pdf files because they are intended to be printed. The ASCII variants are really only appropriate in program code and emulations of old typewriters, not printed books and magazines.
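When the glyphs are too similar to tell apart by eye, you can ask Python which character you actually have. A small sketch using the standard unicodedata module (the helper name describe is just for illustration):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# ASCII hyphen, typographic hyphen, en dash, ASCII apostrophe, smart quote
for ch in "-\u2010\u2013'\u2019":
    print(describe(ch))
```

This prints one line per character, for example U+002D HYPHEN-MINUS for the ASCII hyphen and U+2013 EN DASH for the character in the question.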