PDFtotext - whitespace showing as aacute on commandline

Question

I am using Python to extract text from a text file that was created from a PDF with pdftotext. It is one of 2000 files, and in this particular one a line of keywords ends in "EU." The remainder of the line is blank to the naked eye, and so is the following line.

The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.

In this instance, it is keeping the whitespace, which is visible after "EU. " when the output is written to a text file, and similarly in HTML (Simile Exhibit).

I also printed to the command line and here I see a string of aacute characters.

I thought the obvious way to deal with this was to search and replace the aacute. I've tried to do that with a compile statement and I've played with permutations of decoding the incoming text.

Oddly though, when I print "\255" I don't get an aacute, I get an o grave.

It seems likely, given this odd combination of errors, that I have misunderstood something fundamental. Any tips on how to begin unravelling this?

Many thanks.

John Machin · Answer 1 · 2011-04-17T01:53:49.323

The first tip is not to print wildly to all possible output mechanisms using various unstated encodings. Find out exactly what you have got. Do this:

print repr(the_line_with_the_problem) # Python 2.x
print(ascii(the_line_with_the_problem)) # Python 3.x

and edit your question and copy/paste the result.
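
As an illustration of what that inspection can reveal, here is a minimal Python 3 sketch. The sample string and the helper `describe_non_ascii` are made up for this example; they are not the asker's data or code. It assumes the invisible characters are no-break spaces.

```python
import unicodedata

# Hypothetical sample: a keyword line ending in "EU." plus two no-break spaces.
sample_line = "keywords: EU.\u00a0\u00a0"


def describe_non_ascii(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# ascii() escapes anything outside ASCII, so invisible characters become visible.
print(ascii(sample_line))
for entry in describe_non_ascii(sample_line):
    print(entry)
```

Seeing the escaped form removes the guesswork about which character is actually in the file, independent of how any terminal chooses to render it.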

Second tip: When asking for help, give information about your environment:

What version of Python? What version of what operating system?

Also show locale-related info; the following example is from my computer running Python 2.7 in a Windows 7 Command Prompt window:

>>> import sys, locale
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_AU', 'cp1252')
>>>
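
For readers on Python 3, a rough equivalent of the session above might look like the sketch below. The function name `encoding_report` is hypothetical, and the values it prints depend entirely on the machine and on whether output is redirected.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python reports for this process."""
    return {
        # Default encoding for str/bytes conversions ('utf-8' on Python 3).
        "default": sys.getdefaultencoding(),
        # Encoding of standard output; it can change when output is redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding the locale suggests for text files.
        "preferred": locale.getpreferredencoding(False),
    }


for key, value in sorted(encoding_report().items()):
    print(key, "=", value)
```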

Third tip: Don't use your own jargon ... the concepts "Simile Exhibit", "printed to the command line", and "compile statement" need explanation.

What is the relevance of "\255"? Where did you get that from?

Wild guesses while waiting for some facts to emerge:

(1) The offending character is U+00A0 NO-BREAK SPACE aka NBSP which appears in your text as "\xA0" and when sent to stdout in a Western European locale on Windows using a Command Prompt window would be treated as being encoded in cp850 and thus appear as a-acute. How this could be transmogrified into o-grave is a mystery.

(2) "\255" == "\xAD" implies the offending character is U+00AD SOFT HYPHEN, but why this would be seen as o-grave is a mystery, and it's not "whitespace"; it shouldn't be shown at all, and if it is shown it should be as a hyphen/minus-sign, not a space.
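
The byte-level claims in these two guesses can be checked with a short Python 3 sketch. It is only an illustration under the assumption that guess (1) is right; `strip_nbsp` is a hypothetical helper, not the asker's code.

```python
import unicodedata

# Guess (1): the byte 0xA0 read as cp850 is a-acute, but as cp1252 it is NBSP.
nbsp_byte = b"\xa0"
as_cp850 = nbsp_byte.decode("cp850")
as_cp1252 = nbsp_byte.decode("cp1252")
print(ascii(as_cp850), unicodedata.name(as_cp850))
print(ascii(as_cp1252), unicodedata.name(as_cp1252))

# Guess (2): "\255" is an octal escape, so it is the same character as "\xad".
octal_escape = "\255"
print(ascii(octal_escape), unicodedata.name(octal_escape))


def strip_nbsp(line):
    """Replace no-break spaces with ordinary spaces, then drop trailing blanks."""
    return line.replace("\u00a0", " ").rstrip()


print(ascii(strip_nbsp("keywords: EU.\u00a0\u00a0")))
```

In Python 3, U+00A0 counts as whitespace for `str.rstrip()`, so the explicit replacement matters mainly for no-break spaces in the middle of a line or when working with undecoded bytes.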

Thanks heaps. You sorted my problem. I have compiled a search term for "\xA0" and then I can delete the offending NBSP. And it seems we can't use newline here without sending. This is uncomfortable! I've also learned how to display the offending code and learned that the encoding changes when I redirect output from the Command Prompt window to a text file and vice versa. Thanks very much. — jobucks, Apr 17 '11 at 12:10
@jobucks: Where are the facts? "compiled a search term" means what? re.compile()?? "can't use newline here without sending"?? Please explain. Also, what was all that about "\255" and o-grave? — John Machin, Apr 17 '11 at 22:10
