I am extracting text using python from a textfile created from pdf using pdftotext. It is one of 2000 files and in this particular one, a line of keywords ends in EU. The remainder of the line is blank to the naked eye and so is the following line.
The program normally strips off any trailing blanks at the end of a line and ignores the subsequent blank line.
In this instance, it is saving the whitespace which is seen when it is printed out in at textfile between "EU. " and similarly in html (Simile Exhibit).
I also printed to the command line and here I see a string of aacute. [?]
I thought the obvious way to deal with this was to search and replace the accute. I've tried to do that with a compile statement and I've played with permutations of decoding the incoming text.
Oddly though, when I print "\255" I don't get an aacute, I get an o grave.
It seems likely with this odd combination of errors that I have misunderstood something fundamental. Any tips of how to begin unravelling this?
Many thanks.