Dealing with encoding inconsistencies/cleaning hidden characters from webpage

Question

I scraped the link below and I want to process the text for further analysis using Python. The segment at issue is "kwa vimada wake". I want to end up with text that corresponds to the way it is intended to display (and does display in my browser), as "kwa vimada wake". However, there are hidden characters around "vimada", which you can see if you copy the text and paste it into a program like Notepad++. These interfere with my tokenization and NLP processing (the POS tagger doesn't recognize the word, for example) and do not stay consistent between my script and other programs: after running machine learning and loading the results back into my script, I end up with vimadaÃ, which it can't match with vimada�.

The webpage seems to be using UTF-8 encoding and my files are saved with UTF-8 encoding. If I could solve this issue and eliminate any strange/hidden characters, I would have no issues with consistency across files or using it as input into NLP tools.

My script starts with the declaration # -*- coding: utf-8 -*-

I would prefer to work with the text I've already downloaded because security changes to the site have made re-scraping it impractical. My database has it saved as "kwa âvimadaâ wake". The begin/end characters display in Notepad++ as three characters each: [â][PAD][SOS] and [â][PAD][SGCI].
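For reference, the three-character pattern described above (â followed by the C1 control characters PAD and SOS, or PAD and SGCI) is what the UTF-8 bytes of the curly quotes U+2018 and U+2019 look like when they are decoded as Latin-1. A minimal sketch of reversing that on the stored text, assuming the whole string went through exactly one such wrong decode; `repair_mojibake` and `stored` are hypothetical names, and the sample string is reconstructed from the description rather than copied from the database:

```python
# -*- coding: utf-8 -*-

def repair_mojibake(text):
    """Undo a 'UTF-8 bytes decoded as Latin-1' mix-up.

    If the text does not fit that pattern (for example it is already
    correct), it is returned unchanged.
    """
    try:
        # Latin-1 maps code points 0-255 straight back to the original bytes,
        # which can then be decoded with the encoding they were written in.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Assumed reconstruction of the stored value: U+00E2 U+0080 U+0098 and
# U+00E2 U+0080 U+0099 around the word.
stored = u"kwa \u00e2\u0080\u0098vimada\u00e2\u0080\u0099 wake"
repaired = repair_mojibake(stored)
print(repaired == u"kwa \u2018vimada\u2019 wake")
```

This works on whole strings only: a string that mixes damaged and correctly decoded characters fails the round trip and comes back unchanged, so it may need to be applied per field or per segment.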

I want to remove Unicode whitespace and hidden characters, and convert all variants of punctuation (apostrophes, quotation marks, hyphens, etc.) into their ASCII equivalents. I would prefer to keep accented characters as they are. However, not all accented characters are currently being interpreted correctly: some are encoded incorrectly, and some were changed on the website, presumably due to software changes, and show up as HTML character entities (for example, an entity in place of é). So simply deleting a class of characters won't clean the data properly. I'm using Python 2.7.
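The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions: `clean_text` and `PUNCT_TO_ASCII` are hypothetical names, the punctuation table is a small hand-picked sample rather than a complete list, and "hidden characters" is taken to mean the Unicode control (Cc) and format (Cf) categories. The import fallback is there so the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape

# Sample mapping of punctuation variants to ASCII (not exhaustive).
PUNCT_TO_ASCII = {
    0x2018: u"'", 0x2019: u"'", 0x201A: u"'", 0x201B: u"'",
    0x201C: u'"', 0x201D: u'"', 0x201E: u'"', 0x201F: u'"',
    0x2010: u"-", 0x2011: u"-", 0x2012: u"-",
    0x2013: u"-", 0x2014: u"-", 0x2212: u"-",
    0x2026: u"...",
}

def clean_text(text):
    """Unescape HTML entities, ASCII-fy punctuation, drop hidden characters."""
    text = unescape(text)                  # entities become real characters
    text = text.translate(PUNCT_TO_ASCII)  # curly quotes, dashes -> ASCII
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        if ch in u"\n\t":
            kept.append(ch)                # keep ordinary line breaks and tabs
        elif category.startswith("Z"):
            kept.append(u" ")              # any Unicode space -> plain space
        elif category in ("Cc", "Cf"):
            continue                       # control / zero-width characters
        else:
            kept.append(ch)                # letters, including accented ones
    return re.sub(u" {2,}", u" ", u"".join(kept)).strip()

print(clean_text(u"kwa \u2018vimada\u2019 wake") == u"kwa 'vimada' wake")
print(clean_text(u"caf&eacute;\u00a0\u200b\u2013 ok") == u"caf\u00e9 - ok")
```

Accented letters are left untouched because nothing here decomposes or strips them; only the listed punctuation, whitespace, and invisible characters are rewritten. Text that is still mis-decoded needs to be repaired before this step, since the table only matches the real punctuation code points.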

http://www.jamiiforums.com/threads/rais-dhaifu-ccm-uchaguzi-2015.459292/#post-6461865

Provide the relevant part of `print(ascii(open(filename, encoding='utf-8').read()))` (Python 3). Which characters do you want to remove (Unicode whitespace, everything except letters, everything except ASCII letters, etc.)? Mention your Python version. — jfs, May 21 '16 at 02:38
I want to remove Unicode whitespace and hidden characters, and convert all variants of punctuation (apostrophes, quotation marks, hyphens, etc.) into their ASCII equivalents. I would prefer to keep accented characters as they are. However, not all accented characters are currently being interpreted correctly: some are encoded incorrectly, and some were changed on the website, presumably due to software changes, and show up as HTML character entities (for example, an entity in place of é). So simply deleting a class of characters won't clean the data properly. I'm using Python 2.7. — Rouzbeh, May 22 '16 at 20:41
Sure, the question now fully describes the features I want. Take a look at it. — Rouzbeh, May 23 '16 at 23:10

0 Answers