
I have a strange problem converting special characters from HTML. I have a Django project where text is stored HTML-encoded in a MySQL database. This is necessary because I don't want to lose any of the text's formatting.

As a preliminary step I have to run some operations on the text, such as calculating positions, so I first need to decode the entities and strip all HTML tags. This is done with BeautifulSoup:

convertedText = str(BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES))
convertedText = ''.join(BeautifulSoup(convertedText).findAll(text=True))

On Django's built-in development server everything works fine, but on my production server special characters are converted incorrectly.

An example:

Test server

MySQL-Query gives me: <p>bassverst&auml;rker</p>

is correctly converted to: bassverstärker

Production server

MySQL-Query gives me: <p>bassverst&auml;rker</p>

is wrongly converted to: bassverst\ucc44rker

Somehow the &auml; is converted into \ucc44, which is the wrong character.

My configuration:

Test server:

  • Django built-in development server (python manage.py runserver)
  • BeautifulSoup 3.2.1
  • Python 2.6.5
  • Ubuntu 2.6.32-43-generic

Production server:

  • Cherokee 1.2.101
  • BeautifulSoup 3.2.1
  • Python 2.7.3
  • Ubuntu 3.2.0-32-generic

Because I don't know at which level the error occurs, I would appreciate any help with this. Many thanks in advance.

Spacedman
noplacetoh1de

1 Answer


I found a way to fix this. I didn't know that BeautifulSoup has a built-in getText() method. When the HTML is converted with:

convertedText = BeautifulSoup(text.text, convertEntities=BeautifulSoup.HTML_ENTITIES).getText()

everything works fine on both servers. Although this works, it would be interesting to know why the two servers behave differently with the example from the question.
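One guess, which I have not confirmed on either server: in BeautifulSoup 3, str() on a soup returns UTF-8 encoded bytes, so the second BeautifulSoup(convertedText) call in the question has to guess the encoding of those bytes again. BeautifulSoup 3 uses chardet for that guess when it is installed, so the two servers could end up with different results. The sketch below does not use BeautifulSoup at all; it only shows that the UTF-8 bytes of "ä", decoded with a wrong encoding, give exactly the \ucc44 from the question. The choice of EUC-KR is my assumption for illustration, since I don't know which encoding was actually picked on the production server.

```python
# -*- coding: utf-8 -*-
# Hypothetical reproduction of the symptom, without BeautifulSoup.
original = u"bassverst\u00e4rker"

# What str(soup) would hand to the second parse: UTF-8 encoded bytes.
utf8_bytes = original.encode("utf-8")

# Assumption: the second parse guesses the wrong encoding.
# EUC-KR is chosen here only because it reproduces the observed character.
misdecoded = utf8_bytes.decode("euc_kr")

# Compare against the wrong result reported in the question.
print(misdecoded == u"bassverst\ucc44rker")
```

If that is what happens, getText() would avoid the problem simply because the text is never encoded to bytes and parsed a second time.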

However, thanks to all.

noplacetoh1de