Encoding issue on a Django project on Linux connected to an SQL Server DB

Question

I have a Django (1.2.x) project that is designed to support German and English languages. The project is hosted on a Linux box, behind Apache (2.x) using mod_wsgi. The database is hosted on an SQL Server 2005 running on Windows on a separate box. The easysoft SQL Server ODBC driver is used to connect the project with the database.

I will use one model from one of the project's applications as an example. The model contains a TextField, which is mapped to an NVARCHAR(MAX) column in SQL Server. The database collation is set to "Latin1_General_CI_AS". The easysoft unixODBC data source is configured with ConvToUtf = 1, which converts data from UCS-2 to UTF-8 when returning it from the database to the application. (I mention UCS-2 because I have read that SQL Server stores Unicode data in UCS-2.)

However, when viewing data that is stored in the database through the admin panel, the German characters are transformed into weird symbols (this is visible both when viewing the data on the admin panel, as well as within APIs that return data in JSON format).

An example is the following German word: Geschäftsbedingungen. After it has been saved in the database, it comes out as: GeschÃ¤ftsbedingungen.

The Linux box runs Python 2.6. I am not sure what other information would help give more context on the problem.

I've tried a couple of things, to no avail. I am looking for any clues on how to go about fixing this problem. Any help will be greatly appreciated.

UPDATE: I've found that if the data is saved directly into the database table by editing the table through the SQL Management software, it displays fine on both the Django admin page and the API. This is puzzling. The strange characters appear only when the data is saved through the admin panel.

http://stackoverflow.com/questions/947077/using-pyodbc-on-linux-to-insert-unicode-or-utf-8-chars-in-a-nvarchar-mssql-field In my opinion pyodbc + FreeTDS is a better solution: it is free and often used by Django/Python developers. You only need to remember to set client charset = UTF-8 in freetds.conf – baklarz2048 Mar 01 '11 at 14:57
I forgot to mention that I am using `pyodbc` with Django. I disagree with your opinion about `FreeTDS` being a better solution. `FreeTDS` is severely limited. Due to the many limitations and problems we faced with `FreeTDS`, we had to purchase the commercial driver from easysoft. – ayaz Mar 01 '11 at 17:37

score 1 · Answer 1 · answered Mar 07 '11 at 11:20

The UTF-8 byte sequence for "ä" (Unicode code point U+00E4) is \xc3\xa4, which, read as two single-byte Latin1 characters, is "Ã¤".

That means that something, somewhere, thinks it's getting Latin1 encoding, but it's not.
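
The effect is easy to reproduce in isolation. This is a minimal sketch written for Python 3 (the project in the question runs Python 2.6, where the same steps apply to `unicode` objects); the variable names are illustrative only:

```python
# The word from the question, with "ä" written as an escape (U+00E4).
original = u"Gesch\u00e4ftsbedingungen"

# Encode the text as UTF-8: "ä" becomes the two bytes C3 A4.
utf8_bytes = original.encode("utf-8")

# Misread those bytes as Latin-1: each byte turns into its own character,
# so C3 becomes "Ã" (U+00C3) and A4 becomes "¤" (U+00A4).
garbled = utf8_bytes.decode("latin-1")

# garbled is now "GeschÃ¤ftsbedingungen", one character longer than original.
```

The garbled string matches what the admin panel shows, which is consistent with a UTF-8 to Latin1 mix-up somewhere between the form and the page, though it does not say where.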

One explanation is that your browser thinks it's displaying Latin1. I would check the Content-Type header you're getting from the web server, and see if it specifies a charset. Perhaps your DEFAULT_CHARSET setting in Django isn't set correctly.
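
To check the header, it is enough to look at the charset parameter of the `Content-Type` value the server sends. A small sketch using only the Python 3 standard library; `charset_of` is a hypothetical helper name, and how you obtain the header value (browser developer tools, `curl -I`, and so on) is up to you:

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()
```

For a value such as `text/html; charset=utf-8` this returns `utf-8`; a header with no charset parameter returns `None`, in which case the browser falls back to its own guess.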

answered Mar 07 '11 at 11:20

seb


Thanks very much for your reply, seb. I can confirm that I'm getting the following value for `Content-Type`: `text/html; charset=utf-8`. I have not explicitly defined the `DEFAULT_CHARSET` setting for the project (and assume that Django is using the default of `UTF-8`). – ayaz Mar 07 '11 at 13:18
If you look at the source code of the page where you're seeing the wrong thing, do you see HTML entities, or the actual characters? – seb Mar 07 '11 at 15:38
@seb: I see the actual characters in the source of the page. For example, in the admin when I am viewing an object with such a value, one textarea is displayed thus: `GeschÃ¤ftsbedingungen` – ayaz Mar 07 '11 at 17:44
In that case, it seems likely that this is indeed UTF-8, and that a conversion like this has therefore happened somewhere: double-byte "ä" -> two single-byte characters "Ã¤" -> two double-byte characters "Ã¤". The fact that you can "fix" this by re-editing the data through the SQL admin tool suggests that maybe your database is not actually storing Unicode at all; e.g. it's taking double bytes from Django, storing them as single bytes, and then these single bytes are converted to two double bytes in ODBC. I don't know anything about SQL Server, so the help I can offer is running dry here. – seb Mar 08 '11 at 11:34
Thanks. That's interesting. I also noticed that the `collation` under `DATABASE_OPTIONS` is set to `Latin1_General_CI_AS`. Could that cause any problems? The database on MSSQL server is set to `Latin1_General_CI_AS`, though. – ayaz Mar 15 '11 at 13:17
Though I think collation isn't going to make any difference in this case. What I am wondering is whether these problems will go away if the database in question is converted to use the UTF-16 character set by default instead of Latin1_General. – ayaz Mar 15 '11 at 14:10
As I said, I don't know anything about SQL Server, but it sounds like it may be storing stuff in single bytes; so yes, perhaps this is the solution. However, in your OP you said it was storing double bytes and using the Latin1_General collation, and as you imply, collation is a different issue from encoding. Note also that *converting* the database *may* end up converting all your single bytes wrongly anyway, depending on how the conversion works. Hopefully there is some kind of sane Microsoft pointy-clicky way of doing / checking this. – seb Mar 17 '11 at 12:44
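
The conversion chain described in the comments above can be reversed for a single value, which also gives a way to test whether a string fetched from the database has been through it. This is a sketch in Python 3; `undo_double_encoding` is a hypothetical helper, not part of Django or pyodbc, and it assumes the wrong decode was exactly Latin-1:

```python
def undo_double_encoding(text):
    """Reverse a UTF-8-read-as-Latin-1 mix-up, or return text unchanged.

    If the characters of `text`, taken as Latin-1 bytes, form valid UTF-8,
    the decoded result is returned. Otherwise the input is returned as-is.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or its bytes are
        # not valid UTF-8: it was not garbled in this particular way.
        return text
```

Applied to the garbled value from the question it yields the original word, while a correctly stored "Geschäftsbedingungen" comes back unchanged because a lone \xe4 byte is not valid UTF-8. If values saved through the admin change under this function and values saved through the SQL Management software do not, that points at the write path rather than the read path. It is a diagnostic aid rather than a fix; a mix-up involving Windows-1252 instead of ISO-8859-1 would need `cp1252` in place of `latin-1`.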
