I have a problem displaying foreign-language characters in a web app with Tomcat 6. Previously we were using Tomcat 5.5 and we did not face this issue. To fix it I followed http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8 and made the changes to my web app accordingly. The web app now supports UTF-8 encoding and most of the encoding issues are fixed. The bigger problem I am facing now is with the old data that was stored in the database by the app under Tomcat 5.5. That old data is not displayed correctly in the UTF-8-encoded web app. What is the best way to have the web app configured with UTF-8 encoding and still be able to display the old data, which is not UTF-8 encoded?
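
For reference, the changes from that FAQ entry boil down to two things: setting URIEncoding="UTF-8" on the <Connector> in server.xml (which covers GET query strings) and forcing UTF-8 on request/response bodies with a servlet filter. Below is a minimal sketch of such a filter; the class name Utf8EncodingFilter is my own, not the app's actual code, and it would still need to be mapped in web.xml ahead of any other filter:

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Forces UTF-8 decoding of request bodies (POST parameters) and UTF-8
    // encoding of responses. GET query strings are decoded by the connector,
    // so URIEncoding="UTF-8" in server.xml is still required alongside this.
    public class Utf8EncodingFilter implements Filter {

        public void init(FilterConfig config) throws ServletException {
            // No configuration needed.
        }

        public void doFilter(ServletRequest request, ServletResponse response,
                FilterChain chain) throws IOException, ServletException {
            request.setCharacterEncoding("UTF-8");
            response.setCharacterEncoding("UTF-8");
            chain.doFilter(request, response);
        }

        public void destroy() {
            // Nothing to clean up.
        }
    }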

Also, I am not able to determine what encoding was used previously. Thanks.

bluetech
  • What DB are you using? Most DBs provide commands to convert a table's charset. – BalusC Sep 21 '11 at 14:26
  • We are using Oracle 11g. Is there a way to determine the encoding of the data? If I can find that, then I can get help from our database team. – bluetech Sep 21 '11 at 14:38
  • That's not necessary. If the data looks fine in the DB itself (and is thus not Mojibake over there), then it's just stored in the DB's own charset. You only have to migrate it to UTF-8. Send this to your DBA: http://download.oracle.com/docs/cd/B28359_01/server.111/b28298/ch11charsetmig.htm#autoId7 – BalusC Sep 21 '11 at 14:47
  • Thank you. I will check with them and see if that fixes it. – bluetech Sep 21 '11 at 15:03
  • Let me know if you succeed; then I'll turn the comments into an answer so that it can be accepted. – BalusC Sep 21 '11 at 15:04
  • Update: the DBAs would not do it because they had tried it before and ran into many other problems. I am still looking for a way to fix that data. I was able to figure out that the non-UTF-8 data is actually ISO-8859-1 encoded (that may be the default encoding of the Tomcat web container). Now that I know its encoding, I can convert it to UTF-8. But the real problem is finding all the data that is not displayable and converting it to UTF-8. – bluetech Oct 04 '11 at 17:53
  • It's still not clear if the DB table itself is using UTF-8 or ISO-8859-1. It's not possible to store UTF-8 data in an ISO-8859-1 table. – BalusC Oct 04 '11 at 18:03
  • The encoding of the tables is AL32UTF8. When I store a UTF-8 string and view it using Oracle SQL Developer, I can see it properly, but if it is an ISO-8859-1-encoded string then I see garbage characters. I also compared two differently encoded strings using the DUMP function and they were different (they appear identical when displayed properly). – bluetech Oct 04 '11 at 19:28
  • OK, then the DB conversion/migration is not needed at all. You just have to select the malformed data, interpret it as ISO-8859-1, and re-insert it as UTF-8. The hard part now is only figuring out which rows hold the data that was originally in ISO-8859-1. Do the tables have row sequences? – BalusC Oct 04 '11 at 19:35
  • I think that's what I have to do: manually find the bad data and fix it. I am still trying to think of a way to determine whether the encoding of a stored string is UTF-8 or not, since I am pretty sure that if it is not UTF-8 then it must be ISO-8859-1. I can write a script/patch which can be run on those tables to fix them (see the sketch after this comment thread). I am not really sure about row sequences. What are they? – bluetech Oct 04 '11 at 20:22
  • The insert IDs (called a `sequence` in Oracle DBs). If they're in order, then you could select only the specific subset which was inserted during the "bad" period. – BalusC Oct 04 '11 at 20:23
  • I will have to figure that out. I will post if I find any better solution to this problem. Thanks @BalusC. – bluetech Oct 04 '11 at 20:32
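
For anyone landing here: a minimal sketch of the detection/repair heuristic discussed in the comments above, assuming the bad rows hold UTF-8 bytes that were mis-decoded as ISO-8859-1 before being stored. The class and method names are mine, for illustration only:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // Heuristic repair for mojibake: strings whose original UTF-8 bytes were
    // mistakenly decoded as ISO-8859-1 (the old Tomcat 5.5 default) and then
    // stored as-is in the AL32UTF8 database.
    public class MojibakeFixer {

        private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");
        private static final Charset UTF_8 = Charset.forName("UTF-8");

        // Returns the repaired string if the input looks mis-decoded,
        // otherwise returns the input unchanged.
        public static String fix(String stored) {
            boolean ascii = true;
            for (int i = 0; i < stored.length(); i++) {
                char c = stored.charAt(i);
                if (c > 0xFF) {
                    // Contains characters outside ISO-8859-1, so it cannot be
                    // this kind of mojibake; leave it alone.
                    return stored;
                }
                if (c > 0x7F) {
                    ascii = false;
                }
            }
            if (ascii) {
                return stored; // ASCII is identical in both encodings.
            }
            // Reverse the wrong decoding step to recover the original bytes.
            byte[] bytes = stored.getBytes(ISO_8859_1);
            CharsetDecoder strictUtf8 = UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                // Non-ASCII text that also happens to be valid UTF-8 was
                // almost certainly mojibake: re-decode it correctly.
                return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                // Not valid UTF-8, so the stored string was already correct.
                return stored;
            }
        }

        public static void main(String[] args) {
            System.out.println(fix("Ã©tÃ©")); // prints "été"
            System.out.println(fix("été"));   // left unchanged
        }
    }

Combined with BalusC's sequence-ID suggestion (restricting the scan to rows inserted during the Tomcat 5.5 period), this should keep false positives rare. The heuristic can in principle misfire on genuine non-ASCII text that happens to re-encode into valid UTF-8, so spot-check the results before updating rows in place.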

0 Answers