From time to time we have encountered a very strange encoding problem in Tomcat in our production environment.
I have not yet been able to pinpoint exactly where in the code the problem happens, but it involves the replacement of non-ASCII characters with approximate ASCII equivalents.
For example, replacing the character 'å' with 'a'. Since the site is in Swedish, the characters 'å', 'ä' and 'ö' are quite common. But for some reason the replacement of the 'ö' character always works, so a string like "Köp inte grisen i säcken" becomes "Kop inte grisen i säcken", i.e. the 'ä' is not replaced as it should be, while the 'ö' character is.
Some quick facts about the problem:
It happens very seldom (we have noticed it 3-4 times, the first time maybe 1-2 years ago).
A restart of the troubled server makes the problem go away (until the next time).
It has never occurred on more than one front end server at the same time.
It doesn't always happen on the same front end server.
No user input on the front end is involved.
All front end servers connect to the same CMS and DB, with the relevant config being identical.
All front end servers have the same relevant configuration (Linux config, Tomcat config, Java environment config like "file.encoding" etc.), and are started using the same script (all according to the hosting/service provider).
All front end servers use the same exact war file for the site, and the same jar files.
No other encoding problems can be seen on the site while this character replacement problem occurs.
We have never been able to reproduce the problem in any other environment.
We use Tomcat 5.5 and Java 5, because of CMS requirements.
I can only think of two plausible causes for this behavior:
1. The hosting provider sometimes starts/restarts the front end servers in a different way, maybe with another user account with other environment variables or other file access rights, or maybe using some other script than the normal one.
2. Some process running during Tomcat or webapp startup depends upon some other process, and sometimes (intermittently but seldom) these two (or more) processes happen to run in an order that causes this encoding defect.
But even if 1 or 2 above is the case, that still doesn't fully explain what really happens. What exact difference could cause this, given that "file.encoding", "file.encoding.pkg", "sun.io.unicode.encoding", "sun.jnu.encoding" and all other relevant system properties and environment variables match on all front end machines (verified visually using a debug page, while the problem was occurring)?
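For anyone who wants to compare the same values on their own servers, a check of this kind can be sketched roughly as below. This is not the actual debug page: the class name `EncodingReport` and the method `collect` are made up for illustration, and `Charset.defaultCharset()` is an extra value that is not in the list of properties above.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the real debug page.
class EncodingReport {
    // The system properties named in the question.
    static final String[] KEYS = {
        "file.encoding",
        "file.encoding.pkg",
        "sun.io.unicode.encoding",
        "sun.jnu.encoding"
    };

    // Collects each property in a stable order; unset properties are marked as such.
    static Map<String, String> collect() {
        Map<String, String> report = new LinkedHashMap<String, String>();
        for (String key : KEYS) {
            String value = System.getProperty(key);
            report.put(key, value == null ? "(not set)" : value);
        }
        // Extra check (assumption: useful to compare alongside the properties).
        report.put("Charset.defaultCharset()", Charset.defaultCharset().name());
        return report;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : collect().entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Running it on a healthy server and on a troubled one and comparing the output line by line is less error-prone than comparing by eye.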
Can someone think of some plausible explanation for this strange intermittent behavior? Simply upgrading the Tomcat and/or Java version is not really a relevant answer, since we don't know if that would solve the problem, and it still wouldn't explain what the problem was. I'm more interested in understanding exactly what causes the problem.
Regards /Jimi
UPDATE:
I think I have found the code that performs the character replacements. On initialization (triggered by the first call to do a replacement) it builds a HashMap<Character, String>, and fills it like this:
lookup.put(new Character('å'), "a");
Then, when it should replace characters in a String, it loops over each character and for each one does a lookup in the hash map with the character as the key. If a replacement String is found it is used, otherwise the original character is used.
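To make the description concrete, here is a minimal sketch of that logic. It is a reconstruction, not the real code: the class name `AsciiFolder` and the method names are hypothetical, the entries for 'ä' and 'ö' are assumed (only the 'å' line is quoted above), and the sketch synchronizes the lazy initialization as an assumption, since the description above does not say whether the real code does. The keys are written as Unicode escapes so that the sketch does not depend on the encoding of the source file, whereas the real code uses character literals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reconstruction of the replacement logic described above.
class AsciiFolder {
    private static Map<Character, String> lookup;

    // Builds the table on the first call; synchronized here as an assumption.
    private static synchronized Map<Character, String> table() {
        if (lookup == null) {
            Map<Character, String> m = new HashMap<Character, String>();
            m.put(Character.valueOf('\u00e5'), "a"); // U+00E5, a with ring
            m.put(Character.valueOf('\u00e4'), "a"); // U+00E4, a with diaeresis (assumed entry)
            m.put(Character.valueOf('\u00f6'), "o"); // U+00F6, o with diaeresis (assumed entry)
            lookup = m;
        }
        return lookup;
    }

    // Replaces every character that has a table entry; keeps all others.
    static String replace(String input) {
        Map<Character, String> map = table();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = map.get(Character.valueOf(c));
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replace("K\u00f6p inte grisen i s\u00e4cken"));
    }
}
```

With a correctly built table, every one of the three characters is replaced, so a result where only 'ö' is replaced means the lookup for 'ä' missed the map for some reason.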
This part of the code is more than 3 years old, and was written by a developer who is long gone. If I were to rewrite this code today I would do something totally different, and that might even solve the problem. But it would still not explain exactly what happened. Can someone see some possible explanation?