
From time to time we have encountered a very strange encoding problem in Tomcat in our production environment.

I have not yet been able to pinpoint exactly where in the code the problem happens, but it involves the replacement of non-ASCII characters with approximate ASCII characters.

For example, replacing the character 'å' with 'a'. Since the site is in Swedish, the characters 'å', 'ä' and 'ö' are quite common. But for some reason the replacement of the 'ö' character always works, so a string like "Köp inte grisen i säcken" becomes "Kop inte grisen i säcken", i.e. the 'ä' is not replaced as it should be, while the 'ö' is.

Some quick facts about the problem:

  • It happens very seldom (we have noticed it 3-4 times, the first time maybe 1-2 years ago).

  • A restart of the troubled server makes the problem go away (until the next time).

  • It has never occurred on more than one front end server at the same time.

  • It doesn't always happen on the same front end server.

  • No user input on the front end is involved.

  • All front end servers connect to the same CMS and DB, with the relevant config being identical.

  • All front end servers have the same relevant configuration (Linux config, Tomcat config, Java environment config like "file.encoding" etc.), and are started using the same script (all according to the hosting/service provider).

  • All front end servers use the same exact war file for the site, and the same jar files.

  • No other encoding problems can be seen on the site while this character replacement problem occurs.

  • We have never been able to reproduce the problem in any other environment.

We use Tomcat 5.5 and Java 5, because of CMS requirements.

I can only think of two plausible causes for this behavior:

  1. The hosting provider sometimes starts/restarts the front end servers in a different way, maybe with another user account with other environment variables or other file access rights, or maybe using some other script than the normal one.

  2. Some process running during Tomcat or webapp startup depends upon some other process, and sometimes (intermittently but seldom) these two (or more) processes happen to run in an order that causes this encoding defect.

But even if 1 or 2 above is the case, it still doesn't fully explain what really happens. What exact difference could explain this, given that "file.encoding", "file.encoding.pkg", "sun.io.unicode.encoding", "sun.jnu.encoding" and all other relevant system properties match on all front end machines (verified visually using a debug page while the problem was occurring)?

Can someone think of a plausible explanation for this strange intermittent behavior? Simply upgrading the Tomcat and/or Java version is not really a relevant answer, since we don't know if that would solve the problem, and it still wouldn't explain what the problem was. I'm more interested in understanding exactly what causes the problem.

Regards /Jimi

UPDATE:

I think I have found the code that performs the character replacements. On initialization (triggered by the first call to do a replacement) it builds a HashMap<Character, String>, and fills it like this:

lookup.put(new Character('å'), "a");  

Then, when it should replace characters in a String, it loops over each character and does a lookup in the hash map with the character as the key. If a replacement String is found it is used, otherwise the original character is used.

This part of the code is more than 3 years old, and was written by a developer who is long gone. If I rewrote this code today I would do something totally different, and that might even solve the problem. But it would still not explain exactly what happened. Can someone see a possible explanation?
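
For readers who want something concrete, the routine described above might look roughly like the following. This is a hypothetical reconstruction, not the actual code: the class name `AsciiApproximator`, the method names, and every map entry other than the 'å' one quoted above are assumptions, and whether the real initialization is synchronized is unknown. Characters are written as Unicode escapes so the sketch does not depend on the compiler's source encoding.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reconstruction; names and most map entries are assumptions.
final class AsciiApproximator {

    // Built lazily on the first replacement call, as the question describes.
    // No synchronization here; whether the real code has any is unknown.
    private static Map<Character, String> lookup;

    private static Map<Character, String> lookup() {
        if (lookup == null) {
            lookup = new HashMap<Character, String>();
            lookup.put(Character.valueOf('\u00e5'), "a"); // a with ring (quoted entry)
            lookup.put(Character.valueOf('\u00e4'), "a"); // a with diaeresis (assumed)
            lookup.put(Character.valueOf('\u00f6'), "o"); // o with diaeresis (assumed)
        }
        return lookup;
    }

    // Replaces each char that has a table entry; other chars pass through.
    static String replace(String input) {
        Map<Character, String> map = lookup();
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = map.get(Character.valueOf(c));
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "Kop inte grisen i sacken"
        System.out.println(replace("K\u00f6p inte grisen i s\u00e4cken"));
    }
}
```

Note that a lookup keyed on single `char` values can only match precomposed characters; an 'a' followed by a separate combining mark would pass through untouched.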

user1921254
  • Sounds like a multithreading issue with some operations that are not thread-safe. I have seen similar problems with wrong conversions happening once every few weeks. The reason was always access to shared data that was not thread-safe. Try to put some very heavy load on the machines, forcing a lot of parallel character conversions, and see what happens. – Udo Klimaschewski Dec 21 '12 at 13:05
  • The problem is that once the server ends up in this troublesome mode, all these operations end up the same incorrect way, every single time until the server is restarted. So if multithreading is the cause, what exactly is it triggering, in your opinion? Also, subjecting the production front ends to heavy load would have a negative impact on the web site visitors, and that is not something we are willing to do. – user1921254 Dec 21 '12 at 13:41
  • Without knowing the code, it is nearly impossible to say. Try to isolate the conversion routines and run them in a heavy multithreaded environment. – Udo Klimaschewski Dec 21 '12 at 14:21
  • I just updated the question with some code. Might have time to do some local multithreaded tests after the holidays, though I am still not convinced how exactly it could be the cause of this. Any theoretical explanation on that? – user1921254 Dec 21 '12 at 15:32
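
The suggestion in the comments, to isolate the conversion routine and run it from many threads, could be sketched as below. Everything here is hypothetical: `ReplacementStressTest`, `Replacer`, `STAND_IN` and `countMismatches` are invented names, and the stand-in replacer is only a placeholder for the real routine. All workers are released together with a latch so that their first calls overlap, since the question says the map is built on the first call.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness; all names here are invented for illustration.
final class ReplacementStressTest {

    /** Hook for the routine under test. */
    interface Replacer {
        String replace(String input);
    }

    /** Placeholder so the sketch runs on its own; swap in the real routine. */
    static final Replacer STAND_IN = new Replacer() {
        public String replace(String input) {
            return input.replace('\u00e4', 'a').replace('\u00f6', 'o');
        }
    };

    /** Runs the replacer from many threads at once and counts wrong results. */
    static int countMismatches(final Replacer replacer, final String input,
                               final String expected, int threads,
                               final int callsPerThread) {
        final CountDownLatch start = new CountDownLatch(1);
        final CountDownLatch done = new CountDownLatch(threads);
        final AtomicInteger mismatches = new AtomicInteger();
        for (int t = 0; t < threads; t++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        start.await(); // block until the main thread releases everyone
                        for (int i = 0; i < callsPerThread; i++) {
                            if (!expected.equals(replacer.replace(input))) {
                                mismatches.incrementAndGet();
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } catch (RuntimeException e) {
                        mismatches.incrementAndGet(); // a crash in the routine also counts
                    } finally {
                        done.countDown();
                    }
                }
            }).start();
        }
        start.countDown(); // release all workers so their first calls overlap
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return mismatches.get();
    }

    public static void main(String[] args) {
        int bad = countMismatches(STAND_IN, "K\u00f6p inte grisen i s\u00e4cken",
                "Kop inte grisen i sacken", 32, 1000);
        System.out.println("mismatches: " + bad);
    }
}
```

If the real routine builds its map only once per JVM, a single run exercises the initialization race only once, so a harness like this would probably need to be started many times in fresh JVMs before it could show anything.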

1 Answer


Normalize the input to Normalization Form C (NFC) before doing the replacement.

For instance, ä can be just 1 character, U+00E4, or it can be two characters, a (U+0061) and the combining diaeresis U+0308.

If your replacement just looks for the composed form, then the decomposed form will still remain as \u0061\u0308 because neither of those match \u00e4:

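import java.text.Normalizer; // part of the JDK since Java 6

public class NormalizationDemo {
    public static void main(String[] args) {
        String decomposed = "\u0061\u0308"; // 'a' followed by a combining diaeresis
        String composed = "\u00e4";         // precomposed a with diaeresis

        System.out.println(decomposed);
        System.out.println(composed);
        // The two strings render the same but differ as char sequences.
        System.out.println(composed.equals(decomposed));
        // After NFC normalization the decomposed form equals the composed one.
        System.out.println(Normalizer
                .normalize(decomposed, Normalizer.Form.NFC).equals(composed));
    }
}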

Output

ä
ä
false
true
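
Applied to a lookup-based replacement, the idea might look like this minimal sketch. The class `NormalizingReplacer`, its method, and the map contents are hypothetical. It also assumes `java.text.Normalizer`, which is only available from Java 6 onward, so it would not run unchanged on a Java 5 runtime.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: normalize to NFC, then apply a char-keyed lookup.
final class NormalizingReplacer {

    // Assumed table, keyed by the composed (NFC) characters.
    private static final Map<Character, String> LOOKUP =
            new HashMap<Character, String>();
    static {
        LOOKUP.put(Character.valueOf('\u00e5'), "a"); // a with ring
        LOOKUP.put(Character.valueOf('\u00e4'), "a"); // a with diaeresis
        LOOKUP.put(Character.valueOf('\u00f6'), "o"); // o with diaeresis
    }

    static String replace(String input) {
        // Compose sequences such as 'a' + U+0308 into U+00E4 so the lookup matches.
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(nfc.length());
        for (int i = 0; i < nfc.length(); i++) {
            char c = nfc.charAt(i);
            String replacement = LOOKUP.get(Character.valueOf(c));
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Decomposed input: 's', 'a', combining diaeresis, "cken".
        System.out.println(replace("s\u0061\u0308cken")); // prints "sacken"
    }
}
```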
Esailija
  • OK, this is interesting. Character encoding is a jungle of information. But... it still doesn't explain why this is happening intermittently. I.e. once in a while, one of the front end servers ends up in this strange "mode", where the same string data, from the same CMS, is treated differently every time, until that server is restarted. – user1921254 Dec 21 '12 at 13:46
  • @user1921254 yeah, but it's something to try. I cannot really diagnose it remotely... or even if I was physically there it would still be quite a venture – Esailija Dec 21 '12 at 13:54
  • A debug printout of the int values of all the characters in both the input String, and the replacement HashMap when this problem occurs, would help finding out if a mixture of composed and decomposed characters is involved, right? Because I have added that in the code. – user1921254 Dec 21 '12 at 15:48
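
A debug printout of that kind could be as small as the sketch below (the class and method names are made up for illustration). It lists each UTF-16 code unit of a string as a `U+XXXX` value, so a composed 'ä' shows up as a single value while a decomposed one shows up as two.

```java
// Hypothetical debugging helper; names are invented for illustration.
final class CodePointDump {

    // Formats every char of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u00e4"));  // prints "U+00E4"
        System.out.println(dump("a\u0308")); // prints "U+0061 U+0308"
    }
}
```

Running the same helper over the keys of the replacement map would show whether the table itself still holds the expected values while the problem is occurring.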