Strange font encoding in all browsers

Question

We are having some encoding issues that make text render differently across browsers. Compare this jsfiddle in different browsers:

https://jsfiddle.net/w3297yLt/

The text should look like this:

Apple Museum je první muzeum svého druhu v České republice, 
které bylo nedávno otevřeno v Husově ulici v centru Prahy. 
Můžete zde nahlédnout do nedávné minulosti a vžít se do doby, 
kdy Steve Jobs sestrojil spolu se Stevem Wozniakem v garáži 
svých rodičů první osobní ...

Note that this is not a font issue; it happens even with fonts that are completely sound.

Chrome (note that it breaks even characters without diacritics; see the word garáži):

Firefox:

Safari (similar to Chrome, but the problem with garáži doesn't happen):

At first glance the text looks correct, but there seem to be some issues with it. In Firefox on our website it looks even stranger (https://goout.net/cs/muzea/apple-museum/wucb/):

My impression is that each accented character is actually split into a base letter and a separate diacritic. How can I fix this? Is there an algorithm or a tool for it? We are using Java, so we will have to implement the fix in Java.

Regarding the latter Firefox example: what text/HTML editor are you using? The [text is not normalised but decomposed](http://www.unicode.org/reports/tr15/#Norm_Forms). For instance, `m e ̌ s ̌ t ̌ a n s k e ́ m` instead of `m ě š ť a n s k é m` (spaces added between neighbouring glyphs so that the combining accents render properly). BTW, this question would fit better on SuperUser… — JosefZ, Feb 06 '17 at 21:13
See also [Text run is not in Unicode Normalization Form C](http://stackoverflow.com/q/5465170/3439404). _To improve interoperability, the W3C recommends the use of NFC normalized text on the Web._ — JosefZ, Feb 06 '17 at 21:44
This text was simply copy-pasted from another site by our editors. They are regular people who don't understand the technicalities behind it. I need to implement something that repairs the text so that our editors don't have to worry. I am posting here and not on SuperUser because I will be implementing the fix in Java. I will be happy to renormalize the text, but I just don't know how. — Vojtěch, Feb 06 '17 at 21:51
Oracle Java [Normalizing Text](https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html) tutorial? — JosefZ, Feb 06 '17 at 22:10
I'm guessing you're using a web font which doesn't contain glyphs for some of the characters in your text. Browsers will then substitute glyphs from another font. That would explain the mix of different looking glyphs. — roeland, Feb 06 '17 at 22:34

score 1 · Accepted Answer · answered Feb 07 '17 at 16:31

Regarding the latter Firefox example: the text is not normalised but decomposed. To improve interoperability, the W3C recommends the use of NFC-normalised text on the Web (see Normalization in HTML and CSS).

Following Oracle's Java tutorial on Normalizing Text, I'd advise using the normalize method of java.text.Normalizer as follows:

String normalizedString = Normalizer.normalize(targetChars, Normalizer.Form.NFC);  // needs: import java.text.Normalizer;

For instance, decomposed characters in the word "Můžete" (copy-pasted from Apple Museum) could be mistakenly rendered as

"M u ̊ z ̌ e t e" (8 decomposed characters) instead of
"M ů ž e t e" (6 precomposed characters).

(Note the spaces added between neighbouring glyphs so that the combining accents render properly.)

Unfortunately, I can't offer a tested Java example of the normalize method; instead, here's an example of the analogous .Normalize method in PowerShell:

PS D:\PShell> 'Může' | Get-CharInfo | Format-Table -AutoSize -Wrap

Char CodePoint        Category Description           
---- ---------        -------- -----------           
   M U+004D    UppercaseLetter Latin Capital Letter M
   u U+0075    LowercaseLetter Latin Small Letter U  
   ̊  U+030A     NonSpacingMark Combining Ring Above  
   z U+007A    LowercaseLetter Latin Small Letter Z  
   ̌  U+030C     NonSpacingMark Combining Caron       
   e U+0065    LowercaseLetter Latin Small Letter E  

PS D:\PShell> 'Může'.Normalize('FormC') | Get-CharInfo | Format-Table -AutoSize -Wrap

Char CodePoint        Category Description                         
---- ---------        -------- -----------                         
   M U+004D    UppercaseLetter Latin Capital Letter M              
   ů U+016F    LowercaseLetter Latin Small Letter U With Ring Above
   ž U+017E    LowercaseLetter Latin Small Letter Z With Caron     
   e U+0065    LowercaseLetter Latin Small Letter E                

PS D:\PShell>
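For the Java side the question asks about, a comparable inspection could be sketched as follows. This is only a minimal sketch and not part of the original answer: the class name `CharInfoDemo` and the helper `describe` are hypothetical, and the decomposed input is spelled with escapes on the assumption that it matches what the editors pasted. It uses only `java.text.Normalizer` and `Character.getName` from the standard library.

```java
import java.text.Normalizer;

public class CharInfoDemo {
    // Hypothetical helper: lists each code point of the text as "U+XXXX NAME", one per line.
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp ->
            sb.append(String.format("U+%04X %s%n", cp, Character.getName(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decomposed form of the sample word, written with escapes so that
        // the combining marks are not silently recomposed by an editor.
        String decomposed = "Mu\u030Az\u030Ce";
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.print(describe(decomposed)); // base letters plus combining marks
        System.out.println("---");
        System.out.print(describe(composed));   // precomposed letters
    }
}
```

Dumping code points this way makes it easy to confirm whether a suspicious string is decomposed before deciding to normalize it.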

And here's Python's normalize method:

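import unicodedata

# Decomposed "Můžete" (as copy-pasted from Apple Museum), written with
# escapes so the combining marks are not silently recomposed in transit
unistr = 'Mu\u030az\u030cete'
print('decomposed', len(unistr), unistr)
normalized = unicodedata.normalize('NFC', unistr)
print('normalized', len(normalized), normalized)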
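Since the asker wants to repair pasted text in Java so that editors need not care, a small helper along these lines might be applied wherever editor input is saved. Treat it as a sketch under assumptions rather than a tested implementation: `TextRepair` and `toNfc` are hypothetical names, and passing `null` through unchanged is a design choice made here, not something the original answer specifies.

```java
import java.text.Normalizer;

public class TextRepair {
    // Hypothetical helper: returns the text in NFC.
    // null is passed through so callers need no extra guard.
    static String toNfc(String text) {
        if (text == null || Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // already composed (or absent), nothing to do
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String pasted = "Mu\u030Az\u030Cete"; // decomposed sample word from the answer
        String repaired = toNfc(pasted);
        // Prints the char counts before and after normalization.
        System.out.println(pasted.length() + " -> " + repaired.length()); // 8 -> 6
    }
}
```

The `isNormalized` check is optional; it merely skips the normalization call for text that is already in NFC, which is likely the common case for typed (rather than pasted) input.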
