We are having encoding issues that make the same text render differently in different browsers. Compare this jsfiddle across browsers:

https://jsfiddle.net/w3297yLt/

The text should look like this:

Apple Museum je první muzeum svého druhu v České republice, 
které bylo nedávno otevřeno v Husově ulici v centru Prahy. 
Můžete zde nahlédnout do nedávné minulosti a vžít se do doby, 
kdy Steve Jobs sestrojil spolu se Stevem Wozniakem v garáži 
svých rodičů první osobní ...

Note that this is not a font issue; it also happens with fonts that are completely sound.

Chrome (note that it breaks even characters without diacritics; check the word garáži):

[Screenshot: rendering in Chrome]

Firefox:

[Screenshot: rendering in Firefox]

Safari (similar to Chrome, but the problem with garáži doesn't happen):

[Screenshot: rendering in Safari]

At first glance the text looks correct, but there seem to be some issues with it. In Firefox on our website it looks even stranger (https://goout.net/cs/muzea/apple-museum/wucb/):

[Screenshot: rendering in Firefox on goout.net]

My impression is that the text is actually split into base characters and separate diacritics. How can I fix this? Is there an algorithm or tool for it? We are using Java, so we will have to implement the fix in Java.

Vojtěch
  • Towards the latter Firefox instance: what text/html editor are you using? The [text is not normalised but decomposed](http://www.unicode.org/reports/tr15/#Norm_Forms). For instance, `m e ̌ s ̌ t ̌ a n s k e ́ m` instead of `m ě š ť a n s k é m` (added spaces between neighbouring glyphs to render combining accents properly). BTW, this question belongs rather on SuperUser… – JosefZ Feb 06 '17 at 21:13
  • See also [Text run is not in Unicode Normalization Form C](http://stackoverflow.com/q/5465170/3439404). _To improve interoperability, the W3C recommends the use of NFC normalized text on the Web._ – JosefZ Feb 06 '17 at 21:44
  • This text was simply copy-pasted from another site by our editors. They are regular people and don't understand the technicalities behind it. I need to implement something that repairs the text structure so that our editors don't have to worry about it. I am posting it here and not on SuperUser because I will be writing Java code to fix this. I will be happy to renormalize the text, but I just don't know how. – Vojtěch Feb 06 '17 at 21:51
  • Oracle Java [Normalizing Text](https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html) tutorial? – JosefZ Feb 06 '17 at 22:10
  • I'm guessing you're using a web font which doesn't contain glyphs for some of the characters in your text. Browsers will then substitute glyphs from another font. That would explain the mix of different looking glyphs. – roeland Feb 06 '17 at 22:34
  • @JosefZ if you post this as an answer, I will accept it. – Vojtěch Feb 07 '17 at 04:17

1 Answer

Regarding the latter Firefox screenshot: the text is not normalised but decomposed. To improve interoperability, the W3C recommends the use of NFC-normalised text on the Web (see Normalization in HTML and CSS).

Following the Oracle Java Normalizing Text tutorial, I'd advise using the following normalize method:

normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFC);

Cf. the more elaborate NormSample.java source code (Copyright (c) 1995, 2008, Oracle and/or its affiliates. All rights reserved.)
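As a minimal sketch of how that call might be wrapped for text pasted in by editors: the class name `NfcText` and the method `toNfc` below are hypothetical, chosen only for illustration, and the only library API assumed is `java.text.Normalizer` from the JDK. The sample string spells the decomposed word with explicit escapes so that the form is unambiguous.

```java
import java.text.Normalizer;

// Hypothetical helper; the names NfcText and toNfc are illustrative only.
final class NfcText {
    private NfcText() {}

    /** Returns the input in Unicode Normalization Form C; null is passed through. */
    static String toNfc(String input) {
        if (input == null) {
            return null;
        }
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The word from the example in decomposed form:
        // u + U+030A (combining ring above), z + U+030C (combining caron).
        String decomposed = "Mu\u030Az\u030Cete";
        String composed = toNfc(decomposed);

        System.out.println(decomposed.length() + " -> " + composed.length()); // prints "8 -> 6"
        System.out.println(composed.equals("M\u016F\u017Eete"));              // prints "true"
    }
}
```

Applying such a helper once, at the point where editor input is saved, would presumably keep decomposed text out of storage, so the templates need no further changes.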


For instance, decomposed characters in the word "Můžete" (copy-pasted from Apple Museum) could be mistakenly rendered as

  • "M u ̊ z ̌ e t e" (8 decomposed characters) instead of
  • "M ů ž e t e" (6 precomposed characters).

(Spaces were added between neighbouring glyphs so that the combining accents render properly.)

Unfortunately, I have no Java example of my own for the normalize method; instead, here's an example of the analogous .Normalize method in PowerShell:

PS D:\PShell> 'Může' | Get-CharInfo | Format-Table -AutoSize -Wrap

Char CodePoint        Category Description           
---- ---------        -------- -----------           
   M U+004D    UppercaseLetter Latin Capital Letter M
   u U+0075    LowercaseLetter Latin Small Letter U  
   ̊  U+030A     NonSpacingMark Combining Ring Above  
   z U+007A    LowercaseLetter Latin Small Letter Z  
   ̌  U+030C     NonSpacingMark Combining Caron       
   e U+0065    LowercaseLetter Latin Small Letter E  

PS D:\PShell> 'Může'.Normalize('FormC') | Get-CharInfo | Format-Table -AutoSize -Wrap

Char CodePoint        Category Description                         
---- ---------        -------- -----------                         
   M U+004D    UppercaseLetter Latin Capital Letter M              
   ů U+016F    LowercaseLetter Latin Small Letter U With Ring Above
   ž U+017E    LowercaseLetter Latin Small Letter Z With Caron     
   e U+0065    LowercaseLetter Latin Small Letter E                

PS D:\PShell> 
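A rough Java counterpart of the inspection above could look like the following sketch. The class `CodePointDump` and its methods are hypothetical names; the JDK calls assumed are `String.codePoints()`, `Character.getName(int)` and `Character.getType(int)`. It lists each code point with its Unicode name and counts the non-spacing marks before and after normalization.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical inspection helper; names are illustrative, not from the original post.
final class CodePointDump {
    private CodePointDump() {}

    /** One entry per code point: "U+XXXX" followed by the Unicode character name. */
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        text.codePoints().forEach(cp ->
                lines.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return lines;
    }

    /** Counts non-spacing marks (Unicode category Mn), such as combining accents. */
    static long countCombiningMarks(String text) {
        return text.codePoints()
                .filter(cp -> Character.getType(cp) == Character.NON_SPACING_MARK)
                .count();
    }

    public static void main(String[] args) {
        // Same decomposed sample as in the PowerShell session, written with escapes.
        String decomposed = "Mu\u030Az\u030Ce";
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        describe(decomposed).forEach(System.out::println);
        describe(composed).forEach(System.out::println);
        System.out.println("combining marks: " + countCombiningMarks(decomposed)
                + " -> " + countCombiningMarks(composed));
    }
}
```

A non-zero count of combining marks in Czech text is a reasonable hint, though not proof, that the text arrived in decomposed form.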

And here's Python's unicodedata.normalize function:

import unicodedata

# Decomposed form, as copy-pasted from Apple Museum; escapes make the combining marks visible
unistr = 'Mu\u030az\u030cete'
print('decomposed', unistr)
print('normalized', unicodedata.normalize('NFC', unistr))

See also this jsfiddle.
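Since the affected text was pasted in by editors and is presumably already stored, it may also help to find out which existing entries are affected before rewriting anything. The sketch below is only an illustration under that assumption: `NfcAudit` and `findNonNfc` are hypothetical names, and the check relies on `Normalizer.isNormalized` from the JDK.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical audit helper; class and method names are illustrative only.
final class NfcAudit {
    private NfcAudit() {}

    /** Returns the indexes of entries that are not in Normalization Form C. */
    static List<Integer> findNonNfc(List<String> texts) {
        List<Integer> offenders = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            String text = texts.get(i);
            if (text != null && !Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
                offenders.add(i);
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList(
                "Praha",                  // plain ASCII, already NFC
                "gara\u0301z\u030Ci",     // decomposed: a + U+0301, z + U+030C
                "gar\u00E1\u017Ei");      // precomposed, already NFC
        System.out.println(findNonNfc(stored)); // prints "[1]"
    }
}
```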

JosefZ