Questions tagged [unicode-normalization]

Unicode normalization converts Unicode strings to a standard form. The normalization forms (NFC, NFD, NFKC, NFKD) remove differences in the code point sequences, and hence the binary representation, of equivalent Unicode strings.
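
As a rough illustration of the description above, here is a minimal Python sketch using the standard library's `unicodedata.normalize`. The sample word is borrowed from the password question listed further down; the helper name `canonically_equal` is hypothetical and only for this example.

```python
import unicodedata

# Two spellings of the same visible text.
composed = "ma\u00F1ana"      # U+00F1, precomposed n with tilde
decomposed = "man\u0303ana"   # "n" followed by U+0303, combining tilde


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw code point sequences differ, so a plain comparison fails.
print(composed == decomposed)

# After normalizing both sides to the same form, they compare equal.
print(canonically_equal(composed, decomposed))
```

NFC is used here only as one reasonable choice; normalizing both sides to NFD would work equally well for the comparison, as long as both strings go through the same form.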

200 questions
145
votes
7 answers

What is normalized UTF-8 all about?

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching. However, I'm trying to figure out what this means for applications. For example, in…
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
36
votes
6 answers

File.listFiles() mangles unicode names with JDK 6 (Unicode Normalization issues)

I'm struggling with a strange file name encoding issue when listing directory contents in Java 6 on both OS X and Linux: the File.listFiles() and related methods seem to return file names in a different encoding than the rest of the system. Note…
32
votes
2 answers

When to use Unicode Normalization Forms NFC and NFD?

The Unicode Normalization FAQ includes the following paragraph: Programs should always compare canonical-equivalent Unicode strings as equal ... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and…
Jesse Hallam
  • 6,794
  • 8
  • 48
  • 70
24
votes
3 answers

Unicode Normalization in Windows

I've been using "unicode strings" in Windows for as long as... I've learned about Unicode (e.g. after graduating). However, it always mystified me that the Win32API mentions "unicode" very loosely. In particular, "unicode" variant mentioned by MSN…
André Caron
  • 44,541
  • 12
  • 67
  • 125
22
votes
1 answer

How does unicodedata.normalize(form, unistr) work?

In the API doc, http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize, it says: Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. The documentation is rather…
alvas
  • 115,346
  • 109
  • 446
  • 738
21
votes
1 answer

How to properly Normalize a String with composite characters?

Java Normalize already allows me to take accented characters and output non-accented characters. It does not, however, seem to deal with composite characters (Œ, Æ) very well at all. Is there a way for Java to deal with these characters natively?…
Weckar E.
  • 727
  • 2
  • 5
  • 19
20
votes
5 answers

Normalizing unicode text to filenames, etc. in Python

Are there any standalone-ish solutions for normalizing international Unicode text to safe ids and filenames in Python? E.g. turn My International Text: åäö to my-international-text-aao plone.i18n does a really good job, but unfortunately it depends on…
Mikko Ohtamaa
  • 82,057
  • 50
  • 264
  • 435
20
votes
4 answers

What is the best way to remove accents with Apache Spark dataframes in PySpark?

I need to delete accents from characters in Spanish and other languages from different datasets. I already wrote a function, based on the code provided in this post, that removes the accents. The problem is that the function is slow because it…
20
votes
2 answers

What Unicode normalization (and other processing) is appropriate for passwords when hashing?

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function? Goals Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with…
18
votes
2 answers

Text run is not in Unicode Normalization Form C

While trying to validate my site, I got the following error: Text run is not in Unicode Normalization Form C. A: What does it mean? B: Can I fix it with Notepad++, and how? C: If the answer to B is no, how can I fix this with free tools (not Dreamweaver)?
Randall Flagg
  • 4,834
  • 9
  • 33
  • 45
18
votes
1 answer

When a string is not a string? Unicode normalization weirdness in Javascript

I have run into what is, to me, some serious weirdness with string behavior in Firefox when using the .normalize() Unicode normalization function. Here is a demo, view the console in Firefox to see the problem. Suppose I have a button with an id of…
user2467065
17
votes
1 answer

How do I check equality of Unicode strings in Javascript?

I have two strings in Javascript: "_strange_chars_µö¬é@zendesk.com.eml" (f1) and "_strange_chars_µö¬é@zendesk.com.eml" (f2). At first glance, they look identical (and, indeed, on StackOverflow, they may be; I'm not sure what happens when they are…
James A. Rosen
  • 64,193
  • 61
  • 179
  • 261
17
votes
5 answers

Unicode string normalization in C/C++

I am wondering how to normalize strings (containing UTF-8/UTF-16) in C/C++. In .NET there is a function String.Normalize. I used UTF8-CPP in the past but it does not provide such a function. ICU and Qt provide string normalization but I prefer…
Ghassen Hamrouni
  • 3,138
  • 2
  • 20
  • 31
17
votes
5 answers

Javascript string comparison fails when comparing unicode characters

I want to compare two strings in JavaScript that are the same, and yet the equality operator == returns false. One string contains a special character (e.g. the Danish å). JavaScript code: var filenameFromJS = "Designhåndbog.pdf"; var…
17
votes
1 answer

Why isn't string.Normalize consistent depending on the context?

I have the following code: string input = "ç"; string normalized = input.Normalize(NormalizationForm.FormD); char[] chars = normalized.ToCharArray(); I build this code with Visual Studio 2010, .NET 4, on 64-bit Windows 7. I run it in a unit tests…
remio
  • 1,242
  • 2
  • 15
  • 36