Unicode string with diacritics split by chars

Question

I have this Unicode string: Ааа́Ббб́Ввв́ГгҐґДд

And I want to split it into chars. Right now, if I loop through all the chars, I get something like this:
A a a ' Б ...

Is there a way to properly split this string to chars: А а а́ ?

@Nivas doesn't really matter, `"а́"` is 2 characters from javascript's point of view. `"а" + "́" === "а́"` — Esailija, May 25 '12 at 17:57
@Esailija Nevermind. For whatever reason I thought this was a Java question. Did not read the tags(nor the title)... — Nivas, May 25 '12 at 18:01
@Nivas since ES6 came out, how you iterate makes all the difference: `for..of` uses `String.prototype[Symbol.iterator]`, which iterates in code point steps (sometimes more than one UTF-16 code unit long), while indexing with brackets doesn't. – ygormutti, Jul 10 '18 at 19:04
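
To make the distinction in these comments concrete, here is a small sketch (variable names are illustrative only). It suggests that code point iteration keeps a surrogate pair together, but still treats a combining accent as a separate item:

```javascript
// Cyrillic "а" (U+0430) followed by U+0301 COMBINING ACUTE ACCENT
const accented = '\u0430\u0301';

// Bracket indexing walks UTF-16 code units
const byIndex = [];
for (let i = 0; i < accented.length; i++) byIndex.push(accented[i]);

// Spread uses String.prototype[Symbol.iterator], which walks code points
const byIterator = [...accented];

console.log(byIndex.length, byIterator.length); // 2 2

// U+1F40E, one code point stored as a surrogate pair
const horse = '\ud83d\udc0e';
console.log(horse.length, [...horse].length); // 2 1
```

So `for..of` helps with characters outside the BMP, while the accent in the question is a code point of its own and needs grapheme-level handling.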

score 12 · Accepted Answer · answered May 27 '12 at 11:46

To do this properly, what you want is the algorithm for working out the grapheme cluster boundaries, as defined in UAX 29. Unfortunately this requires knowledge of which characters are members of which classes, from the Unicode Character Database, and JavaScript doesn't make that information available(*). So you'd have to include a copy of the UCD with your script, which would make it pretty bulky.

An alternative if you only need to worry about the basic accents used by Latin or Cyrillic would be to take only the Combining Diacritical Marks block (U+0300-U+036F). This would fail for other languages and symbols, but might be enough for what you want to do.

function findGraphemesNotVeryWell(s) {
    var re= /.[\u0300-\u036F]*/g;
    var match, matches= [];
    while (match= re.exec(s))
        matches.push(match[0]);
    return matches;
}

findGraphemesNotVeryWell('Ааа́Ббб́Ввв́ГгҐґДд');
["А", "а", "а́", "Б", "б", "б́", "В", "в", "в́", "Г", "г", "Ґ", "ґ", "Д", "д"]

(*: there might be a way to extract the information by letting the browser render the string, and measuring the positions of selections in it... but it would surely be very messy and difficult to get working cross-browser.)
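
Newer JavaScript engines ship `Intl.Segmenter`, which performs grapheme segmentation (typically following UAX 29) without bundling Unicode data yourself. A minimal sketch, assuming the runtime supports it; the helper name `splitGraphemes` is made up for illustration:

```javascript
// Hypothetical helper; assumes Intl.Segmenter is available in the runtime.
function splitGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const clusters = splitGraphemes('Ааа\u0301Ббб\u0301Ввв\u0301ГгҐґДд');
console.log(clusters.length);                 // 15
console.log(clusters[2] === '\u0430\u0301');  // true: "а" plus its accent stay together
```

Where `Intl.Segmenter` is missing, the regex fallback above or a library is still needed.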

score 9 · Answer 2 · answered Oct 04 '16 at 07:17

A little update on this.

Since ES6 arrived, there are new string methods and new ways of dealing with strings. They address two separate problems that come up here.

1) Emoji and surrogate pairs

Emoji and other Unicode characters that fall outside the Basic Multilingual Plane (the BMP covers code points U+0000 to U+FFFF) are stored as surrogate pairs. They can be handled because ES6 strings adhere to the iterator protocol, which iterates by code point, so you can do this:

let textWithEmoji = '\ud83d\udc0e\ud83d\udc71\u2764'; // horse, person with blond hair, and heart
[...textWithEmoji].length // 3
for (const char of textWithEmoji) { console.log(char) } // logs 3 chars
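
As noted in the comments, `Array.from` uses the same string iterator. A short sketch (variable names are illustrative) that also prints each code point, to show why the code unit count and the code point count differ:

```javascript
const text = '\ud83d\udc0e\ud83d\udc71\u2764';

// Array.from consumes String.prototype[Symbol.iterator], like spread and for...of
const codePoints = Array.from(text);

// Format each entry as U+XXXX using codePointAt
const labels = codePoints.map(
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(text.length);       // 5 (UTF-16 code units)
console.log(codePoints.length); // 3 (code points)
console.log(labels);            // [ 'U+1F40E', 'U+1F471', 'U+2764' ]
```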

2) Diacritics

This is a harder problem, because you have to work with "grapheme clusters" (a character and its diacritics). ES6 adds the String.prototype.normalize method, which helps in some cases, but it is not a complete solution. As Mathias Bynens puts it:

(A) code points with multiple combining marks applied to them always result in a single visual glyph, but may not have a normalized form, in which case normalization doesn’t help.
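
A small sketch of that limitation (variable names are illustrative): NFC composes a base letter and mark only when Unicode defines a precomposed character, which is the case for Latin "é" but not for the accented Cyrillic "а" from the question.

```javascript
const latin = 'e\u0301';         // "e" + U+0301; precomposed U+00E9 exists
const cyrillic = '\u0430\u0301'; // Cyrillic "а" + U+0301; no precomposed form

const latinNfc = latin.normalize('NFC');
const cyrillicNfc = cyrillic.normalize('NFC');

console.log(latinNfc.length);    // 1
console.log(cyrillicNfc.length); // 2, normalization cannot merge it
```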

More insight can be found here:

https://ponyfoo.com/articles/es6-strings-and-unicode-in-depth
https://mathiasbynens.be/notes/javascript-unicode

This is the best answer since ES6 came out. Could mention `Array.from` which also uses String iterator for completeness sake. — ygormutti, Jul 10 '18 at 19:07
Now I see this is not exactly what the OP asked for, but perfect for the issue that brought me here (surrogate pairs). The question title needs an improvement. — ygormutti, Jul 10 '18 at 19:14
Great answer for splitting emojis. `"🐎👱❤".length` is 5, but using a spread operator `[..."🐎👱❤"].length` is 3, amazing. – WSBT, Dec 19 '19 at 19:42

score 8 · Answer 3 · answered Aug 19 '16 at 18:53

This package might help you: https://www.npmjs.com/package/runes

const runes = require('runes')

const example = 'Emoji 🤖'
example.split('') // ["E", "m", "o", "j", "i", " ", "�", "�"]
runes(example)    // ["E", "m", "o", "j", "i", " ", "🤖"]

Vitaly Domnikov

Using Grapheme's (see my answer) even emoji's are split/found right. (tested on Firefox and V8) – Clemens Tolboom Feb 28 '22 at 08:23

score 0 · Answer 4 · answered Aug 13 '19 at 20:18

If you're writing an application that consumes chunks of data from a Node.js stream, multi-byte characters can be split across chunk boundaries. You can probably just pipe through utf8-stream to prevent this:

https://github.com/substack/utf8-stream
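
For comparison, here is a sketch of the same idea using the standard `TextDecoder` in streaming mode. This is not the utf8-stream package's API, and the helper name `decodeChunks` is hypothetical. It keeps multi-byte code points intact across chunk boundaries, though a base letter and its combining mark can still land in different chunks:

```javascript
// Hypothetical helper: decode UTF-8 byte chunks without breaking a
// multi-byte character that straddles two chunks.
function decodeChunks(chunks) {
  const decoder = new TextDecoder('utf-8');
  const pieces = [];
  for (const chunk of chunks) {
    // stream: true makes the decoder hold back an incomplete trailing sequence
    pieces.push(decoder.decode(chunk, { stream: true }));
  }
  pieces.push(decoder.decode()); // flush anything still buffered
  return pieces;
}

// Cyrillic "а" is D0 B0 and U+0301 is CC 81 in UTF-8;
// the chunk boundary falls in the middle of U+0301.
const chunks = [new Uint8Array([0xd0, 0xb0, 0xcc]), new Uint8Array([0x81])];
const pieces = decodeChunks(chunks);

console.log(pieces.join('') === '\u0430\u0301'); // true
```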

Zach Bloomquist

score 0 · Answer 5 · answered Feb 24 '22 at 13:47

Using the Unicode properties Grapheme_Base

"Ааа́Ббб́Ввв́ГгҐґДд".match(/\p{Grapheme_Base}/gu)
> ['А', 'а', 'а', 'Б', 'б', 'б', 'В', 'в', 'в', 'Г', 'г', 'Ґ', 'ґ', 'Д', 'д']

and Grapheme_Extend

"Ааа́Ббб́Ввв́ГгҐґДд".match(/\p{Grapheme_Extend}/gu)
> ['́', '́', '́']

combining these into

"Ааа́Ббб́Ввв́ГгҐґДд".match(/\p{Grapheme_Base}\p{Grapheme_Extend}|\p{Grapheme_Base}/gu)
> ['А', 'а', 'а́', 'Б', 'б', 'б́', 'В', 'в', 'в́', 'Г', 'г', 'Ґ', 'ґ', 'Д', 'д']

Clemens Tolboom

this won't catch multiple diacritics. try: "אַּׁ".match(/\p{Grapheme_Base}\p{Grapheme_Extend}*/gu) – o17t H1H' S'k Jan 25 '23 at 13:28
If you do `"אַּׁ".match(/\p{Grapheme_Base}/gu)` apart from `"אַּׁ".match(/\p{Grapheme_Extend}/gu)` you get matches. The latter is ['ׁ', 'ּ', 'ַ'] Not sure why your */star is not working. What language is this? – Clemens Tolboom Jan 27 '23 at 15:26
Hebrew. what is not working? – o17t H1H' S'k Jan 27 '23 at 16:16
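
The `*` variant suggested in the comments can be checked with a short sketch (variable names are illustrative). It matches a base character followed by zero or more extending marks; note that anything that is neither `Grapheme_Base` nor `Grapheme_Extend`, such as a newline, is skipped by this pattern.

```javascript
// Base character followed by zero or more extending marks
const clusterPattern = /\p{Grapheme_Base}\p{Grapheme_Extend}*/gu;

// Hebrew alef (U+05D0) carrying three combining points
const hebrew = '\u05d0\u05b7\u05bc\u05c1';
const hebrewClusters = hebrew.match(clusterPattern);

const cyrillic = 'Ааа\u0301Ббб\u0301Ввв\u0301ГгҐґДд';
const cyrillicClusters = cyrillic.match(clusterPattern);

console.log(hebrewClusters.length);   // 1
console.log(cyrillicClusters.length); // 15
```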

score -1 · Answer 6 · answered May 25 '12 at 17:36 · edited May 26 '12 at 09:26

The problem with your string is combining sequences such as "а" + "\u0301" (these are not surrogate pairs, which are a separate UTF-16 mechanism). They are merged into a single character only when the browser renders them. For your case it is enough to attach \u0301 to the previous character, but this is by no means a general solution.

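var a = "Ааа́Ббб́Ввв́ГгҐґДд",
    i = 0,
    chars = [];

// Walk the string by UTF-16 code unit and glue U+0301 onto the preceding character.
while (i < a.length) {
  if (a.charAt(i + 1) === "\u0301") {
    chars.push(a.charAt(i) + a.charAt(i + 1));
    i += 2;
  } else {
    chars.push(a.charAt(i));
    i += 1;
  }
}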

To clarify the issue, go and read Mathias Bynens's blog post.

Aleš Kotnik

Your code is deeply flawed -- and besides having a bug, `a.fromCharCode(i)`, really? -- it doesn't do composition, so you're back to square 1... – dda May 25 '12 at 18:05
Thanx for the warning. Corrected. – Aleš Kotnik May 25 '12 at 18:23
Doesn't `charCodeAt(index)` work in terms of UTF-16 code units? So this wouldn't work for anything outside the BMP. – bames53 May 25 '12 at 18:30
The question was how to split unicode string to array of single unicode characters and the code does just this. Check the `chars` array. – Aleš Kotnik May 25 '12 at 18:33
`chars` array still returns every separate char and doesn't combine `"а" + "́" === "а́"` – Gapipro May 26 '12 at 07:33
Surrogate pairs are a totally different thing to combining characters. Surrogates are when, in UTF-16, two successive 16-bit values combine to make one 32-bit codepoint. Combining characters are full codepoints which combine with a previous base codepoint to form one user-perceived character called a "grapheme cluster". – hippietrail Jun 07 '12 at 05:43
