
I was trying to match a vector of Japanese strings (originally imported from a comma-separated file) with a list of filenames extracted from a folder under Mac OSX.

One element from the vector is a:

> a
[1] "立ち上げる.mp3"

The corresponding element from the filename is b

> b
[1] "立ち上げる.mp3"

The problem is that they are not logically equal to each other in R:

> a == b
[1] FALSE

I already found out that this problem arises from the two Unicode forms of Japanese "dakuten" characters: in one string the げ (け extended with a dakuten mark) is stored as a single precomposed character, while in the other it is stored in decomposed form, as け followed by a combining dakuten. So the two strings are in fact different byte sequences:

> iconv(a, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0092ã\u0082\u008b.mp3"
> iconv(b, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0091ã\u0082\u0099ã\u0082\u008b.mp3"

> nchar(a)
[1] 9
> nchar(b)
[1] 10
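The difference can be confirmed in base R by inspecting the code points directly. A sketch, where the two string literals below are assumed to reproduce the precomposed and decomposed forms of the same name:

```r
# Precomposed form (NFC): げ is the single code point U+3052
a <- "立ち上げる.mp3"
# Decomposed form (NFD): け (U+3051) followed by the combining dakuten U+3099
b <- "立ち上け\u3099る.mp3"

# utf8ToInt() exposes the underlying code points of each string
sprintf("U+%04X", utf8ToInt(a))  # 9 code points
sprintf("U+%04X", utf8ToInt(b))  # 10 code points, with U+3051 U+3099 where a has U+3052
```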

How do I convert these two versions of the same Japanese characters so that they can be matched validly (i.e. they should be the same) using R?


1 Answer


There is an open-source bridge library to the ICU library, RUnicode. You can normalize your search keys to NFD (the Mac OS X style) when running on Mac OS X.

It also normalizes other Japanese characters, such as full-width and half-width katakana, which may or may not be what you want for your purpose.
