I am trying to match a vector of Japanese strings (originally imported from a comma-separated file) against a list of filenames read from a folder under Mac OS X.
One element from the vector is a:
> a
[1] "立ち上げる.mp3"
The corresponding element from the list of filenames is b:
> b
[1] "立ち上げる.mp3"
The problem is that they are not logically equal to each other in R:
> a == b
[1] FALSE
I already found out that the problem arises from how Japanese "dakuten" characters are encoded (e.g. the げ character, which is け with two added dots). In a it is stored as a single precomposed character, while in b it is stored as け followed by a separate combining mark. So the two strings are in fact different from each other:
> iconv(a, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0092ã\u0082\u008b.mp3"
> iconv(b, "latin1")
[1] "ç«\u008bã\u0081¡ä¸\u008aã\u0081\u0091ã\u0082\u0099ã\u0082\u008b.mp3"
> nchar(a)
[1] 9
> nchar(b)
[1] 10
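To make the difference explicit, here is a minimal sketch in base R that builds both spellings of げ from Unicode escapes and prints their code points. The variable names are made up for illustration and are not the original a and b; the assumption is that the two forms correspond to what is usually called the composed and decomposed Unicode representations.

```r
# Hypothetical names for illustration; these are not the original a and b.
ge_composed   <- "\u3052"        # げ as one precomposed code point
ge_decomposed <- "\u3051\u3099"  # け followed by the combining voiced sound mark

# Show the code points of each spelling in hexadecimal
cp_composed   <- sprintf("U+%04X", utf8ToInt(ge_composed))
cp_decomposed <- sprintf("U+%04X", utf8ToInt(ge_decomposed))
print(cp_composed)
print(cp_decomposed)

# The two spellings look the same on screen but are different strings
print(ge_composed == ge_decomposed)
print(nchar(ge_composed))
print(nchar(ge_decomposed))
```

This matches the byte dumps above: the extra bytes in b are the separate combining mark, which also accounts for the one-character difference in nchar().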
How do I convert these two representations of the same Japanese text to a common form in R so that they compare as equal?