
I have two virtually equivalent strings. They look the same.

str1<-"Diş Hekimliği Fakültesi"
str2<-"Diş Hekimliği Fakültesi"

But nchar() returns 26 and 23 characters for them, respectively. And when I use strsplit():

strsplit(str1,split="")
[[1]]
 [1] "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"

strsplit(str2,split="")
[[1]]
 [1] "D" "i" "ş" " " "H" "e" "k" "i" "m" "l" "i" "ğ" "i" " " "F" "a" "k" "ü" "l" "t" "e" "s" "i"

Each language-specific special character in str1 is counted as two characters. How can I turn str1 into str2? My only solution so far is a manual one using gsub().

P.S. Unfortunately I cannot reproduce this example here in full. When you copy and paste the code, both strings come out as 23 characters; something in the copy-pasting here changes them.

berkorbay
  • Don't copy the console output, rather show us `dput(str1)`, `encoding(str1)`, `encoding(str2)`, and `dput(str2)`. – IRTFM Apr 04 '15 at 17:50
  • Unfortunately dputs give the same output. it is only visible when I do a strsplit or copy paste it to a text editor. I searched much about the encoding, ascii and stuff without any result. – berkorbay Apr 04 '15 at 17:53
  • Are they in the same encoding? I should have asked for `Encoding(str1)` and `Encoding(str2)` – IRTFM Apr 04 '15 at 17:55
  • I think I managed to create a MWE. Can you try it? (it also has one extra space character but it is still MWE) https://drive.google.com/open?id=0B162Fdn67bgVRlVfdFdUUm9XeUE&authuser=0 – berkorbay Apr 04 '15 at 18:00
  • And yes encodings are the same too. Both UTF-8. – berkorbay Apr 04 '15 at 18:01
  • Added the `iconv` tag to assist people who might be searching this topic. – IRTFM Apr 05 '15 at 00:08

1 Answer


The iconv function is a system-specific function that converts between international encodings. The iconvlist function returns a vector of the encoding names that your OS facility supports. I ran through all 419 such encodings on my system, with the help of sapply and try, to see whether any of them would convert str1 (23 characters) to 26 characters or vice versa, and found two. Since I use a Mac and you don't say which OS you are on, I cannot give any assurance that these particular values will work for you.

I was able to put together an MWE using just the 26-element strsplit() output shown in your question:

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
 "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"

After many error messages (which do not stop execution because of the enclosing try()), I got a list of 2 encodings using this code:

?iconv
which(sapply( try(utils::head(iconvlist(), n = 419)), function(xc) 
                                                  try(nchar(iconv(str1, to=xc))))==26)
#--------snipped large number of error messages-------
Error in nchar(iconv(str1, to = xc)) : invalid multibyte string 1
UTF-8-MAC  UTF8-MAC 
      400       402 
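The same search can be written so that failed conversions become NA instead of printing errors. This is only a sketch of the idea above, not the code I actually ran: `target`, `chars_after`, `encs`, `lens` and `hits` are names made up for illustration, and the set of encodings (and therefore the result) depends entirely on your platform's iconv.

```r
# The 23-character (precomposed) string, written with escapes so that
# copy-pasting cannot change its form.
target <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

# Character count after converting to one encoding, or NA if that fails.
chars_after <- function(enc) {
  tryCatch(
    suppressWarnings(nchar(iconv(target, from = "UTF-8", to = enc))),
    error = function(e) NA_integer_
  )
}

# iconvlist() is not available on every platform; fall back to an empty list.
encs <- tryCatch(iconvlist(), error = function(e) character(0))
lens <- vapply(encs, chars_after, integer(1), USE.NAMES = FALSE)

# Encodings under which the string becomes 26 characters long.
hits <- encs[!is.na(lens) & lens == 26L]
hits
```

On systems without the Mac-specific encodings, `hits` may well be empty.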

Then thinking that the reverse might succeed (since str1 started as a 23-char object) I successfully tried:

> iconv(str3c,from="UTF-8-MAC", to="UTF-8")
[1] "Diş Hekimliği Fakültesi"
> nchar(iconv(str3c,from="UTF-8-MAC", to="UTF-8"))
[1] 23

Looking at the webpages for the Windows iconv, I see that there is a listing for {10081, "x-mac-turkish"}, /* Turkish (Mac) */. If you are on Windows, perhaps that may be tried.
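A route that does not depend on the platform's iconv is Unicode normalization to the composed form (NFC). This is a different technique from the iconv conversion above, and it assumes the stringi package is installed; `stringi::stri_trans_nfc` is the only non-base call used, and `decomposed` and `composed` are illustrative names. A minimal sketch:

```r
# The 26-character (decomposed) form, built from explicit escapes:
# base letter followed by a combining mark.
decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"

if (requireNamespace("stringi", quietly = TRUE)) {
  # NFC merges each base letter + combining mark into one precomposed character.
  composed <- stringi::stri_trans_nfc(decomposed)
  print(nchar(decomposed))  # 26
  print(nchar(composed))    # 23
} else {
  message("stringi is not installed; skipping the normalization sketch")
}
```

Unlike the iconv call, this does not require knowing a source encoding name such as UTF-8-MAC; both strings are already UTF-8 and only the composition differs.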

================

Earlier investigations below (I think it is useful to know how to pull apart character values.)

OK. I can actually put together an MWE with just your stuff above:

str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
#1: "D" "i" "s" "̧"   " " "H" "e" "k" "i" "m" "l" "i" "g" "̆"   "i" " " "F" "a" "k" "u" "̈"   "l" "t" "e" "s" "i"
#27: 
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"

Now to do some character hacking:

> ?charToRaw
> charToRaw(str3c)
 [1] 44 69 73 cc a7 20 48 65 6b 69 6d 6c 69 67 cc 86 69 20 46 61 6b 75 cc 88 6c 74 65
[28] 73 69
> charToRaw(str1)
 [1] 44 69 c5 9f 20 48 65 6b 69 6d 6c 69 c4 9f 69 20 46 61 6b c3 bc 6c 74 65 73 69

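So look at the three raw bytes that represent your third letter. The 26-character version stores a plain base character "s" (73) followed by the two bytes cc a7, which are the UTF-8 encoding of the combining cedilla (U+0327), a mark that is drawn on the preceding letter. The 23-character version stores the single precomposed character "ş" as c5 9f. Now see if we can recognize the decomposed form with regex: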

 rawToChar( charToRaw(str3c) [3])
#[1] "s"
 rawToChar( charToRaw(str3c) [4])
#[1] "\xcc"
 rawToChar( charToRaw(str3c) [5])
#[1] "\xa7"
 grep("s\\xcc\\xa7", str3c)
#[1] 1   # Success!
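The same bytes can be read as Unicode code points rather than raw hex, which makes the structure easier to see. A small base-R sketch (the helper name `code_points` is made up for illustration):

```r
# Format each code point of a string as U+XXXX.
code_points <- function(x) sprintf("U+%04X", utf8ToInt(x))

# Decomposed: base letter "s" followed by COMBINING CEDILLA.
code_points("s\u0327")
# [1] "U+0073" "U+0327"

# Precomposed: a single code point for the same visible letter.
code_points("\u015f")
# [1] "U+015F"
```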

And here's a gsub that I think is probably more efficient than what you ended up with if you were working with the split-versions of those words:

gsub("s\u0327", "\u015f", str3c, fixed = TRUE)  # replace s + combining cedilla with the single character U+015F
#[1] "Diş Hekimliği Fakültesi"
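Extending that to the other letters is a matter of looping over a small lookup table. The following is a sketch under the assumption that only these five lowercase pairs matter; `compose_pairs` is a hypothetical helper, not a function from any package, and uppercase letters would need their own entries.

```r
# Replace known base-letter + combining-mark pairs with precomposed letters.
compose_pairs <- function(x) {
  from <- c("s\u0327", "g\u0306", "u\u0308", "c\u0327", "o\u0308")
  to   <- c("\u015f",  "\u011f",  "\u00fc",  "\u00e7",  "\u00f6")
  for (i in seq_along(from)) {
    # fixed = TRUE: match the literal sequence, no regex interpretation
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

decomposed <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
fixed_up <- compose_pairs(decomposed)
nchar(fixed_up)
# [1] 23
```

A table like this has to be maintained by hand, so it is only practical when the set of affected letters is small and known in advance.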

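Also note that there were actually 29 raw bytes in the string that R reports as 26 "characters" (and 26 bytes in the one reported as 23). nchar() counts characters by default, not bytes: in the longer string each of the three combining marks (the cc xx pairs) is one character stored in two bytes, and in the shorter string each of the three accented letters is likewise one character stored in two bytes.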
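The two counts can be requested directly through the `type` argument of nchar(). A short base-R check, with both strings rebuilt from escapes so that copy-pasting cannot alter them (`decomposed` and `precomposed` are illustrative names):

```r
decomposed  <- "Dis\u0327 Hekimlig\u0306i Faku\u0308ltesi"
precomposed <- "Di\u015f Hekimli\u011fi Fak\u00fcltesi"

nchar(decomposed,  type = "chars")  # 26: 23 letters and spaces + 3 combining marks
nchar(decomposed,  type = "bytes")  # 29: each combining mark takes 2 bytes
nchar(precomposed, type = "chars")  # 23
nchar(precomposed, type = "bytes")  # 26: each accented letter takes 2 bytes
```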

IRTFM
  • Sorry but I already indicated I can already do that (last sentence before ps) :) My question was more towards a general solution in the future. The separation should be due to a reason and perhaps there is a package that handles these kind of inconveniences. – berkorbay Apr 04 '15 at 18:52
  • @berkorbay: I think `iconv` holds promise as the more general solution. – IRTFM Apr 04 '15 at 21:34
  • Tried that also. If I'm not wrong you need to know the exact source encoding to convert to the encoding of your desire. I also tried a package called `tau` to estimate the encoding to no avail. – berkorbay Apr 07 '15 at 09:31