The iconv function is system-dependent: it converts character values between encodings using whatever conversion facility your OS provides. There is a companion function, iconvlist, that returns a vector of the encoding names your OS facility recognizes. I ran through all 419 such encodings on my system with the help of sapply and try to see whether converting str1 (23 characters) yielded a 26-character result, or vice versa, and found two such encodings on my machine. Since I use a Mac, I cannot give any assurance that these particular names will work for you, especially since you don't say what OS you are on:
I was able to put together an MWE with just the output of your strsplit result from str2 above:
str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
"D" "i" "s" "̧" " " "H" "e" "k" "i" "m" "l" "i" "g" "̆" "i" " " "F" "a" "k" "u" "̈" "l" "t" "e" "s" "i"
#27:
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"
After many error messages (which do not stop execution because of the enclosing try()), I got a list of 2 encodings using this code:
?iconv
which( sapply( try(utils::head(iconvlist(), n = 419)),
               function(xc) try(nchar(iconv(str1, to = xc))) ) == 26 )
#--------snipped large number of error messages-------
Error in nchar(iconv(str1, to = xc)) : invalid multibyte string 1
UTF-8-MAC UTF8-MAC
400 402
Then, thinking that the reverse might succeed (since str1 started as a 23-character object), I successfully tried:
> iconv(str3c,from="UTF-8-MAC", to="UTF-8")
[1] "Diş Hekimliği Fakültesi"
> nchar(iconv(str3c,from="UTF-8-MAC", to="UTF-8"))
[1] 23
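If you want that conversion in reusable form, a small helper along these lines should work; the name compose_mac_utf8 is mine, and "UTF-8-MAC" is the encoding name found above, which may not exist on other platforms:
# Hedged helper: convert decomposed (Mac-style) UTF-8 to ordinary
# composed UTF-8, returning the input unchanged if the encoding name
# is not available or the conversion fails.
compose_mac_utf8 <- function(x) {
  out <- try(iconv(x, from = "UTF-8-MAC", to = "UTF-8"), silent = TRUE)
  if (inherits(out, "try-error") || anyNA(out)) x else out
}
nchar(compose_mac_utf8(str3c))  # 23 on my machine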
Looking at the webpages for the Windows iconv, I see that there is a listing for {10081, "x-mac-turkish"}, /* Turkish (Mac) */. If you are on Windows, perhaps that name could be tried.
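To see what your own platform actually offers, something like this lists any Mac- or Turkish-related entries in iconvlist(); treat it as a sketch, since the names returned differ from OS to OS:
# Which encoding names on this platform mention "mac" or "turk"?
grep("mac|turk", iconvlist(), ignore.case = TRUE, value = TRUE)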
================
Earlier investigations are below. (I think it is useful to know how to pull apart character values.)
OK. I can actually put together an MWE with just your stuff above:
str1<-"Diş Hekimliği Fakültesi"
str3 <- scan(what="")
#1: "D" "i" "s" "̧" " " "H" "e" "k" "i" "m" "l" "i" "g" "̆" "i" " " "F" "a" "k" "u" "̈" "l" "t" "e" "s" "i"
#27:
#Read 26 items
> str3c <- paste0(str3, collapse="")
> nchar(str3c)
[1] 26
> str1
[1] "Diş Hekimliği Fakültesi"
Now to do some character hacking:
> ?charToRaw
> charToRaw(str3c)
[1] 44 69 73 cc a7 20 48 65 6b 69 6d 6c 69 67 cc 86 69 20 46 61 6b 75 cc 88 6c 74 65
[28] 73 69
> charToRaw(str1)
[1] 44 69 c5 9f 20 48 65 6b 69 6d 6c 69 c4 9f 69 20 46 61 6b c3 bc 6c 74 65 73 69
So look at the three raw bytes that represent your third letter in str3c. That string stores the letter in decomposed form: a plain "s" (73) followed by a combining cedilla (U+0327), which UTF-8 encodes as the two bytes cc a7. In str1 the same letter is the single precomposed character "ş" (U+015F), encoded as c5 9f. Now see if we can recognize the decomposed form with a regex:
rawToChar( charToRaw(str3c) [3])
#[1] "s"
rawToChar( charToRaw(str3c) [4])
#[1] "\xcc"
rawToChar( charToRaw(str3c) [5])
#[1] "\xa7"
grep("s\\xcc\\xa7", str3c)
#[1] 1 # Success!
And here's a gsub that I think is probably more efficient than what you ended up with, if you were working with the split versions of those words:
gsub("s\\xcc\\xa7", "\xc5\x9f", str3c)
#[1] "Diş Hekimliği Fakültesi"
Also note that there were actually 29 raw bytes in the string R told you had 26 "characters" (and 26 bytes in the one that reported 23 characters). nchar counts characters (code points), not bytes, so each two-byte cc sequence (a combining mark) counts as just one character, and each precomposed letter in str1 likewise counts as one character even though it occupies two bytes.
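If you want to see that distinction directly, nchar() will count either way; a quick check, assuming str3c and str1 as built above:
# nchar() counts characters (code points) by default;
# type = "bytes" counts the raw UTF-8 bytes instead.
nchar(str3c, type = "chars")  # 26
nchar(str3c, type = "bytes")  # 29
nchar(str1,  type = "chars")  # 23
nchar(str1,  type = "bytes")  # 26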