2

I need to join two tables on the colum name. They come from different excel spreadsheets which I didn't create. I imported both sheets into R tables using the function read.xls from the gdata package.

Sometimes the names contains umlauts and other accents. They appear identical within the excel sheets but when I import them into R they are not the same. Hence my join doesn't join. I join using the sqldf function.

As an example : I see Lück in the two spreadsheets. In table1 of R this appears as L\374ck whereas in table2 it appears as L\303\274ck.

How can I best solve this problem? Is there a way to force an encoding when data is imported? Or should one try to force the comparison of strings in a different way?

Geoff
  • 925
  • 4
  • 14
  • 36

2 Answers2

3

Since you stated you are not geting a TRUE on agstudy's code, I'm assuming you're using a Mac or a Linux Machine.

Try instead

x1=  "L\374ck"
x2 = "L\303\274ck"
iconv(x1,"latin1","UTF-8")
[1] "Lück"
identical(iconv(x1,"latin1","UTF-8"),x2)
[1] TRUE

Tested this on a Linux CentOS 7. I believe it should work on Macs too. If not look into

?iconv

for more details on encoding.

Serban Tanasa
  • 3,592
  • 2
  • 23
  • 45
  • Your assumption is correct. I didn't realise that information was relevant. Well deduced! – Geoff Dec 28 '14 at 18:38
1

It is an encoding problem. For example :

 x1=  "L\374ck"
 x2 = "L\303\274ck"
 identical(iconv(x2,"UTF-8","UTF-8"),x1)
 [1] "TRUE"

So you can try this for example before applying merge :

 df2name <- iconv(df2$name,'UTF-8','UTF-8')
agstudy
  • 119,832
  • 17
  • 199
  • 261