
I have a database containing the names of Premiership footballers which I am reading into R (3.0.2), but I am encountering difficulties with players whose names contain foreign characters (umlauts, accents, etc.). The code below illustrates this:

PlayerData<-read.table("C:\\Users\\Documents\\Players.csv",quote=NULL, dec = ".", sep=",", stringsAsFactors=FALSE, header=TRUE, fill=TRUE, blank.lines.skip = TRUE)
Test<-PlayerData[c(33655:33656),] #names of the players here are "Cazorla" "Özil"

Test[Test$Player=="Cazorla",] #Outputs correct details
Test[Test$Player=="Ozil",] # Can not find data '0 rows> (or 0-length row.names)'
<

#Example of how the foreign character is treated:
substr("Özil",1,1)
[1] "Ã"
substr("Özil",1,2)
[1] "Ö"
substr("Özil",2,2)
[1] "
substr("Özil",2,3)
[1] "z

I have tried replacing the characters, as described here: R: Replacing foreign characters in a string, but as the accented characters in my example appear to be read as two separate characters, I do not think that approach works.
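For what it's worth, the sketch below (my own attempt to reproduce the mechanism, not taken from the file itself) gives the same output, which suggests the UTF-8 bytes of "Ö" (0xC3 0x96) are being treated as two separate latin1 characters:

# Rebuild "Özil" from its raw UTF-8 bytes, then mislabel them as latin1
x <- rawToChar(as.raw(c(0xc3, 0x96, 0x7a, 0x69, 0x6c)))
Encoding(x) <- "latin1"
nchar(x)         # 5 characters rather than 4 -- "Ö" counts as two
substr(x, 1, 1)  # [1] "Ã"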

I would be grateful for any suggestions or workarounds.

The file is available for download here.

Pash101
  • Could you put the two lines of your CSV somewhere on the net? Maybe `iconv` can help. – Karsten W. Apr 18 '14 at 11:23
  • This requires a much longer answer (most of which is outside my expertise), but try converting everything to UTF-8: `Test$Player <- iconv(Test$Player, to='UTF-8')`. See if the indexing works as expected. If you don't force an encoding, character strings will be interpreted depending on your system locale (the examples you provide worked as expected on my system). – ilir Apr 18 '14 at 11:24

2 Answers

EDIT: It seems that the file you provided uses a different encoding than your system's native one.

An (experimental) encoding detection done by the stri_enc_detect function from the stringi package gives:

library('stringi')
PlayerDataRaw <- stri_read_raw('~/Desktop/PLAYERS.csv')
stri_enc_detect(PlayerDataRaw)
## [[1]]
## [[1]]$Encoding
## [1] "ISO-8859-1" "ISO-8859-2" "ISO-8859-9" "IBM424_rtl"
## 
## [[1]]$Language
## [1] "en" "ro" "tr" "he"
## 
## [[1]]$Confidence
## [1] 0.25 0.14 0.09 0.02

So most likely the file is in ISO-8859-1, a.k.a. latin1. Luckily, R does not have to re-encode the input while reading this file; it can simply mark the strings with a non-default (i.e. non-native) encoding. You can load the file with:

PlayerData<-read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec = ".", sep=",", 
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, encoding='latin1')
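
For completeness, base R's fileEncoding argument re-encodes the input to your native encoding while reading, rather than merely marking it; a variant (an untested sketch, but it should behave equivalently here):

PlayerData <- read.table('~/Desktop/PLAYERS.csv',
    quote=NULL, dec=".", sep=",",
    stringsAsFactors=FALSE, header=TRUE, fill=TRUE,
    blank.lines.skip=TRUE, fileEncoding='latin1')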

Now you may access individual characters correctly, e.g. with the stri_sub function:

Test<-PlayerData[c(33655:33656),]
Test
##           T          Away H.A    Home  Player Year
## 33655 33654 CrystalPalace   1 Arsenal Cazorla 2013
## 33656 33655 CrystalPalace   1 Arsenal    Özil 2013

stri_sub(Test$Player, 1, length=1)
## [1] "C" "Ö"
stri_sub(Test$Player, 2, length=1)
## [1] "a" "z"

As for comparing strings, here is the result of an equality test with the accented characters "flattened" (collation strength 1 ignores accent and case differences):

stri_cmp_eq("Özil", "Ozil", stri_opts_collator(strength=1))
## [1] TRUE
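
Applied to your original lookup, the same collator options should make the accent-insensitive match find the row (a sketch along the lines of the call above, assuming the same stringi version):

# Accent-insensitive row lookup using the collator options shown above
Test[stri_cmp_eq(Test$Player, "Ozil", stri_opts_collator(strength=1)), ]
## expected: the row for Özil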

You may also get rid of accented characters by using iconv's transliterator (though I am not sure whether it is available on Windows).

iconv(Test$Player, 'latin1', 'ASCII//TRANSLIT')
## [1] "Cazorla" "Ozil"

Or with a very powerful transliterator from the stringi package (stringi version >= 0.2-2):

stri_trans_general(Test$Player, 'Latin-ASCII')
## [1] "Cazorla" "Ozil"
gagolews

Thank you all for your help with this.

The strings had been correctly encoded as UTF-8 (I added the encoding argument to read.table and also used iconv, as suggested), so that did not seem to be the issue.

I also tried the stri_sub() function, but this did not seem to work either (it also treated the accent as a separate character: stri_sub("Özil",1,3) = "Ã<U+0096>z").

However, thank you for pointing me in the direction of the stringi documentation, it gave me the idea for a workaround which I am happy to use:

remove.accents <- function(s) {
  oldrefs <- c(214, 225)  # Ö, á
  newrefs <- c(79, 97)    # O, a

  codepoints <- utf8ToInt(s)  # one integer code point per character
  for (i in seq_along(oldrefs)) {
    # replace matching code points directly; gsub() on the numbers
    # could corrupt unrelated code points sharing the same digits
    codepoints[codepoints == oldrefs[i]] <- newrefs[i]
  }
  intToUtf8(codepoints)
}
> (remove.accents("Özil"))
[1] "Ozil"
> (remove.accents("Suárez"))
[1] "Suarez"

I can now populate the oldrefs/newrefs arrays with the Int references for the other characters I will need for certain players (Touré, Jääskeläinen, Agüero, etc.), which hopefully should not take too long!
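Looking up those integer references is straightforward with utf8ToInt itself (a quick sketch):

utf8ToInt("é")  # 233
utf8ToInt("ä")  # 228
utf8ToInt("ü")  # 252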

Pash101
  • Thanks for pointing out this interesting string processing task, I'll soon start working on including automatic transliteration mechanisms to stringi, see [this issue](https://github.com/Rexamine/stringi/issues/72) – gagolews Apr 19 '14 at 08:21
  • If `stri_sub` did not work correctly, I'm sure your data have not been read properly. What is the result of calling `Encoding(Test$Player)`? – gagolews Apr 19 '14 at 08:25
  • I've included the `encoding='UTF-8'` argument to `read.table` prior to importing. `Encoding(Test$Player)` now gives me this output: `"unknown" "UTF-8" ` (unknown in this case is Cazorla; UTF-8, the second player, is Özil). Additionally, passing the UTF-8 argument, means that Özil now appears as `zil` – Pash101 Apr 19 '14 at 10:40
  • Hmm... can you make this file available publicly somewhere and provide a link to it at SO? – gagolews Apr 19 '14 at 18:58
  • See my updated answer for a (hopefully) complete fix. – gagolews Apr 21 '14 at 09:57