
After I scraped a list of names, I have the following name in R:

DAPHN\303\211 DE MEULEMEESTER

If I use the function tolower, all the ASCII letters are converted to lowercase, but the special (non-ASCII) characters are left unchanged. What is the best way to lowercase those as well?

Avinash Raj
Kasper Van Lombeek

3 Answers


The reason is that your locale is C. Non-ASCII special characters and their letter-case classifications are not recognized under that locale. You should be able to get it to work by switching to a UTF-8 locale:

Sys.setlocale(locale='C');
## [1] "C/C/C/C/C/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphn\303\211 de meulemeester"
Sys.setlocale(locale='en_CA.UTF-8');
## [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphné de meulemeester"

en_CA.UTF-8 makes sense for me because I'm in Canada, but if you're in the United States (for example) you'll probably want en_US.UTF-8. In general, you should be able to swap in the language and country codes for your location, provided that locale is actually installed on your system (on Linux or macOS, running locale -a in a terminal lists the available ones).
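If you would rather not change the locale for the whole session, one option is to switch only the LC_CTYPE category (the one that controls character classification and case conversion) and restore it afterwards. The following is a minimal sketch, not a definitive implementation: with_ctype is a hypothetical helper name, and the locale string you pass must exist on your system.

```r
# Hypothetical helper: evaluate `expr` with LC_CTYPE temporarily switched,
# then restore the previous setting even if an error occurs.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the request cannot be honored
  res <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(res, "")) {
    stop("locale not available on this system: ", locale)
  }
  # `expr` is a lazily evaluated argument, so it runs under the new locale
  force(expr)
}

# "C" is always available; replace it with e.g. "en_US.UTF-8" (an assumption
# about your system) to get case conversion for non-ASCII letters.
lowered <- with_ctype("C", tolower("DE MEULEMEESTER"))
```

Usage would look like with_ctype("en_US.UTF-8", tolower(x)); the session locale is unchanged once the call returns.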

bgoldst
  • Thanks, this works. But if I upload it to SQL now, some characters become even weirder, like a copyright symbol. Do I have to change a setting in SQL too? – Kasper Van Lombeek Apr 17 '15 at 07:07
  • What's your DBMS? How are you uploading? How are you viewing the text after the upload? – bgoldst Apr 17 '15 at 07:12
  • I am starting to figure it out: I changed my locale from "C" to "UTF-8", and I can now use regex classes such as [[:upper:]], which also work on special characters. I changed the encoding in my MySQL database to UTF-8, and now the characters are stored correctly in the database. – Kasper Van Lombeek Apr 18 '15 at 10:37

Without changing your system locale, you can do locale-aware text transformation using the stringi package:

library(stringi)
her_name <- "DAPHN\303\211 DE MEULEMEESTER"
stri_trans_tolower(her_name, locale="en_CA")
drammock

My question was redirected here because it describes a similar problem. You can also solve it by replacing the problematic character with one that tolower handles correctly.

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-tolower(x)
x
[1] "sn. İletİşİm bİlgİlerİnİz guncellenmistir."

I am also adding the output as a picture, because it may not look the same on every computer.

[Image: tolower_my_computer]

Expected output:

[Image: expected output]

When I tried @drammock's suggestion, I got this:

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-stri_trans_tolower(x, locale="tr_TR")
x
[1] "sn. iletişim bilgileriniz guncellenmıstır."

Again, I added the output of @drammock's suggestion as a picture. The yellow areas in the picture are not the expected output.

[Image: Stringi_output]
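For context, the difference is likely a consequence of Turkish casing rules rather than a fault in stringi: with locale="tr_TR", the ASCII letter I lowercases to the dotless ı (U+0131), while İ (U+0130) lowercases to i. Because the input spells GUNCELLENMISTIR with ASCII I, the Turkish locale produces ı there. A small sketch that isolates the behavior, assuming the stringi package is installed:

```r
library(stringi)

# ASCII capital I under Turkish rules becomes dotless i (U+0131)
tr_from_ascii <- stri_trans_tolower("I", locale = "tr_TR")

# The same letter under English rules becomes plain "i"
en_from_ascii <- stri_trans_tolower("I", locale = "en_US")

# Capital I with dot above (U+0130) under Turkish rules becomes plain "i"
tr_from_dotted <- stri_trans_tolower("\u0130", locale = "tr_TR")

print(c(tr_from_ascii, en_from_ascii, tr_from_dotted))
```

So neither locale alone gives the expected output for a string that mixes İ with ASCII I standing in for Turkish letters.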

In the end, I looked up the Unicode code point of the character that tolower() could not convert (İ, U+0130) and replaced it with a character that tolower() handles correctly (ASCII I). Then I called tolower() again and got the expected output. Thank you to everyone.

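x <- c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
# Replace İ (U+0130) with ASCII I, which tolower() can convert
x <- gsub("\u0130", "I", x, useBytes = FALSE)
# Lowercase the remaining letters as usual
x <- tolower(x)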
x
[1] "sn. iletişim bilgileriniz guncellenmistir."

expected_output

NCC1701