After I scraped a list of names, I have the following name in R:
DAPHN\303\211 DE MEULEMEESTER
If I use the function tolower, the ASCII letters are converted to lowercase, but the special characters are not. What is the best way to lowercase those as well?
The reason is that your locale is C. Non-ASCII special characters and their letter-case classifications are not recognized under that locale. You should be able to get it to work by switching to a UTF-8 locale:
Sys.setlocale(locale='C');
## [1] "C/C/C/C/C/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphn\303\211 de meulemeester"
Sys.setlocale(locale='en_CA.UTF-8');
## [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphné de meulemeester"
en_CA.UTF-8 makes sense for me because I'm in Canada, but if you're in the United States (for example) you'll probably want en_US.UTF-8. I think for any country you should be able to replace the CA/US with your two-letter country code to get the most appropriate locale for your location.
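To check whether the locale really is the cause, and to switch it only for one call rather than for the whole session, something like the following might help. This is a minimal sketch: with_ctype is a hypothetical helper (not part of base R), and "en_US.UTF-8" is an assumed locale name that may not exist on your system (Windows in particular uses different names).

```r
# Hypothetical helper: evaluate `expr` with LC_CTYPE temporarily set to
# `locale`, then restore the previous setting even if `expr` fails.
with_ctype <- function(locale, expr) {
  old <- Sys.getlocale("LC_CTYPE")
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) if the request cannot be honored.
  if (identical(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)), "")) {
    stop("locale not available on this system: ", locale)
  }
  force(expr)  # the lazy argument is evaluated here, under the new locale
}

# A value of "C" here would explain the behaviour described in the question.
before <- Sys.getlocale("LC_CTYPE")
print(before)

# Assumption: "en_US.UTF-8" is installed; substitute a locale your system has.
res <- tryCatch(
  with_ctype("en_US.UTF-8", tolower("DAPHN\303\211 DE MEULEMEESTER")),
  error = function(e) conditionMessage(e)
)
print(res)
```

Only LC_CTYPE is changed, since that is the category that governs character classification and case conversion; sorting, dates and number formatting keep their current settings.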
Without changing your system locale, you can do locale-aware text transformation using the stringi package:
library(stringi)
her_name <- "DAPHN\303\211 DE MEULEMEESTER"
stri_trans_tolower(her_name, locale="en_CA")
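If the scraped strings have no declared encoding, it may be safer to mark them before calling stringi, so the result does not depend on the session's native encoding. The sketch below assumes the scraped bytes are UTF-8 (which \303\211 suggests, since that is the UTF-8 byte sequence for É); the variable name lowered is only for illustration.

```r
library(stringi)

# Same bytes as in the question: "\303\211" is the UTF-8 encoding of U+00C9.
her_name <- "DAPHN\303\211 DE MEULEMEESTER"

# Assumption: the scraped bytes are UTF-8. Declaring that explicitly keeps
# the conversion independent of the current locale.
Encoding(her_name) <- "UTF-8"

lowered <- stri_trans_tolower(her_name, locale = "en_CA")
print(lowered)
# [1] "daphné de meulemeester"
```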
My question was moved here because it describes a similar problem. You can also solve it by first converting the problematic character to one that tolower handles.
x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-tolower(x)
x
[1] "sn. İletİşİm bİlgİlerİnİz guncellenmistir."
The output may not display the same on every computer (the original post showed it as a picture). Every İ is left in upper case, whereas the expected output has each İ lowercased to i.
When I tried @drammock's suggestion, I got this:
x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-stri_trans_tolower(x, locale="tr_TR")
x
[1] "sn. iletişim bilgileriniz guncellenmıstır."
With @drammock's suggestion the İ characters are lowercased correctly, but under the Turkish locale the plain I in the last word becomes a dotless ı (guncellenmıstır), which is not the expected output.
In the end, I looked up the Unicode code point of the character that tolower() could not convert (U+0130, İ) and replaced it with a character that tolower() handles without trouble (a plain I). Then I called tolower() again and got the expected output. Thank you to everyone.
x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-gsub("\u0130","I",x,useBytes = FALSE)
x<-tolower(x)
x
[1] "sn. iletişim bilgileriniz guncellenmistir."
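The same idea can be combined with stringi so that it does not rely on the session locale. This is only a sketch: tolower_dotted is a hypothetical helper name, and choosing "en" as the locale is an assumption made so that a plain I lowercases to i rather than to the Turkish dotless ı.

```r
library(stringi)

# Hypothetical helper: lowercase text that mixes Turkish dotted capitals
# (U+0130) with ASCII "I" that should become a dotted "i".
tolower_dotted <- function(x) {
  x <- stri_replace_all_fixed(x, "\u0130", "I")  # same replacement as the gsub() call above
  stri_trans_tolower(x, locale = "en")           # non-Turkish rules: I -> i
}

# "\u0130" is the dotted capital I and "\u015e" is S with cedilla,
# written as escapes so the script does not depend on the file encoding.
x <- "\u0130LET\u0130\u015e\u0130M GUNCELLENMISTIR"
print(tolower_dotted(x))
# [1] "iletişim guncellenmistir"
```

Note that this maps every I to i, so it is not suitable for text where a capital I should correctly become a dotless ı.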