Decapitalize UTF-8 special characters in R

Question

After I scraped a list of names, I have the following name in R:

DAPHN\303\211 DE MEULEMEESTER

If I use the function tolower, all the letters are set to lowercase, but not the special characters. What is the best way to achieve this?

It works: `tolower("DAPHN\303\211 DE MEULEMEESTER")` displays `"daphné de meulemeester"` – Avinash Raj, Apr 17 '15 at 06:52
This may be locale related. What do you get from `Sys.getlocale();`? – bgoldst, Apr 17 '15 at 06:53
I get the standard "C". Should I set this to something else? – Kasper Van Lombeek, Apr 17 '15 at 06:56

score 2 · Accepted Answer · answered Apr 17 '15 at 07:01

The reason is that your locale is C. Non-ASCII special characters and their letter-case classifications are not recognized under that locale. You should be able to get it to work by switching to a UTF-8 locale:

Sys.setlocale(locale='C');
## [1] "C/C/C/C/C/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphn\303\211 de meulemeester"
Sys.setlocale(locale='en_CA.UTF-8');
## [1] "en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.utf-8"
tolower('DAPHN\303\211 DE MEULEMEESTER');
## [1] "daphné de meulemeester"

en_CA.UTF-8 makes sense for me because I'm in Canada, but if you're in the United States (for example) you'll probably want en_US.UTF-8. In general you should be able to replace the CA/US with your two-letter country code (and, for a non-English locale, the en with your language code) to get the most appropriate locale for your location, provided that locale is installed on your system.
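
If changing the locale for the whole session feels heavy-handed, a narrower variant is to switch only the LC_CTYPE category (the one that governs character classification and case conversion) and restore it afterwards. The following is a minimal sketch, not part of the original answer: `tolower_in_locale` is a hypothetical helper name, and it assumes the requested locale (here en_US.UTF-8) is installed on your system.

```r
# Hypothetical helper: lower-case `x` under a temporary LC_CTYPE locale.
tolower_in_locale <- function(x, locale = "en_US.UTF-8") {
  old <- Sys.getlocale("LC_CTYPE")
  # Restore the previous setting when the function exits, even on error.
  on.exit(Sys.setlocale("LC_CTYPE", old), add = TRUE)
  # Sys.setlocale() returns "" (and warns) if the locale cannot be set.
  new <- suppressWarnings(Sys.setlocale("LC_CTYPE", locale))
  if (identical(new, "")) {
    warning("locale '", locale, "' is not available; LC_CTYPE left unchanged")
  }
  tolower(x)
}

tolower_in_locale("DAPHN\303\211 DE MEULEMEESTER")
```

If the locale is not available, the helper warns and falls back to whatever LC_CTYPE was already active, so the result may be the same as a plain `tolower()` call.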

bgoldst

Thanks, this works. But when I upload it to SQL now, some characters become even weirder, like a copyright symbol. Do I have to change a setting in SQL too? – Kasper Van Lombeek Apr 17 '15 at 07:07
What's your DBMS? How are you uploading? How are you viewing the text after the upload? – bgoldst Apr 17 '15 at 07:12
I am starting to figure it out: I changed my locale from "C" to "UTF-8", and I can now use regex classes such as [[:upper:]], which also work on special characters. I changed the encoding in my MySQL database to UTF-8, and now the characters are stored correctly in the database. – Kasper Van Lombeek Apr 18 '15 at 10:37

score 2 · Answer 2 · answered Apr 17 '15 at 07:13

Without changing your system locale, you can do locale-aware text transformation using the stringi package:

library(stringi)
her_name <- "DAPHN\303\211 DE MEULEMEESTER"
stri_trans_tolower(her_name, locale="en_CA")
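
One caveat, offered as a hedged note rather than as part of the original answer: in a C locale R does not know that the bytes in the literal are UTF-8, so it may help to declare the encoding before handing the string to stringi. In this sketch `lower_utf8` is a hypothetical helper name, and marking the input as UTF-8 is an assumption about where the data came from.

```r
# Hypothetical wrapper around stringi::stri_trans_tolower().
lower_utf8 <- function(x, locale = "en_CA") {
  if (!requireNamespace("stringi", quietly = TRUE)) {
    stop("the stringi package is required")
  }
  # Assumption: the bytes in `x` really are UTF-8, so mark them as such.
  Encoding(x) <- "UTF-8"
  # Case mapping is done by ICU and does not depend on the session locale.
  stringi::stri_trans_tolower(x, locale = locale)
}
```

With this in place, `lower_utf8(her_name)` can be called without touching `Sys.setlocale()` at all.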

drammock

score 0 · Answer 3 · answered Jul 01 '20 at 09:04

My question was redirected here because it describes a similar problem. You can also solve the problem by replacing the troublesome character with one that tolower() handles correctly.

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-tolower(x)
x
[1] "sn. İletİşİm bİlgİlerİnİz guncellenmistir."

I originally attached this output as a picture, because it may not render the same on every computer. [picture not preserved]

Expected output: [picture not preserved]

When I tried @drammock's suggestion, I got this:

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
x<-stri_trans_tolower(x, locale="tr_TR")
x
[1] "sn. iletişim bilgileriniz guncellenmıstır."

Again, I attached the output of @drammock's suggestion as a picture. The areas highlighted in yellow in that picture are not the expected output.

As a result, I looked up the Unicode code point of the character that tolower() could not convert (U+0130, İ) and replaced it with a character that tolower() converts without trouble. Then I applied tolower() and got the expected output. Thank you, everyone.

x<-c("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")
# Replace İ (U+0130) with a plain ASCII I, which tolower() can convert
x<-gsub("\u0130","I",x,useBytes = FALSE)
x<-tolower(x)
x
[1] "sn. iletişim bilgileriniz guncellenmistir."
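
The same idea can be combined with stringi so that the result does not depend on the session locale. This is only a sketch: `lower_dotted_i` is a hypothetical helper name, the input is assumed to be UTF-8, and it assumes every capital I in the text, dotted or not, should become a plain i. That matches the output expected in this answer, but it is not correct Turkish casing for words that contain a genuine dotless ı.

```r
# Hypothetical helper: lower-case text, mapping both I and İ (U+0130) to "i".
lower_dotted_i <- function(x) {
  if (!requireNamespace("stringi", quietly = TRUE)) {
    stop("the stringi package is required")
  }
  # Replace İ with ASCII I first; fixed = TRUE treats the pattern literally.
  x <- gsub("\u0130", "I", x, fixed = TRUE)
  # A non-Turkish locale maps I to i (a Turkish locale would give ı).
  stringi::stri_trans_tolower(x, locale = "en")
}
```

For example, `lower_dotted_i("Sn. İLETİŞİM BİLGİLERİNİZ GUNCELLENMISTIR.")` should give the same string as the output above.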
