How to check if a string contains accented Latin characters like é in Ruby?

Question

Given:

str1 = "é"   # Latin accent
str2 = "囧"  # Chinese character
str3 = "ジ"  # Japanese character
str4 = "e"   # English character

How can I differentiate str1 (an accented Latin character) from the rest of the strings?

Update:

Given

str1 = "\xE9" # the accented é, stored as the single byte \xE9 when read from a file

How would the answer be different?

I think you mean "\xE9" (double quotes) – Matt Brictson Jun 26 '15 at 18:57

score 3 · Accepted Answer · answered Jun 26 '15 at 02:22

I would first strip out all plain ASCII characters with gsub, and then check with a regex whether any Latin characters remain. This should detect accented Latin characters.

def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é")  #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e")  #=> nil (falsy)

Matt Brictson


"é" actually stored as "\xE9" reading from a file. I've updated my question. Would you help in that case? – sbs Jun 26 '15 at 18:44
In that case the file is probably encoded in ISO-8859-1. Read the file and convert it to UTF-8 before doing the regex check. `IO.read("myfile", :encoding => "ISO-8859-1:UTF-8")` – Matt Brictson Jun 26 '15 at 19:03
What if str = "\xE9" is something I'm not able to change? How could it be recognized? – sbs Jun 26 '15 at 19:13
There is probably a better way to do it, but this will detect "\xE9" in str: `str.force_encoding("binary").include?("\xE9".force_encoding("binary"))` – Matt Brictson Jun 26 '15 at 20:18
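
The conversion suggested in the comments can also be applied to a string that is already in memory. The following is only a sketch: the helper name `latin_accented_bytes?` is hypothetical, and it assumes the bytes really are ISO-8859-1, which the thread suggests but does not confirm.

```ruby
# Hypothetical helper: interprets the raw bytes as ISO-8859-1 (an assumption
# about where the data came from) and transcodes them to UTF-8 before matching.
def latin_accented_bytes?(raw, source_encoding = "ISO-8859-1")
  # encode(dst, src) returns a new string and leaves `raw` untouched.
  utf8 = raw.encode("UTF-8", source_encoding)
  # Drop plain ASCII, then look for any remaining Latin-script character.
  !!(utf8.gsub(/\p{ASCII}/, "") =~ /\p{Latin}/)
end

latin_accented_bytes?("\xE9") #=> true  (é as a single ISO-8859-1 byte)
latin_accented_bytes?("e")    #=> false
```

If the data is actually in another single-byte encoding, pass that encoding name as the second argument instead.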

codevolution · Answer 2 · 2015-06-25T23:07:03.140

1

Try /\p{Latin}/.match(strX), or /[\p{Latin}&&[^a-zA-Z]]/ if you want to detect only special Latin characters. (The && intersection only works inside a bracketed character class.)

By the way, "e" (str4) is also a Latin character.
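
As a rough illustration, here is the intersection form run against the four strings from the question. The constant name `SPECIAL_LATIN` is made up for this sketch.

```ruby
# Matches Latin-script characters other than the unaccented letters a-z and A-Z.
# The && intersection is written inside the outer brackets of the character class.
SPECIAL_LATIN = /[\p{Latin}&&[^a-zA-Z]]/

["é", "囧", "ジ", "e"].each do |s|
  matched = !SPECIAL_LATIN.match(s).nil?
  puts "#{s.inspect}: #{matched}"
end
# Only "é" reports true; the other three report false.
```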

Hope it helps.

edited Jun 25 '15 at 23:07

answered Jun 25 '15 at 22:53

codevolution


score 1 · Answer 3 · answered Jun 25 '15 at 23:27

I'd use a two-stage approach:

1. Rule out strings containing non-Latin characters by attempting to encode the string as Latin-1 (ISO-8859-1).
2. Test for accented characters with a regular expression.

Example:

def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1")   # just to see if it raises an exception

  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

I strongly suggest you select for yourself the accented characters you're attempting to screen for, rather than just copying what I've written; I certainly may have missed some. Also note that this will always return false for strings containing any character that cannot be encoded as Latin-1 (such as 囧), even if the string also contains an accented Latin character.
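
To make that caveat concrete, here is a compact sketch of the same two-stage idea. The method name `accented_latin1?` is hypothetical, and the character list is written as code point ranges purely for brevity.

```ruby
# Hypothetical variant of the two-stage check that always returns true or false.
def accented_latin1?(str)
  str.encode("ISO-8859-1") # stage one: raises if a character is outside Latin-1
  # Stage two: the ranges cover U+00C0-U+00D6, U+00D9-U+00F6 and U+00F9-U+00FF,
  # which skips the characters × Ø ÷ ø.
  !!(str =~ /[À-ÖÙ-öù-ÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

accented_latin1?("é")   #=> true
accented_latin1?("e")   #=> false (fits in Latin-1, but no accent)
accented_latin1?("é囧") #=> false (rejected in stage one despite the é)
```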
