3

Given:

str1 = "é"   # Latin accent
str2 = "囧"  # Chinese character
str3 = "ジ"  # Japanese character
str4 = "e"   # English character

How can I differentiate str1 (a Latin accented character) from the rest of the strings?

Update:

Given

str1 = "\xE9" # Latin accent é actually stored as \xE9 reading from a file

How would the answer be different?

sbs

3 Answers

3

I would first strip out all plain ASCII characters with gsub, and then use a regex to check whether any Latin characters remain. This should detect accented Latin characters.

def latin_accented?(str)
  str.gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("é")  #=> 0 (truthy)
latin_accented?("囧") #=> nil (falsy)
latin_accented?("ジ") #=> nil (falsy)
latin_accented?("e")  #=> nil (falsy)
Matt Brictson
  • "é" actually stored as "\xE9" reading from a file. I've updated my question. Would you help in that case? – sbs Jun 26 '15 at 18:44
  • In that case the file is probably encoded in ISO-8859-1. Read the file and convert it to UTF-8 before doing the regex check. `IO.read("myfile", :encoding => "ISO-8859-1:UTF-8")` – Matt Brictson Jun 26 '15 at 19:03
  • What if that str="\xE9" is something I'm not able to change. How could that be recognized? – sbs Jun 26 '15 at 19:13
  • There is probably a better way to do it, but this will detect "\xE9" in str: `str.force_encoding("binary").include?("\xE9".force_encoding("binary"))` – Matt Brictson Jun 26 '15 at 20:18
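
For the "\xE9" case discussed in the comments above, here is a minimal sketch of converting the string before running the regex check. It assumes that any string which is not valid UTF-8 is really ISO-8859-1, which may not hold for your data. The helper name `to_utf8` is hypothetical, not part of Ruby.

```ruby
# Hypothetical helper. ASSUMPTION: input that is not valid UTF-8
# is ISO-8859-1 (Latin-1), as guessed in the comments above.
def to_utf8(str)
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?
  # dup so the caller's string is not mutated by force_encoding
  str.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Same check as in the answer, applied after the conversion.
def latin_accented?(str)
  to_utf8(str).gsub(/\p{Ascii}/, "") =~ /\p{Latin}/
end

latin_accented?("\xE9") #=> 0 (truthy)
latin_accented?("e")    #=> nil (falsy)
```

Because the helper works on a copy, it also accepts strings you are not able to change.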
1

Try /\p{Latin}/.match(strX), or /[\p{Latin}&&[^a-zA-Z]]/ if you want to detect only special Latin characters. Note that the && intersection only works inside a character class.

By the way, "e" (str4) is also a Latin character.
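
A minimal sketch of the "special Latin characters only" check, for illustration. In Ruby, `&&` acts as an intersection only inside a bracketed character class, so the whole expression is wrapped in `[...]`. The names `SPECIAL_LATIN` and `special_latin?` are hypothetical.

```ruby
# Latin-script characters, minus the unaccented ASCII letters a-z and A-Z.
SPECIAL_LATIN = /[\p{Latin}&&[^a-zA-Z]]/

def special_latin?(str)
  SPECIAL_LATIN.match?(str)
end

special_latin?("é")  #=> true
special_latin?("囧") #=> false
special_latin?("ジ") #=> false
special_latin?("e")  #=> false
```

This also matches Latin-script letters that carry no accent but fall outside a-z, such as "ß" or "æ", so it is broader than "accented" in the strict sense.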

Hope it helps.

codevolution
1

I'd use a two-stage approach:

  1. Rule out strings containing non-Latin characters by attempting to encode the string as Latin-1 (ISO-8859-1).
  2. Test for accented characters with a regular expression.

Example:

def is_accented_latin?(test_string)
  test_string.encode("ISO-8859-1")   # just to see if it raises an exception

  test_string.match(/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöùúûüýþÿ]/)
rescue Encoding::UndefinedConversionError
  false
end

I strongly suggest you select for yourself the accented characters you're attempting to screen for, rather than just copying what I've written; I may well have missed some. Also note that this will always return false for strings containing any character that cannot be encoded as Latin-1, even if the string also contains a Latin character with an accent.
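
To make the first stage and its caveat concrete, here is a small sketch that isolates the encoding test. The method name `latin1_representable?` is hypothetical; it relies only on `String#encode` raising `Encoding::UndefinedConversionError` for characters that ISO-8859-1 cannot represent.

```ruby
# Stage 1 on its own: can every character be encoded as ISO-8859-1?
def latin1_representable?(str)
  str.encode("ISO-8859-1")
  true
rescue Encoding::UndefinedConversionError
  false
end

latin1_representable?("é")   #=> true
latin1_representable?("e")   #=> true
latin1_representable?("囧")  #=> false
latin1_representable?("é囧") #=> false, one unencodable character rules out the whole string
latin1_representable?("ő")   #=> false, a Latin-script letter that is not part of ISO-8859-1
```

The last case shows that Latin-1 covers only part of the Latin script, so this stage can also reject accented letters such as "ő".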

Wally Altman