3

What's the current best way to transliterate characters to 7-bit ASCII in Ruby? Most of the questions I've seen on SO are 3 or 4 years old, and the solutions don't fully work.

I want a method that will work for a wide range of Latin alphabets and, for example, convert

Your résumé’s a non–encyclopædia

to

Your resume's a non-encyclopaedia

but I cannot find a way that does that, particularly for folding 8-bit ASCII to 7-bit ASCII.

s = "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', s)
# => Your r'esum'e's a non-encyclopaedia
puts s.encode('ascii//ignore//translit', 'utf-8')
# => Encoding::ConverterNotFoundError: code converter not found (UTF-8 to ascii//ignore//translit)
puts s.encode('ascii', 'utf-8')
# => Encoding::UndefinedConversionError: U+00E9 from UTF-8 to US-ASCII
puts s.encode('ascii', 'utf-8', invalid: :replace, undef: :replace)
# => Your r?sum??s a non?encyclop?dia
puts I18n.transliterate(s)
# => Your resume?s a non?encyclopaedia

Since Iconv is deprecated, I'd rather not use it if I don't have to, but I will if it is the only thing that works. Obviously I could put in custom 8-bit ASCII to 7-bit ASCII translations, but I'd prefer to use a supported solution that has been thoroughly tested.

The translation is handled fine by International Components for Unicode with its Latin-ASCII translation, but that is only available for Java and C.

UPDATE

What I ended up doing was writing my own character translation routines to take care of punctuation and whitespace, after which I could use I18n.transliterate to do the rest. I'd still prefer finding and using a well-maintained library function to handle the stuff I18n does not.
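For anyone wanting to try the same two-step idea, here is a minimal sketch of it, not the exact code I used. It assumes Ruby 2.2+ (for `String#unicode_normalize`), and it swaps `I18n.transliterate` for NFKD normalization in the letter step so it runs without any gems. The names `ASCII_FALLBACKS` and `to_ascii` are made up for illustration, and the table is deliberately small rather than complete.

```ruby
# Hypothetical fallback table: punctuation, whitespace, and letters that
# Unicode normalization does not decompose. Illustrative, not exhaustive.
ASCII_FALLBACKS = {
  "\u2018" => "'",   "\u2019" => "'",    # single curly quotes
  "\u201C" => '"',   "\u201D" => '"',    # double curly quotes
  "\u2013" => "-",   "\u2014" => "-",    # en dash, em dash
  "\u2026" => "...",                     # horizontal ellipsis
  "\u00A0" => " ",                       # no-break space
  "\u00E6" => "ae",  "\u00C6" => "AE",   # ae ligature
  "\u0153" => "oe",  "\u0152" => "OE",   # oe ligature
  "\u00DF" => "ss",                      # sharp s
  "\u00F8" => "o",   "\u00D8" => "O"     # o with stroke
}.freeze

FALLBACK_PATTERN = Regexp.union(ASCII_FALLBACKS.keys)

# Hypothetical helper: fold a UTF-8 string down to 7-bit ASCII.
def to_ascii(text)
  # Step 1: explicit substitutions from the table above.
  folded = text.gsub(FALLBACK_PATTERN, ASCII_FALLBACKS)
  # Step 2: decompose accented letters, then drop the combining marks.
  folded = folded.unicode_normalize(:nfkd).gsub(/\p{Mn}/, "")
  # Step 3: anything still outside ASCII becomes "?".
  folded.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
end

puts to_ascii("Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia")
# => Your resume's a non-encyclopaedia
```

The table has to run before normalization only for characters that normalization would otherwise leave untouched; step 2 could probably be replaced by `I18n.transliterate` if that dependency is already present.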

Old Pro

3 Answers

5

If you're willing to add a somewhat heavy dependency (unless you're already on Rails), ActiveSupport has support (pun not intended) for this:

ActiveSupport::Multibyte::Chars.new("Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia").mb_chars.normalize(:kd).chars.to_a.delete_if {|c| !c.ascii_only?}.join('')

This works for all of the letters. It doesn't handle the apostrophe right yet though.
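The same decompose-then-drop idea can be sketched without ActiveSupport, assuming Ruby 2.2+ where `String#unicode_normalize` is available. The method name `strip_non_ascii` is hypothetical. It also shows the limit of the approach: characters with no decomposition into an ASCII base (the curly apostrophe, 'æ') are simply deleted rather than replaced.

```ruby
# Hypothetical helper: NFKD-decompose the string, then keep only the
# characters that are already plain ASCII. Accents are removed because
# the combining marks are non-ASCII; undecomposable characters vanish.
def strip_non_ascii(text)
  text.unicode_normalize(:nfkd).chars.select(&:ascii_only?).join
end

puts strip_non_ascii("Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia")
# => Your resumes not an encyclopdia
```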

Linuxios
  • I'm using Rails 3.2 but this doesn't work at all: `NoMethodError: undefined method 'delete_if' for #`. I suspect that even if it's a syntax error that could be fixed up, all it will do is delete non-ASCII characters. What I want is to replace them with reasonable substitutes, given that the input is basically English. Note also that I enhanced my example to include an en dash. – Old Pro Jun 18 '13 at 07:04
  • @OldPro: Honestly, your requirements are getting so tight I doubt any library could ever fulfil them. And this solution does deal with the letters. See edit. – Linuxios Jun 18 '13 at 15:22
  • Thanks for the effort, but this deletes not only the punctuation but also the 'æ', which is not good enough for this project. – Old Pro Jun 18 '13 at 16:49
  • @OldPro: Glad to try ;). I hope you find something, but in the meantime I suggest you look at working with the original UTF-8 strings. – Linuxios Jun 18 '13 at 16:51
1

I guess the removeaccents script is just what you want.

Maybe the UnicodeUtils gem can be useful, but only to remove the accents (not to convert things like æ, AFAIK).

Sony Santos
  • I'd much prefer to use the standard `I18n.transliterate` over the `removeaccents` script as I'm sure the former is more robust. The only problem with `I18n.transliterate` is that it only handles letters (and numbers?) and does not do anything to normalize punctuation. `removeaccents` has the same limitation. – Old Pro Dec 17 '14 at 00:36
  • @OldPro, I guess you made a good choice. Thank you for commenting it! – Sony Santos Dec 17 '14 at 18:02
0

Works with any locale:

  def normalized_text
    I18n.transliterate(text.downcase.strip)
  end
Dorian