Converting UTF-8 characters into proper ASCII characters

Question

I have the string "V\355ctor" (I think that's Víctor). Is there a way to convert it to ASCII where í would be replaced by an ASCII i?

I have already tried Iconv without success; I only get `Iconv::IllegalSequence: "\355ctor"`.

Further, are there differences between Ruby 1.8.7 and Ruby 2.0?

EDIT: `Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "V\355ctor")` seems to work, but the result is `Vctor`, not `Victor`.

How far do you need to go? Do you just want to strip out accents or do you want to convert a Turkish `ı` to a simple `i` as well? — mu is too short, Nov 11 '13 at 20:15
the latter. I don't want to "ignore" the character, rather replace it by a simple `i` — Benedikt B, Nov 11 '13 at 20:27
Your input is not `UTF-8`, it is most likely `ISO-8859-1`. Not that it's the answer you need, but you won't be able to get sensible conversions if you start with the wrong assumptions about the input string's encoding. It needs to be correct to get the correct translations to ASCII — Neil Slater, Nov 11 '13 at 20:48
What @NeilSlater said. A byte with value octal 355/decimal 237 followed by a "c" is not legal in UTF-8, in which the "í" character is encoded as two bytes: octal 303/decimal 195 followed by octal 255/decimal 173. — Mark Reed, Nov 11 '13 at 20:53
Thank you Neil and Mark, however something like `Iconv.iconv("ISO-8859-1", "ASCII", "V\355ctor")` raises Iconv::IllegalSequence errors for me (I have tried a lot of combinations already). — Benedikt B, Nov 14 '13 at 10:08
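
The byte-level point made in the comments above can be checked on Ruby 1.9 or later with core String methods only. This is a small sketch: the variable names are illustrative, and ISO-8859-1 is the commenters' guess at the source encoding, not a confirmed fact.

```ruby
# The character í (U+00ED) written as a UTF-8 string.
i_acute = "\u00ED"
utf8_octal = i_acute.bytes.map { |b| b.to_s(8) }  # => ["303", "255"]

# The string from the question: a lone byte 0355 (decimal 237) before "ctor".
# .dup avoids mutating a string literal.
suspect = "V\355ctor".dup
valid_as_utf8 = suspect.force_encoding("UTF-8").valid_encoding?  # => false

# Assumption: the bytes are really ISO-8859-1. Relabel them, then transcode.
recovered = suspect.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8_octal.inspect
puts valid_as_utf8
puts recovered
```

Once the string is valid UTF-8, a transliteration library has something sensible to work with.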

Mark Thomas · Answer 1 · 2013-11-18T23:10:46.437

score 7

I know of two options.

`transliterate` from the I18n gem:
```
$ irb
1.9.3-p448 :001 > string = "Víctor"
 => "Víctor"
1.9.3-p448 :002 > require 'i18n'
 => true
1.9.3-p448 :003 > I18n.transliterate(string)
 => "Victor"
```

`Unidecoder` from the stringex gem:
```
require 'stringex'
Stringex::Unidecoder.decode(string)
```

Update:

When running Unidecoder on "V\355ctor", you get the following error:

```
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with IBM437 string)
```

Hmm, maybe you want to first translate from IBM437:

```
string.force_encoding('IBM437').encode('UTF-8')
```

This may help you get further. Note that the autodetected encoding could be incorrect. If you know exactly what the encoding is, everything becomes a lot easier.
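
If the input does turn out to be ISO-8859-1, as the comments on the question suggest, one gem-free route on Ruby 2.2 or later is sketched below. `latin1_to_ascii` is a hypothetical helper name, and the source encoding is an assumption. The idea is to transcode to UTF-8, decompose accented letters with `unicode_normalize`, and drop the combining marks. Letters with no decomposition, such as the Turkish `ı`, pass through unchanged, so this is not a full transliterator.

```ruby
# Hypothetical helper: assumes `bytes` holds ISO-8859-1 (Latin-1) data.
def latin1_to_ascii(bytes)
  # Relabel the raw bytes as Latin-1, then transcode them to UTF-8.
  utf8 = bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  # NFD splits "í" into "i" + U+0301; \p{Mn} matches such combining marks.
  utf8.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts latin1_to_ascii("V\355ctor")  # prints "Victor"
```

On Ruby 1.8.7 neither `String#encode` nor `unicode_normalize` exists, which is why Iconv was the usual tool there. Iconv itself was removed from the standard library in Ruby 2.0.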

edited Nov 18 '13 at 23:10

answered Nov 12 '13 at 01:12


It seems like `Víctor` is `V\303\255ctor`, not `V\355ctor`? Your example works fine, but `V\355ctor` returns `V?or` for me. – Benedikt B Nov 14 '13 at 10:13
[I18n::InvalidLocale: :en is not a valid locale](http://stackoverflow.com/questions/31416559/i18ninvalidlocale-en-is-not-a-valid-locale) – A.D. Nov 22 '15 at 17:36

score 3 · Answer 2 · answered Nov 11 '13 at 20:35

What you want to do is called transliteration.

The most used and best maintained library for this is ICU. (Iconv is frequently used too, but it has many limitations such as the one you ran into.)

A cursory Google search yields a few Ruby ICU wrappers. I cannot comment on which one is better, since I've never used any of them, but that is the kind of tool you want to be using.
