How to reduce a string to ASCII 7 characters for indexing purposes?

Question

I am working on an application which must index certain sentences. Currently using Java and PostgreSQL. The sentences may be in several languages like French and Spanish using accents and other non-ASCII symbols.

For each word I want to create an index-able equivalent so that a user can perform a search insensitive to accents (transliteration). For example, when the user searches "nacion" it must find it even if the original word stored by the application was "Nación".

What would be the best strategy for this? I am not restricted to PostgreSQL, nor does the internal indexed value need to resemble the original word. Ideally, it should be a generic solution that converts any Unicode string into an ASCII string, insensitive to case and accents.

So far I am using a custom function shown below which naively just replaces some letters with ASCII equivalents before storing the indexed value and does the same on query strings.

    public String toIndexableASCII (String sStrIn) {
      if (sStrIn==null) return null;
      int iLen = sStrIn.length();
      if (iLen==0) return sStrIn;
      StringBuilder sStrBuff = new StringBuilder(iLen);
      String sStr = sStrIn.toUpperCase();

      for (int c=0; c<iLen; c++) {
        switch (sStr.charAt(c)) {
          case 'Á':
          case 'À':
          case 'Ä':
          case 'Â':
          case 'Å':
          case 'Ã':
            sStrBuff.append('A');
            break;
          case 'É':
          case 'È':
          case 'Ë':
          case 'Ê':
            sStrBuff.append('E');
            break;
          case 'Í':
          case 'Ì':
          case 'Ï':
          case 'Î':
            sStrBuff.append('I');
            break;
          case 'Ó':
          case 'Ò':
          case 'Ö':
          case 'Ô':
          case 'Ø':
            sStrBuff.append('O');
            break;
          case 'Ú':
          case 'Ù':
          case 'Ü':
          case 'Û':
            sStrBuff.append('U');
            break;
          case 'Æ':
            sStrBuff.append('E');
            break;
          case 'Ñ':
            sStrBuff.append('N');
            break;
          case 'Ç':
            sStrBuff.append('C');
            break;
          case 'ß':
            sStrBuff.append('B');
            break;
          case (char)255:
            sStrBuff.append('_');
            break;
          default:
            sStrBuff.append(sStr.charAt(c));
        }
      }

      return sStrBuff.toString();
    }

Interpreting the bytes as 7-bit ASCII would not provide the "information loss" that I want to achieve. I want "coraçón" to be the same as "coracon", so that it doesn't matter whether the user types the accents or not when searching. I do not need a spelling or proximity checker like Google's "did you mean ...?", but I do need "é" == "e". – Serg M Ten, Feb 22 '17 at 13:05
The mapping you are asking about is called "transliteration." – Tom Blodget, Feb 22 '17 at 17:10
Thanks. I edited the question to add transliteration, which also helped me to Google a few good matches. – Serg M Ten, Feb 22 '17 at 17:40

score 2 · Accepted Answer · answered Feb 22 '17 at 13:40

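    import java.text.Normalizer;

    public class StripAccents {
        public static void main(String[] args) {
            String s = "Nación";

            // Decompose: each accented letter becomes base letter + combining mark(s)
            String x = Normalizer.normalize(s, Normalizer.Form.NFD);

            // Copy every char except the combining (nonspacing) marks
            StringBuilder sb = new StringBuilder(x.length());
            for (char c : x.toCharArray()) {
                if (Character.getType(c) != Character.NON_SPACING_MARK) {
                    sb.append(c);
                }
            }

            System.out.println(s); // Nación
            System.out.println(sb.toString()); // Nacion
        }
    }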

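How this works: NFD normalization decomposes each accented character into a base letter followed by its combining marks (ó becomes o followed by U+0301, the combining acute accent). The loop then drops the combining marks and keeps the base letters.

Character.NON_SPACING_MARK is the value Character.getType returns for the Unicode general category Mn (Nonspacing Mark), which contains the combining diacritical marks.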
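
The same idea can be packaged as a reusable key function. What follows is a minimal sketch rather than the answer's own code: the class name `IndexKey` and the method `toIndexKey` are hypothetical, the regex `\p{Mn}` stands in for the explicit loop, and lowercasing with `Locale.ROOT` is an addition to cover the case-insensitivity the question asks for. String literals use Unicode escapes so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

public class IndexKey {

    // Hypothetical helper: folds accents and case into one search key.
    public static String toIndexKey(String in) {
        if (in == null) {
            return null;
        }
        // NFD splits each accented letter into base letter + combining marks.
        String decomposed = Normalizer.normalize(in, Normalizer.Form.NFD);
        // The Mn property class matches nonspacing (combining) marks.
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // Locale.ROOT keeps lowercasing independent of the JVM default locale.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toIndexKey("Naci\u00F3n"));       // nacion
        System.out.println(toIndexKey("cora\u00E7\u00F3n")); // coracon
        System.out.println(toIndexKey("NACION"));            // nacion
    }
}
```

Applying the same function to the stored value and to the query string makes both sides comparable. Letters that have no canonical decomposition, such as ß, Æ and Ø, pass through NFD unchanged, so they still need an explicit mapping if the key must be pure ASCII.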

If you want to just **compare** two strings, as opposed to storing canonicalized versions, a more robust solution is available; see http://stackoverflow.com/questions/12889760/sort-list-of-strings-with-localization — Mark Jeronimus, Feb 22 '17 at 13:47
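
For the compare-only case, one standard-library option is `java.text.Collator` at `PRIMARY` strength. This is a sketch under assumptions: whether the linked question recommends exactly this is not confirmed here, the class and method names are hypothetical, and `Locale.ROOT` is used only to keep the example locale-neutral.

```java
import java.text.Collator;
import java.util.Locale;

public class AccentInsensitiveCompare {

    // Hypothetical helper: true when a and b differ only by accents or case.
    public static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // PRIMARY strength treats accent and case differences as ties.
        collator.setStrength(Collator.PRIMARY);
        // Also treat precomposed and decomposed forms of a letter alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameBaseLetters("nacion", "Naci\u00F3n")); // true
        System.out.println(sameBaseLetters("nacion", "nation"));      // false
    }
}
```

A collator compares two strings at a time, so it suits in-memory comparison and sorting more than building a stored index key.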

GhostCat · Answer 2 · 2017-02-22T13:43:38.040

score 1

The one obvious improvement for your current code: use a Map<Character, Character> that you prefill with your mappings.

Then simply check whether that Map has a mapping for the character: if so, use it; otherwise use the original character.

As Androbin points out in the comments, there are also map implementations that work on primitive types instead of boxed objects, such as Trove. Depending on your solution and requirements, you could look into those.
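
A minimal sketch of the Map-based lookup follows. It deviates from the suggestion in one respect, which is an assumption of this example: the values are `String` rather than `Character`, so that one letter can expand to two (ß to "ss"). The class name `MapFolder`, the method `fold`, and the handful of mappings are illustrative only. Unicode escapes are used so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

public class MapFolder {

    // Illustrative subset only; a real table would cover more letters.
    private static final Map<Character, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00DF', "ss"); // sharp s
        REPLACEMENTS.put('\u00C6', "AE"); // AE ligature
        REPLACEMENTS.put('\u00E6', "ae"); // ae ligature
        REPLACEMENTS.put('\u00D8', "O");  // O with stroke
        REPLACEMENTS.put('\u00F8', "o");  // o with stroke
        REPLACEMENTS.put('\u00D1', "N");  // N with tilde
        REPLACEMENTS.put('\u00F1', "n");  // n with tilde
    }

    public static String fold(String in) {
        StringBuilder sb = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            String mapped = REPLACEMENTS.get(c);
            // Fall back to the original character when no mapping exists.
            if (mapped != null) {
                sb.append(mapped);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Stra\u00DFe")); // Strasse
        System.out.println(fold("Espa\u00F1a")); // Espana
    }
}
```

Compared with the switch statement, the table can be extended or loaded from a file without touching the loop.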

edited Feb 22 '17 at 13:43

answered Feb 22 '17 at 13:01

Thankfully, there is Map#getOrDefault – Androbin Feb 22 '17 at 13:15
I recommend a primitive Map for efficiency – Androbin Feb 22 '17 at 13:16
there are for example FastUtil, HPPC, Koloboke and Trove – Androbin Feb 22 '17 at 13:39
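
To illustrate the point about avoiding boxed objects without guessing at the API of any of the libraries named above, here is a sketch that swaps in a different technique: a plain `char[]` lookup table. It assumes only the Latin-1 range (code points below 256) needs folding, and the class name `Latin1Folder` and its mappings are hypothetical.

```java
import java.util.Locale;

public class Latin1Folder {

    // One slot per Latin-1 code point; a primitive array, so no boxing.
    private static final char[] TABLE = new char[256];
    static {
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) i; // identity mapping by default
        }
        map("\u00E0\u00E1\u00E2\u00E3\u00E4\u00E5", 'a');
        map("\u00E8\u00E9\u00EA\u00EB", 'e');
        map("\u00EC\u00ED\u00EE\u00EF", 'i');
        map("\u00F2\u00F3\u00F4\u00F5\u00F6", 'o');
        map("\u00F9\u00FA\u00FB\u00FC", 'u');
        map("\u00F1", 'n');
        map("\u00E7", 'c');
    }

    // Registers the same ASCII replacement for every character in 'from'.
    private static void map(String from, char to) {
        for (char c : from.toCharArray()) {
            TABLE[c] = to;
        }
    }

    public static String fold(String in) {
        // Lowercase first so the table only needs lowercase entries.
        char[] chars = in.toLowerCase(Locale.ROOT).toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] < TABLE.length) {
                chars[i] = TABLE[chars[i]];
            }
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(fold("Naci\u00F3n"));       // nacion
        System.out.println(fold("CORA\u00C7\u00D3N")); // coracon
    }
}
```

Characters outside the table are left unchanged, so this only fits inputs known to stay within Latin-1.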
