So... I have an STL vector that I need to search/filter with a user-provided string. (Just mentioning this in case there's a specific or better way to do it for this particular use case.)

Currently (this code is old) it's done by just iterating through the vector and regex-matching each element against the search string.

Our problem, however, stems from accented characters. Our desired behavior is for the search to match strings without regard to diacritics (e.g. "telefono" also matches "teléfono" and vice versa).

Is there a decent way to do this, ideally without having to resort to libraries other than boost?

2 Answers

0

It would help to know the character encoding (e.g. UTF-8) when asking questions about string matching. That being said, one approach to dealing with diacritics is to replace accented characters with their plain equivalents before doing the string compare. Your database of matches would not contain any diacritics, and you would sanitize the search input string the same way before comparing.
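A rough sketch of how that could look for a vector of songs while still keeping the accented text for display. Everything here is hypothetical: the `Song` struct, `make_song`, `filter_songs` and the `Sanitizer` callback are made-up names, and the sanitizer itself is whatever diacritic-stripping function you end up with. The match is a plain case-sensitive substring search on the cached key.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: display strings keep their diacritics,
// search_key holds the sanitized "artist title" text.
struct Song {
    std::string artist;
    std::string title;
    std::string search_key;
};

// Any function that maps a string to its diacritic-free form (an assumption).
using Sanitizer = std::function<std::string(const std::string&)>;

// Build a Song and compute its search key once, up front.
Song make_song(std::string artist, std::string title, const Sanitizer& sanitize) {
    std::string key = sanitize(artist + " " + title);
    return Song{std::move(artist), std::move(title), std::move(key)};
}

// Sanitize the query the same way, then keep every song whose key contains it.
std::vector<Song> filter_songs(const std::vector<Song>& songs,
                               const std::string& query,
                               const Sanitizer& sanitize) {
    const std::string needle = sanitize(query);
    std::vector<Song> hits;
    std::copy_if(songs.begin(), songs.end(), std::back_inserter(hits),
                 [&needle](const Song& s) {
                     return s.search_key.find(needle) != std::string::npos;
                 });
    return hits;
}
```

The cost is one extra string per element, but the sanitizer only runs once per song instead of once per song per search.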

Justin Randall
  • Like I said above, this would be my plan B. It's a list of songs (artist/title) so I'd like to retain the diacritics for display. I thought of perhaps adding a couple of members to our struct to represent sanitized versions of the artist and title (kind of like what iTunes does) but I'd like to avoid that approach if possible. – Gregorio Litenstein Dec 05 '17 at 23:00
0

Short answer: You "normalize" both strings and then do the search/comparison.

Note that Unicode represents many accented characters in more than one way. There is a single codepoint (U+00E9 LATIN SMALL LETTER E WITH ACUTE) to represent the character with the accent, but it can also be represented by a combination of codepoints (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT). The general way to deal with this is to choose one normalization form, either Normal Form C (for pre-composed characters) or Normal Form D (for de-composed characters), and convert both strings to it. Normalizing can be more complex than it seems. Once both strings are in the same normal form, you can compare them directly.
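To make the problem concrete, here is a tiny illustration, assuming the strings are stored as UTF-8 in `std::string`. The names `precomposed`, `decomposed` and `same_bytes` are made up for the example. Both strings render as "teléfono", yet a byte-wise comparison says they differ.

```cpp
#include <string>

// UTF-8 bytes are spelled out so the source file's own encoding doesn't matter.
// The literals are split so the hex escape doesn't swallow the following 'f'.
const std::string precomposed = "tel\xC3\xA9" "fono";   // é as U+00E9 (2 bytes)
const std::string decomposed  = "tele\xCC\x81" "fono";  // e + U+0301 (1 + 2 bytes)

// std::string's operator== compares bytes, so it knows nothing about
// Unicode equivalence between the two spellings.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}
```

`same_bytes(precomposed, decomposed)` is false even though the text looks identical on screen, which is why both sides need to go through the same normalization first.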

If you want to ignore the diacritics altogether, you can make up your own normalization scheme. For example, you can decompose any pre-composed characters and then drop all the combining codepoints. This will allow the base character to match an accented character regardless of how the accented character was originally represented.
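A minimal sketch of that home-made scheme using only the standard library, assuming well-formed UTF-8 input. The function names are invented for the example, and the lookup table is a tiny hand-written sample (a few Spanish letters), not real Unicode decomposition data. Mapping a pre-composed letter straight to its base letter is the "decompose, then drop the mark" step collapsed into one lookup; a real implementation would get this from a Unicode library.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Decode UTF-8 into code points. Assumes well-formed input; no validation.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char c = static_cast<unsigned char>(s[i++]);
        char32_t cp;
        int extra;  // number of continuation bytes that follow
        if (c < 0x80)             { cp = c;        extra = 0; }
        else if ((c >> 5) == 0x6) { cp = c & 0x1F; extra = 1; }
        else if ((c >> 4) == 0xE) { cp = c & 0x0F; extra = 2; }
        else                      { cp = c & 0x07; extra = 3; }
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Append one code point to a UTF-8 string.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Replace known pre-composed letters with their base letter and drop
// combining marks from the U+0300..U+036F block.
std::string strip_diacritics(const std::string& utf8) {
    // Illustrative, incomplete table: extend it or use a Unicode library.
    static const std::unordered_map<char32_t, char32_t> base = {
        {0x00E1, U'a'}, {0x00E9, U'e'}, {0x00ED, U'i'}, {0x00F3, U'o'},
        {0x00FA, U'u'}, {0x00FC, U'u'}, {0x00F1, U'n'},
        {0x00C1, U'A'}, {0x00C9, U'E'}, {0x00CD, U'I'}, {0x00D3, U'O'},
        {0x00DA, U'U'}, {0x00DC, U'U'}, {0x00D1, U'N'},
    };
    std::string out;
    for (char32_t cp : decode_utf8(utf8)) {
        if (cp >= 0x0300 && cp <= 0x036F) continue;  // combining mark: drop
        const auto it = base.find(cp);
        append_utf8(out, it != base.end() ? it->second : cp);
    }
    return out;
}
```

Run both the stored string and the search string through `strip_diacritics` and compare the results; either spelling of "teléfono" then matches "telefono" and vice versa.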

There are also "kompatibility" normal forms in Unicode (KC and KD), which additionally replace compatibility characters (ligatures, full-width forms and the like) with their plain equivalents. They don't strip diacritics by themselves, though: NFKD decomposes accented characters just like NFD does, so you'd still have to drop the combining codepoints afterwards. If you have a Unicode library, you might be able to use it to do all the hard work of normalizing.

In many cases, the database is already in some normal form, so you just have to normalize the search string.

If all that is too complicated, another approach would be to build a regex that will match any representation. For example, if your search key is telefono, you'd turn that into a regex like t(e|\u00E9|e\u0301)l(e|\u00E9|e\u0301)f(o|\u00F3|o\u0301)n(o|\u00F3|o\u0301). Those regexes can get bulky pretty fast, depending on how flexible you want the matches to be.
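Here is a sketch of how such a regex could be generated with `std::regex`, assuming UTF-8 data and an unaccented, lowercase search key. The function names and the variant table are invented for the example, and the table only covers acute accents on the five vowels. Since `std::regex` on `char` works byte by byte, each alternative is written as its UTF-8 byte sequence rather than a `\u` escape, and the longest alternative comes first so a trailing combining mark is consumed with its letter.

```cpp
#include <regex>
#include <string>
#include <unordered_map>

// Turn a plain search key into a pattern that also matches accented spellings.
std::string accent_insensitive_pattern(const std::string& key) {
    // Illustrative table: letter + U+0301, pre-composed form, plain letter.
    static const std::unordered_map<char, std::string> variants = {
        {'a', "(?:a\xCC\x81|\xC3\xA1|a)"},
        {'e', "(?:e\xCC\x81|\xC3\xA9|e)"},
        {'i', "(?:i\xCC\x81|\xC3\xAD|i)"},
        {'o', "(?:o\xCC\x81|\xC3\xB3|o)"},
        {'u', "(?:u\xCC\x81|\xC3\xBA|u)"},
    };
    static const std::string meta = "\\^$.|?*+()[]{}";

    std::string pattern;
    for (char c : key) {
        const auto it = variants.find(c);
        if (it != variants.end()) {
            pattern += it->second;
        } else {
            // User input is literal text, so escape regex metacharacters.
            if (meta.find(c) != std::string::npos) pattern += '\\';
            pattern += c;
        }
    }
    return pattern;
}

// True if `text` contains `key`, ignoring the accents listed in the table.
bool matches(const std::string& text, const std::string& key) {
    return std::regex_search(text, std::regex(accent_insensitive_pattern(key)));
}
```

This only covers one direction: an unaccented key finds accented text. If the user types "teléfono", the key would have to be stripped of its accents before the pattern is built.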

Adrian McCarthy
  • It needs to be _flexible_ because it's basically a list of songs and the search is user input, so it could literally be almost anything. I just realized I can use ICU because other things in our app are already depending on it, so it wouldn't be adding more dependencies. I gotta figure out how to do that though. – Gregorio Litenstein Dec 05 '17 at 23:03