Case insensitive search in Unicode in C++ on Windows

Question

I asked a similar question yesterday, but I realize that I need to rephrase it.

In short: In C++ on Windows, how do I do a case-insensitive search for a string (inside another string) when the strings are in Unicode format (wide char, wchar_t) and I don't know the language of the strings? I just want to know whether the needle exists in the haystack. The location of the needle isn't relevant to me.

Background: I have a repository containing a lot of email bodies. The messages are in different languages (Japanese, German, Russian, Finnish; you name it). All the data is in Unicode format, and I load it into wide strings (wchar_t) in my C++ application (the bodies have been MIME decoded, so in my debugger I can see the actual Japanese and German characters). I don't know the language of the messages, since email messages don't contain that detail; a single email body may also contain characters from several languages.

I'm looking for something like wcsstr, but with the ability to do the search in a case-insensitive manner. I know that it's not possible to do a 100% proper conversion from upper case to lower case without knowing the language of the text. I want a solution that works in the 99% of cases where it is possible.

I'm using Visual Studio 2008 with C++, STL and Boost.

As the Iiİı problem proves, you want to ignore more than case. You actually want an imprecise match; for instance, you also want é=e. And æ=ae, so you can't even do this on a character-by-character basis. – MSalters Oct 26 '09 at 11:43

score 4 · Answer 1 · answered Oct 24 '09 at 21:38

You have to specify the language to do case insensitive comparison. For example in Turkish, 'i' is NOT the lower case letter corresponding to 'I'. If the language appears not to be specified, then the comparison is being done with an implicitly selected language.

answered Oct 24 '09 at 21:38

Mark Thornton

My question was apparently too long this time. As I point out in my question, I'm well aware that I need to know the language to do it 100% properly. But since this is technically impossible, I'm asking for a solution which will work 99% of the time. – Nitramk Oct 25 '09 at 14:04
What is the source of the strings that you are searching for? If they are provided by a user, then the user's locale is probably appropriate. Your question also doesn't explain why you think a case insensitive search is required. – Mark Thornton Oct 25 '09 at 19:41

score 1 · Accepted Answer · answered Oct 24 '09 at 12:36

Boost String Algorithms has an icontains() function template which may do what you need.
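
For illustration, here is a minimal sketch of the kind of single-locale, character-by-character comparison that icontains() is described as doing in this thread, written with the standard library only so that it stands alone. With Boost itself the call would be boost::algorithm::icontains(haystack, needle), declared in <boost/algorithm/string/predicate.hpp>. The names equal_ignoring_case and contains_ignoring_case are hypothetical, and this is not Boost's implementation.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical comparator: two characters count as equal if they upper-case
// to the same character in the given locale (a one-to-one mapping only).
struct equal_ignoring_case {
    explicit equal_ignoring_case(const std::locale& loc) : loc_(loc) {}

    bool operator()(wchar_t a, wchar_t b) const {
        return std::toupper(a, loc_) == std::toupper(b, loc_);
    }

private:
    std::locale loc_;
};

// Hypothetical helper: true if needle occurs anywhere in haystack.
// The default argument uses the current global locale.
bool contains_ignoring_case(const std::wstring& haystack,
                            const std::wstring& needle,
                            const std::locale& loc = std::locale())
{
    // Treat an empty needle as always present.
    if (needle.empty()) {
        return true;
    }
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(),
                       equal_ignoring_case(loc)) != haystack.end();
}
```

Because each character is mapped one-to-one under a single locale, this cannot make "ß" match "SS", and which non-ASCII letters are folded at all depends on the locale that is passed in.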

answered Oct 24 '09 at 12:36

Ferruccio

It will work with both wchar_t* and std::wstring types or anything derived from std::basic_string<>. – Ferruccio Oct 24 '09 at 13:16
But it will not work for Unicode in the general case. "ß" and "SS" should compare equal, but Boost String Algorithms doesn't handle this. – dalle Jun 05 '13 at 11:41

score 0 · Answer 3 · answered Oct 24 '09 at 12:44

You should use the ICU library which provides support for Unicode regular expressions which follow the Unicode rules for case-insensitive matching. The library is available as C/C++ and Java libraries. Many other languages such as Python support a wrapper for the ICU libraries.

answered Oct 24 '09 at 12:44

Michael Dillon

I don't want to bundle a new large library just to do this. I'm looking for a solution which is available in Boost or in the Windows APIs. – Nitramk Oct 24 '09 at 12:50
I downloaded http://download.icu-project.org/files/icu4c/4.2.1/icu4c-4_2_1-Win32-msvc9.zip to check, and the .lib files add up to about 200K and the DLLs add up to about 20M. That's not a lot in this day and age, and you may not actually need all of them for what you are doing. In any case, ICU is the right way to do Unicode. – Michael Dillon Oct 24 '09 at 13:52
Considering the scope of what I'm trying to do, what problem with Ferruccio's solution does ICU solve? – Nitramk Oct 24 '09 at 16:52
The icontains documentation says that it handles case insensitive matches only within a single locale. Since you are dealing with messages in many languages, it might not work. Of course, if you have the language identity recorded along with the message, then you may be able to do it with icontains(). ICU is a full-blown solution to UNICODE text manipulation and using it pays off in the future when you can apply it to many other problems. – Michael Dillon Oct 24 '09 at 20:36
Well, as I mentioned in my question, I don't know the language of the messages. And since it's theoretically impossible to do a 100% proper case conversion without knowing the language, I still don't understand what ICU adds over icontains. To me it sounds like terrible engineering to include a 20MB library to do a string search because I may need other parts of that 20MB library some time in the future. – Nitramk Oct 25 '09 at 14:09
According to the Boost docs in another answer, icontains() requires the locale to be specified. If you don't have a locale then ICU allows for a nonspecific case-mapping that is better than nothing at all. The UNICODE spec covers case algorithms here http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G33992 and that is what ICU implements. You can use the simple case mapping defined here http://userguide.icu-project.org/transforms/casemappings and if you don't want to use full regular expressions, you can do a search http://userguide.icu-project.org/collation/icu-string-search-service – Michael Dillon Oct 25 '09 at 14:54

score 0 · Answer 4 · answered Oct 24 '09 at 21:17

You could convert both needle and haystack to lowercase (or uppercase) and then do the wcsstr().
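
A minimal sketch of that idea, assuming the standard std::towlower is an acceptable case mapping. The helper names lowercase_copy and contains_lowercased are hypothetical. Which characters std::towlower actually converts depends on the current C locale, so non-ASCII letters may be left unchanged unless setlocale() has been called with a suitable locale.

```cpp
#include <cwchar>    // std::wcsstr
#include <cwctype>   // std::towlower
#include <string>

// Hypothetical helper: return a copy of text with each wchar_t lower-cased
// individually (one-to-one mapping, driven by the current C locale).
std::wstring lowercase_copy(std::wstring text)
{
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        text[i] = static_cast<wchar_t>(std::towlower(text[i]));
    }
    return text;
}

// Hypothetical helper: lower-case both strings, then search with wcsstr.
// Both arguments must be null-terminated.
bool contains_lowercased(const wchar_t* haystack, const wchar_t* needle)
{
    const std::wstring h = lowercase_copy(haystack);
    const std::wstring n = lowercase_copy(needle);
    return std::wcsstr(h.c_str(), n.c_str()) != 0;
}
```

This copies both strings on every call, so for a large repository it may be worth lower-casing the needle once and reusing it. Like any per-character mapping, it will not find "straße" inside "STRASSE".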

answered Oct 24 '09 at 21:17

Serge Wautier
