
Has anyone dealt with using std::string functions for MBCS? For example in C I could do this:

p = _mbsrchr(path, '\\');

but in C++ I'm doing this:

found = path.find_last_of('\\');

If the trail byte is a slash then would find_last_of stop at the trail byte? Also same question for std::wstring.

If I need to replace all of one character with another, say all forward slashes with backslashes, what would be the right way to do that? Would I have to check each character for a lead surrogate and then skip the trail surrogate? Right now I'm doing this for each wchar:

if( *i == L'/' )
    *i = L'\\';

Thanks

Edit: As David correctly points out there is more to deal with when working with multibyte codepages. Microsoft says use _mbclen for working with byte indices and MBCS. It does not appear I can use find_last_of reliably when working with the ANSI codepages.

loop

1 Answer


You don't need to do anything special about surrogate pairs. A single 16-bit code unit that is one half of a surrogate pair cannot also be a non-surrogate character, so it can never compare equal to L'/' or L'\\'.

So,

if( *i == L'/' )
    *i = L'\\';

is perfectly correct.

Equally you can use find_last_of with wstring.
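
As a minimal sketch of the two wstring operations discussed here, the replacement and the search could be written with standard library calls. The helper names `to_backslashes` and `last_backslash` are made up for illustration and are not from the question.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: replace every forward slash with a backslash, in place.
// In UTF-16, surrogate code units lie in 0xD800-0xDFFF, so neither half of a
// pair can compare equal to L'/' (0x002F) and no special handling is needed.
inline void to_backslashes(std::wstring& path) {
    std::replace(path.begin(), path.end(), L'/', L'\\');
}

// Hypothetical helper: index of the last backslash, or std::wstring::npos.
inline std::wstring::size_type last_backslash(const std::wstring& path) {
    return path.rfind(L'\\');
}
```

For a single character, `rfind` and `find_last_of` return the same position, so either works here.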

It's more complicated for multi-byte ANSI codepages. You do need to deal with lead and trail byte issues. My recommendation is to normalise to a more reasonable encoding, such as UTF-16, if you really have to deal with multi-byte ANSI data.
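
To illustrate the lead and trail byte problem, here is a rough sketch of a lead-byte-aware search. It assumes the string is in code page 932 (Shift-JIS) and hard-codes that code page's lead-byte ranges; the function names are invented for this example. On Windows you would more likely rely on `_mbsrchr` or ask `IsDBCSLeadByteEx` rather than hard-coding ranges.

```cpp
#include <string>

// Assumption for this sketch: code page 932 (Shift-JIS) only.
// Its lead bytes are 0x81-0x9F and 0xE0-0xFC.
inline bool is_cp932_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Scan forward from the start so lead/trail state is always known.
// Returns the index of the last backslash that is a character in its own
// right (not a trail byte), or std::string::npos if there is none.
inline std::string::size_type find_last_backslash_cp932(const std::string& s) {
    std::string::size_type found = std::string::npos;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (is_cp932_lead_byte(b) && i + 1 < s.size()) {
            ++i;  // skip the trail byte, even if its value is 0x5C
        } else if (b == '\\') {
            found = i;
        }
    }
    return found;
}
```

The byte pair 0x95 0x5C is one double-byte character in code page 932 whose trail byte has the same value as a backslash. A plain `find_last_of('\\')` stops on that trail byte, while the forward scan above skips it.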

David Heffernan
  • Wait does the same apply to MBCS? – loop May 19 '12 at 20:06
  • David, yes, the ANSI codepages as well, because I'm pretty sure MBCS can give slash characters in the trail byte. Do you know what I should do in that situation? And thanks for your answer so far. – loop May 19 '12 at 20:18
  • Nothing to worry about in any ANSI codepage, `find_last_of` is fine there. Even with UTF-8 you can search for `'\\'` without worrying about multi-byte encoding issues. That's because all code points that require multiple bytes in UTF8 have 1 for their most significant bit. – David Heffernan May 19 '12 at 20:23
  • No, that's right. I'm forgetting the multi-byte ANSI codepages. – David Heffernan May 19 '12 at 20:28
  • So then don't use find_last_of with MBCS? Can you update your answer for that situation? – loop May 19 '12 at 20:31
  • Yeah, done that. I would try to avoid MBCS though. Do you really need to deal with it? Can't you stick to UTF-16? – David Heffernan May 19 '12 at 20:32
  • Yeah I have to write something that can be compiled with or without UNICODE/_UNICODE. What I'll probably do is call the UNICODE versions of the API functions regardless and then I'll do whatever I need to do and convert to ANSI codepage if necessary. Thanks – loop May 19 '12 at 20:36
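
The UTF-8 point made in the comments can be checked with a small sketch. Every byte of a multi-byte UTF-8 sequence has its most significant bit set, so a byte equal to '\\' (0x5C) can only be a real backslash and a plain byte search is enough. The name `last_backslash_utf8` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: a plain byte search is safe for UTF-8, because bytes
// inside multi-byte sequences are always >= 0x80 and '\\' is 0x5C.
inline std::string::size_type last_backslash_utf8(const std::string& s) {
    return s.rfind('\\');
}
```

This does not carry over to the double-byte ANSI code pages, where trail bytes may fall in the ASCII range.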