How do you remove surrogate values from a std::string in C++? I'm looking for a regular expression like this:

string pattern = u8"[\uD800-\uDFFF]";
regex regx(pattern);
name = regex_replace(name, regx, "_");

How do you do this in a multiplatform C++ project (e.g. one built with CMake)?

1 Answer

First off, you can't store UTF-16 surrogates in a std::string (char-based). You would need a std::u16string (char16_t-based), or a std::wstring (wchar_t-based) on Windows only, where wchar_t is 16 bits wide. JavaScript strings are UTF-16 strings.
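As a rough illustration of why the string type matters, the sketch below builds the same character in both encodings. The names `is_surrogate`, `sample_utf16` and `sample_utf8` are hypothetical, and U+1F600 is just an arbitrary character outside the Basic Multilingual Plane, chosen for the example.

```cpp
#include <string>

// Hypothetical helper: true if ch is a UTF-16 surrogate code unit.
constexpr bool is_surrogate(char16_t ch) {
    return ch >= 0xD800 && ch <= 0xDFFF;
}

// U+1F600 lies outside the BMP, so UTF-16 encodes it as a surrogate pair
// (two char16_t code units: 0xD83D followed by 0xDE00).
inline std::u16string sample_utf16() { return u"\U0001F600"; }

// The same character in UTF-8 is four ordinary bytes; no surrogate
// code units are involved, so there is nothing in that range to match.
inline std::string sample_utf8() { return "\xF0\x9F\x98\x80"; }
```

In other words, a surrogate range check only has something to find when the data is held as 16-bit code units.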

For those string types, you can use either:

  • std::remove_if() + std::basic_string::erase():

    #include <string>
    #include <algorithm>
    
    std::u16string str; // or std::wstring on Windows
    ...
    str.erase(
        std::remove_if(str.begin(), str.end(),
            [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
        ),
        str.end()
    );
    
  • std::erase_if() (C++20 and later only):

    #include <string>
    
    std::u16string str; // or std::wstring on Windows
    ...
    std::erase_if(str,
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
    );
    
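For reference, the erase/remove_if variant above could be wrapped into a complete function along these lines. This is a minimal sketch; the function name `remove_surrogates` is made up for illustration and takes its argument by value so the caller's string is left untouched.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of `in` with every UTF-16
// surrogate code unit (0xD800..0xDFFF) removed.
std::u16string remove_surrogates(std::u16string in) {
    in.erase(
        std::remove_if(in.begin(), in.end(),
            [](char16_t ch) { return ch >= 0xD800 && ch <= 0xDFFF; }),
        in.end());
    return in;
}
```

For example, `remove_surrogates(u"a\U0001F600b")` drops both halves of the surrogate pair and leaves `u"ab"`.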

UPDATE: You edited your question to change its semantics. Originally, you asked how to remove surrogates; now you are asking how to replace them instead. You can use std::replace_if() for that task, e.g.:

#include <string>
#include <algorithm>

std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
    u'_'
);
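Note that std::replace_if() works on one code unit at a time, so a character outside the BMP (a surrogate pair) turns into two underscores. If a single underscore per character is preferred, a hand-written loop is one option. The sketch below is only an assumption about what might be wanted; the name `replace_surrogates` and its default replacement are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces each well-formed surrogate pair, and each
// unpaired surrogate, with a single replacement code unit.
std::u16string replace_surrogates(const std::u16string& in, char16_t repl = u'_') {
    std::u16string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char16_t ch = in[i];
        const bool high = ch >= 0xD800 && ch <= 0xDBFF;
        const bool low  = ch >= 0xDC00 && ch <= 0xDFFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            ++i;                  // skip the low half of a well-formed pair
            out.push_back(repl);
        } else if (high || low) {
            out.push_back(repl);  // unpaired surrogate
        } else {
            out.push_back(ch);    // ordinary BMP code unit, kept as-is
        }
    }
    return out;
}
```

With this, `replace_surrogates(u"a\U0001F600b")` yields `u"a_b"`, whereas the std::replace_if() version yields `u"a__b"`.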

Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:

#include <string>
#include <regex>

std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
    str,
    std::wregex(L"[\\uD800-\\uDFFF]"),
    L"_"
);
Remy Lebeau
  • I was under the impression that you can store strings with any encoding in `std::string` or `char[]`; it's just that when you manipulate them you need specialized functions to interpret them in the specified encoding. Or is that true only for ANSI and UTF-8? – bolov Apr 21 '22 at 17:02
  • Is it safe to use u16string in a multiplatform CMake project? – Rashid Mousavy Khoshrou Apr 21 '22 at 17:05
  • @bolov You can store raw binary bytes in a `std::string`, so yes, you could technically store a UTF-16 codeunit sequence in the memory of a `std::string`, but manipulating that data would be harder than if you had just used a proper UTF-16 string type to begin with. – Remy Lebeau Apr 21 '22 at 17:40
  • @RashidMousavyKhoshrou since `std::u16string` is a standard string type since C++11, I would assume yes for any C++11 compliant compiler/build chain. – Remy Lebeau Apr 21 '22 at 17:41
  • @RemyLebeau yes, that makes sense. – bolov Apr 21 '22 at 20:29
  • @RemyLebeau Thanks for the answer. I have to check this on Windows, Linux and Mac, but right now the compiler does not accept the regular expression u"[\uD800-\uDFFF]": on Windows I get error C3850, and on Ubuntu it does not compile either. Do you use a special option? – Rashid Mousavy Khoshrou Apr 21 '22 at 23:04
  • @RashidMousavyKhoshrou [Compiler Error C3850](https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-2/compiler-error-c3850) says: "*Characters represented as universal character names must represent valid Unicode code points in the range 0-10FFFF. **A universal character name cannot contain a value in the Unicode surrogate range, D800-DFFF**, or an encoded surrogate pair.*" This is also echoed in [String and character literals](https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp#surrogate-pairs). – Remy Lebeau Apr 21 '22 at 23:31
  • @RashidMousavyKhoshrou Also, it turns out `std::basic_regex` does not support `char16_t` strings yet anyway. So, I have updated my example. – Remy Lebeau Apr 21 '22 at 23:32
  • As I see it, strings are tricky when it comes to multiplatform development. People are talking about not using wide-char strings, and as far as I can tell wide chars have different sizes on different platforms. Is there a general multiplatform string solution here, or do I have to go back to std::string? – Rashid Mousavy Khoshrou Apr 23 '22 at 10:29
  • @RashidMousavyKhoshrou There are plenty of 3rd party Unicode libraries that are cross platform. – Remy Lebeau Apr 23 '22 at 15:33
  • @RemyLebeau Thank you, Remy. It's been about 20 years since I last coded in C++. Can you name the best lightweight string options for multiplatform C++ development? – Rashid Mousavy Khoshrou Apr 24 '22 at 19:15
  • @RashidMousavyKhoshrou "best" is subjective. And asking for recommendations is off-topic for StackOverflow anyway. Do some research, try out some options for yourself, and see what works for your situation. – Remy Lebeau Apr 24 '22 at 19:36