How to read a character not included in ASCII in C++?

Question

I'm going through a folder of files and editing their titles. I'm trying to remove part of each title, but the bracket that delimits that part is not a standard ASCII character, so I can't figure out how to remove it. This is a sample title: 【Remove this portion】keep this portion. I've included the code I'm using. I store the title in a CString and use CString::Find() to locate the portion, but it is unable to locate that type of bracket.

    // search handle and per-file result data
    HANDLE hfind;
    WIN32_FIND_DATA data;

    // builds the search pattern for the files to process
    CString FileFormat = FolderPath + Format;
    CString NewTitle, PulledFile;

    // retrieves the first matching file
    hfind = FindFirstFile(FileFormat, &data);

    // runs the loop only if the handle is valid
    if (hfind != INVALID_HANDLE_VALUE)
    {
        // loops until it hits the end of the folder
        do {
            // takes the file name and strips the bracketed prefix
            PulledFile = data.cFileName;
            if (PulledFile.Find(L'【') != -1)
            {
                while (PulledFile.Find(L'】') != -1)
                {
                    PulledFile = PulledFile.Right(PulledFile.GetLength() - 1);
                }
            }
            NewTitle = PulledFile.Left(PulledFile.GetLength() - (Format.GetLength() + 9));
            if (sizeof(NewTitle) != NULL)
            {
                // adds the trimmed title to the vector
                v.push_back(NewTitle);
            }
        } while (FindNextFile(hfind, &data));
    }

`if (sizeof(NewTitle) != NULL)` is very very wrong. What are you trying to do with this comparison? — andlabs, Dec 21 '15 at 02:49
It should be `NewTitle.GetLength()` instead of `sizeof(NewTitle)` And this part doesn't make any sense: `NewTitle = PulledFile.Left(PulledFile.GetLength() - (Format.GetLength() + 9));` It sets `NewTitle` to NUL. It's not a Unicode problem. — Barmak Shemirani, Dec 21 '15 at 03:35
@IInspectable he's not reading the file itself but getting a file name which is returned in one of two formats (either wide string or Unicode). Assuming you compile correctly, whatever encoding the file uses inside won't prevent you from doing what OP is trying to do. — meneldal, Dec 21 '15 at 04:29
@meneldal: Totally missed that, you are right. Except, you probably meant to say *"either MBCS or Unicode encoded"*. — IInspectable, Dec 21 '15 at 04:37
@IInspectable The MBCS is some kind of evil I'd rather not tread with anyway. Better just know that it exists and avoid it if you can. — meneldal, Dec 21 '15 at 04:44
@meneldal: I was referring to *"either wide string or Unicode"* - those are synonymous. — IInspectable, Dec 21 '15 at 04:47
@IInspectable I forgot that Unicode could also be wide strings indeed. This is why it's always so complicated. — meneldal, Dec 21 '15 at 05:03
@meneldal `Unicode could also be wide strings` - no, it's not `could also be` but rather `is always`. — dxiv, Dec 21 '15 at 05:15
@dxiv I meant that when you say Unicode (out of context) it doesn't specify the encoding, so it could be UTF-8 (normal 8-bit `char`) or UTF-16 (using `wchar_t`) or even UTF-32. What I forgot was that Windows uses `wchar_t` for Unicode. — meneldal, Dec 21 '15 at 09:01

meneldal · Accepted Answer · 2015-12-21T09:07:58.017

2

The most likely issue you're facing is that you are not compiling with the character set you expect. According to the CString documentation:

A CStringW object contains the wchar_t type and supports Unicode strings. A CStringA object contains the char type, and supports single-byte and multi-byte (MBCS) strings. A CString object supports either the char type or the wchar_t type, depending on whether the MBCS symbol or the UNICODE symbol is defined at compile time.

The actual underlying type depends on your compilation parameters. What is most likely happening is that a Unicode string is being compared with your MBCS character literal, so the search finds nothing.

If you want to fix this you should decide if you want to use Unicode or MBCS and update your compilation parameters accordingly, defining either MBCS or UNICODE.
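To illustrate how one compile-time symbol can switch the character type, here is a rough sketch of the mechanism. `SketchChar` and `kWideBuild` are invented names for illustration, not part of MFC or the Windows headers; as far as I know the real headers do something similar with `TCHAR`.

```cpp
#include <type_traits>

// Illustrative stand-in for the Windows TCHAR mechanism.
// SketchChar is a made-up name, not a real API type.
#if defined(UNICODE) || defined(_UNICODE)
using SketchChar = wchar_t;  // Unicode build: wide characters
#else
using SketchChar = char;     // MBCS/ANSI build: narrow characters
#endif

// True only when the build selected the wide character type.
constexpr bool kWideBuild = std::is_same<SketchChar, wchar_t>::value;
```

The point is only that the same source line can end up with a different character type depending on the project settings, which is why a literal and the string it is compared against have to agree.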

If you use Unicode, you will have to change your character literal because it currently works for MBCS. You can either use the code point L'\u3010', which yields the right character, or make sure your source file is saved in a Unicode encoding and use u'【'.
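As a minimal sketch of the escape-based option, the snippet below uses `std::wstring` in place of `CString` so that it stays portable. The names `kLeftBracket`, `kRightBracket` and `hasLeftBracket` are made up for illustration. It only shows that the escaped literals denote U+3010 and U+3011 no matter how the source file is saved.

```cpp
#include <string>

// U+3010 and U+3011 written as universal character names, so the
// values do not depend on the encoding of the source file.
constexpr wchar_t kLeftBracket  = L'\u3010';  // opening bracket
constexpr wchar_t kRightBracket = L'\u3011';  // closing bracket

static_assert(kLeftBracket == 0x3010, "expected code point U+3010");
static_assert(kRightBracket == 0x3011, "expected code point U+3011");

// Hypothetical helper: true if the title contains the opening bracket.
bool hasLeftBracket(const std::wstring& title) {
    return title.find(kLeftBracket) != std::wstring::npos;
}
```

If the `static_assert` lines compile, the literals carry the intended code points, which rules out the source encoding as the cause of a failed search.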

edited Dec 21 '15 at 09:07

answered Dec 21 '15 at 02:34

meneldal


A quick way to check for this kind of thing would be to run in debug mode so you see how your string gets converted (or not) and why you end up having results like this. – meneldal Dec 21 '15 at 02:38
The posted code must be compiled for Unicode already, since it calls `CString::Find(L'【')`. Character literals defined as `L'X'` are of type `wchar_t` and CStringA does not have a `Find` overload that takes a `wchar_t` argument. So in order for the code to compile at all, CString must be CStringW i.e. a Unicode compile with UNICODE defined. – dxiv Dec 21 '15 at 04:31
@dxiv Thanks for the comment. I have updated my answer and it should be good now. – meneldal Dec 21 '15 at 09:10
Thanks for the help. I wasn't aware of the difference between using MBCS and Unicode. – Brandon Nece Dec 23 '15 at 00:22

selbie · Answer 2 · 2015-12-21T03:51:45.143

2

Most likely your editor isn't properly encoding the hardcoded 【 and 】 as the Unicode characters you seek. Visual Studio sometimes gets this right by auto-encoding the source file as UTF-8, but that's not always reliable and may not survive a source control system that expects ASCII.

The easiest thing to do is use the \uNNNN syntax to match the characters.

    if(PulledFile.Find(L'\u3010') != -1)
    {
        while (PulledFile.Find(L'\u3011') != -1)
        {
            PulledFile = PulledFile.Right(PulledFile.GetLength() - 1);
        }
    }

Here \u3010 and \u3011 are the universal character names (hexadecimal escapes) for the Unicode values of 【 and 】 respectively.
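The loop above drops one character per pass until no closing bracket is left. As a rough sketch, the same trimming can be done in one step. This version uses `std::wstring` so it compiles anywhere, and `stripBracketPrefix` is a made-up name; with `CString`, `ReverseFind` and `Mid` would presumably play the roles of `rfind` and `substr`.

```cpp
#include <string>

// Hypothetical helper mirroring the loop's behavior: if the title
// contains U+3010, return everything after the last U+3011.
// Titles without both brackets are returned unchanged.
std::wstring stripBracketPrefix(const std::wstring& title) {
    if (title.find(L'\u3010') == std::wstring::npos) {
        return title;  // no opening bracket, nothing to strip
    }
    const std::wstring::size_type close = title.rfind(L'\u3011');
    if (close == std::wstring::npos) {
        return title;  // no closing bracket, leave the title alone
    }
    return title.substr(close + 1);  // keep what follows the bracket
}
```

Called on the sample title from the question, this should leave only the part after the closing bracket.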

edited Dec 21 '15 at 03:51

answered Dec 21 '15 at 02:38

selbie


An escape like `\x3010` with *4* hex digits is a MS extension, I believe. The more standard `'\u3010'`, `'\u3011'` should work as well. – dxiv Dec 21 '15 at 02:44
It might still fail because of a UTF-8/wide char comparison. VS will ask you to change your encoding if you use characters outside ASCII so I assume this wasn't the issue. – meneldal Dec 21 '15 at 03:29
@dxiv - Thanks. Answer fixed. – selbie Dec 21 '15 at 03:45
@meneldal - That's been my expectation too. But when I ran a local test in VS2015 with `const wchar_t* psz = L"【Title】";` as a test string, Visual Studio *did not* prompt or auto-encode the source. It left it as ASCII and treated the bracket chars as a literal `'?'` (0x3f). I had to *explicitly* save the source as UTF-8 to get it to work. Hence, my suggestion. I had assumed he was already building as Unicode since he was using the wide-char literal `L'【'` in code. – selbie Dec 21 '15 at 03:51
@dxiv - Oops. Thanks. – selbie Dec 21 '15 at 03:51
Thank you for your help. – Brandon Nece Dec 23 '15 at 00:22
