
I'm having trouble opening a file that has Unicode characters in its name. I created a file on my desktop with just a couple lines of text.

c:\users\james\desktop\你好世界.txt

EDIT: I'm using CLion. CLion is passing parameters in unicode.

When I put that string into the Windows run dialog, it finds the file and opens it.

Something interesting, though, is that I get double L'\\' L'\\' in the folder name from my call to CommandLineToArgvW:
L"c:\\\\users\\\\james\\\\desktop\\\\你好世界.txt"

So I wrote a small routine to copy the filename to another wchar_t * and strip the slashes. Still doesn't work.

errno == 2 and f == NULL.

size_t filename_max_len = wcslen(filename);

//strip double slashes
wchar_t proper_filename[MAX_PATH + 1];

wchar_t previous = L'\0';
size_t proper_filename_location = 0;

for(size_t x = 0; x < filename_max_len && proper_filename_location < MAX_PATH; ++x)
{
    if(previous == L'\\' && filename[x] == L'\\')
        continue;

    previous = filename[x];
    proper_filename[proper_filename_location++] = filename[x];
}

proper_filename[proper_filename_location] = L'\0';

//Read in binary mode to prevent the C system from screwing with line endings
FILE *f = _wfopen(proper_filename, L"rb");

int le = errno;

if (f == NULL)
{
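    // le is a C runtime errno value, so print its message directly and
    // compare against ENOENT (<errno.h>), not the Win32 ERROR_FILE_NOT_FOUND.
    fprintf(stderr, "%s\n", strerror(le));

    if(le == ENOENT)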
    {
        return DUST_ERR_FILE_NOT_FOUND;
    }
    else {
        return DUST_ERR_COULD_NOT_OPEN_FILE;
    }
}
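
For reference, the stripping loop above can also be written as a bounds-checked helper. This is only a minimal sketch: `collapse_backslashes` is a hypothetical name that does not appear in the original code, and, as the comments below point out, Windows tolerates repeated backslashes in a path, so this step is probably not needed to open the file.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not from the original post): copies src into dst,
 * collapsing each run of consecutive backslashes into a single backslash.
 * Returns the number of characters written, excluding the terminator,
 * or (size_t)-1 if dst_cap is too small to hold the result. */
size_t collapse_backslashes(const wchar_t *src, wchar_t *dst, size_t dst_cap)
{
    assert(src != NULL && dst != NULL);

    if (dst_cap == 0)
        return (size_t)-1;          /* no room even for the terminator */

    size_t out = 0;
    wchar_t previous = L'\0';

    for (size_t i = 0; src[i] != L'\0'; ++i) {
        if (previous == L'\\' && src[i] == L'\\')
            continue;               /* skip a repeated backslash */
        if (out + 1 >= dst_cap)
            return (size_t)-1;      /* no room for this char plus L'\0' */
        previous = src[i];
        dst[out++] = src[i];
    }

    dst[out] = L'\0';
    return out;
}
```

A caller would pass a buffer of `MAX_PATH + 1` elements and treat a return value of `(size_t)-1` as "path too long" instead of writing past the end of the array.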
Bluebaron
  • Actually errno == 2. I think that means file not found. – Bluebaron Sep 04 '15 at 19:06
  • if `errno` isn't `0` then check out what `strerror()` says – Jack Sep 04 '15 at 19:19
  • Nothing makes sense here. What do you think 4 backslashes mean? Why 4? – David Heffernan Sep 04 '15 at 19:21
  • Debuggers often double up backslashes to make the strings look like C literal strings. Maybe that's what's happening? But I see nothing too bad with your code here, so show the code that creates 'filename'. – Roddy Sep 04 '15 at 19:22
  • On Windows they're ignored. `c:\\\\\\\\\\\\\\\\\\\\\\\\\\bin` is the same as `c:\bin` – Jack Sep 04 '15 at 19:22
  • also, you need `+1` in your `malloc()` to store `L'\0'` – Jack Sep 04 '15 at 19:24
  • @Jack Better yet, no malloc or free. `wchar_t proper_filename[filename_max_len+1];` – Roddy Sep 04 '15 at 19:25
  • assuming he's using C99, that's better. – Jack Sep 04 '15 at 19:26
  • Four backslashes is two backslashes for escape code. But there's two sets of them. So I stripped the second set of back slashes giving just \\. – Bluebaron Sep 04 '15 at 20:13
  • K. I implemented all your suggestions. The output of strerror(le) is "No such file or directory." I'm wondering if this has something to do with CLion debug parameters. I think it could be passing the input as unicode or something. – Bluebaron Sep 04 '15 at 20:20
  • Here's the bytes that I get for the non-ascii portion of the string. 228, 189, 160, 229, 165, 189, 228, 184, 8211, 231, 8226, 338. I was under the impression that a wchar_t is simply 16-bits long. So each entry should represent a character and thus there should only be 4 wide characters and not 12. Please advise if I'm wrong. – Bluebaron Sep 04 '15 at 20:23
  • Please double check my edit. Posts on SO use '\' as an escape character too and may render the edit incorrect. – chux - Reinstate Monica Sep 04 '15 at 20:31
  • My apologies if I messed the post up. – chux - Reinstate Monica Sep 04 '15 at 20:32
  • How are you verifying, how many backslash characters there are? As @Roddy pointed out already, some debuggers (including Visual Studio) display backslash characters as 2 consecutive backslash characters, to mimic C's character string literal syntax. Open a memory view and inspect the characters there. – IInspectable Sep 04 '15 at 20:41
  • @chux I was aware of that and now there's twice as many slashes as there should be. – Bluebaron Sep 04 '15 at 21:04
  • @IInspectable If you look at the top of the post and some of the comments: There's two L'\\' L'\\'. The first slash in each is the escape character. – Bluebaron Sep 04 '15 at 21:06
  • Again, **how** are you determining, that there are extra backslash characters? If you are inspecting the character strings in a debugger like Visual Studio, backslash characters are **displayed** as double backslash characters. That's not the true string content. Open up a memory view and look at the contents there. This is the Real Thing™. – IInspectable Sep 05 '15 at 14:28
  • @IInspectable Again, YES they are escape slashes. Are you reading any of my comments before commenting? – Bluebaron Sep 06 '15 at 05:48
  • Using ' indicates a character and not a string. Thus, L'\\' can only be an escaped \. I also said they were escaped slashes twice above and in the original question I also put L'\\'. – Bluebaron Sep 06 '15 at 05:52
  • *"I get double `L'\\'` `L'\\'` in the folder name from my call to CommandLineToArgvW"* How are you determining, that those really are double backslash characters? When you're looking at string contents in the *Autos* or *Locals* window in Visual Studio, string literals are displayed with **extra** backslash characters. Those characters aren't there. Look at the contents in a *Memory* window instead. – IInspectable Sep 06 '15 at 11:44
  • How could they not be escaped characters? L'ab' is invalid because ' means character so L'\\' must mean escaped character. – Bluebaron Sep 06 '15 at 15:11
  • Just like L'\0' or L'\n' ... now if I said L"\\" I could see you having a problem understanding this. – Bluebaron Sep 06 '15 at 15:12
  • You aren't **constructing** the string from character literals. It is the result of calling `CommandLineToArgvW`. There are no string or character literals involved, hence the question, how you determine, that there are extra backslash characters. Why are you consistently avoiding answering this seemingly simple question? The reason behind this question has been stated multiple times already. – IInspectable Sep 07 '15 at 08:51
  • Two ways, one of which I already mentioned above `if(previous == L'\\' && filename[x] == L'\\')` and by manual inspection through the CLion debugger where I see L'\\' L'\\'. I don't know why you think I'm not smart enough to know the difference between L'\\' and L"\\\\" or why you can't understand when I say L'\\' I mean an escaped \. No one else seems to think I'm stupid. I mean ... I know few people who would be confused, after programming in almost any language for more than two days, as to what an escaped character is but apparently I found one. – Bluebaron Sep 08 '15 at 14:29
  • Not to mention we're so far off topic here. Someone else mentioned above that it couldn't possibly have anything to do with more than one slash because Windows file path slashes are idempotent and the answer has already been posted. – Bluebaron Sep 08 '15 at 14:39

2 Answers


I have figured out the issue. My hunch was correct. CLion appears to be providing unicode as input to the program. Using the Windows run dialog and passing it as a parameter to my program, I was able to open and process the file without an issue.

Bluebaron
  • I filed a bug with CLion and they've accepted and assigned the issue. https://youtrack.jetbrains.com/issue/IDEA-144864 – Bluebaron Sep 07 '15 at 17:08

My first guess is that 228, 189, 160 represents the first character of your filename encoded as a UTF-8 byte sequence since it looks like such a sequence to me. E4 BD A0 (228, 189, 160) decodes as U+4F60, which is indeed the Unicode code point corresponding to the first character.

I modified the output section of main in my sample program here to print each argument as a hex-encoded byte sequence. I copied and pasted your path as an argument to the program, and the Han characters are encoded in UTF-8 as:

E4 BD A0
E5 A5 BD
E4 B8 96
E7 95 8C
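
The decoding above can be checked by hand or with a few lines of C. The following is a minimal sketch, not the sample program mentioned above: `decode_utf8_bmp` is a hypothetical helper that handles only one- to three-byte sequences (enough for these characters) and does not reject overlong or surrogate encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes UTF-8 sequences of one to three bytes
 * from bytes[0..len) into out. Returns the number of code points written,
 * or (size_t)-1 on malformed input, a four-byte sequence, or insufficient
 * room in out. */
size_t decode_utf8_bmp(const uint8_t *bytes, size_t len,
                       uint32_t *out, size_t out_cap)
{
    assert(bytes != NULL && out != NULL);

    size_t n = 0;
    size_t i = 0;
    while (i < len) {
        uint32_t cp;
        size_t extra;                          /* continuation bytes expected */
        uint8_t b = bytes[i++];

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else return (size_t)-1;                /* stray or 4-byte lead byte */

        if (extra > len - i)
            return (size_t)-1;                 /* sequence is truncated */

        for (size_t k = 0; k < extra; ++k) {
            uint8_t c = bytes[i++];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;             /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (n == out_cap)
            return (size_t)-1;                 /* output buffer is full */
        out[n++] = cp;
    }
    return n;
}
```

Feeding it the twelve bytes listed above yields four code points: U+4F60, U+597D, U+4E16, and U+754C, which are the four characters of the filename.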

Your comment mentions slightly different numbers (specifically 8211/U+2013, 8226/U+2022, and 338/U+0152). In code page Windows-1252, bytes 0x96, 0x95, and 0x8C correspond exactly to U+2013, U+2022, and U+0152 respectively (Windows-1250 agrees on the first two but maps 0x8C to U+015A). That suggests the UTF-8 bytes were decoded one byte at a time as Windows-1252 before reaching your code. I'm guessing your original program goes wrong somewhere when it encounters Unicode input (you are using GetCommandLineW and passing that to CommandLineToArgvW, right?)
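
To illustrate the mapping, here is a minimal sketch that turns the twelve values from your comment back into bytes by reversing a Windows-1252 decode. The names `CP1252_HIGH`, `cp1252_byte_for`, and `undo_cp1252` are hypothetical, and the sketch treats the five bytes that Windows-1252 leaves unassigned as unrecoverable. It is a diagnostic aid, not a suggested fix for the program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Windows-1252 code points for bytes 0x80..0x9F; 0 marks unassigned bytes. */
static const uint16_t CP1252_HIGH[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: returns the Windows-1252 byte that decodes to ch,
 * or -1 if no byte does. */
int cp1252_byte_for(uint32_t ch)
{
    if (ch < 0x80 || (ch >= 0xA0 && ch <= 0xFF))
        return (int)ch;                  /* same value as the byte itself */
    for (int i = 0; i < 32; ++i)
        if (CP1252_HIGH[i] != 0 && CP1252_HIGH[i] == ch)
            return 0x80 + i;
    return -1;
}

/* Hypothetical helper: recovers the original bytes from characters that
 * were produced by decoding those bytes as Windows-1252. Returns the number
 * of bytes written, or (size_t)-1 if a character has no Windows-1252 byte
 * or out is too small. */
size_t undo_cp1252(const uint32_t *chars, size_t len,
                   uint8_t *out, size_t out_cap)
{
    assert(chars != NULL && out != NULL);

    if (len > out_cap)
        return (size_t)-1;
    for (size_t i = 0; i < len; ++i) {
        int b = cp1252_byte_for(chars[i]);
        if (b < 0)
            return (size_t)-1;
        out[i] = (uint8_t)b;
    }
    return len;
}
```

Applied to 228, 189, 160, 229, 165, 189, 228, 184, 8211, 231, 8226, 338, this gives back E4 BD A0 E5 A5 BD E4 B8 96 E7 95 8C, the UTF-8 encoding of the filename.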

Here's a screenshot of my output that I've edited to highlight the relevant character sequences (the ¥ glyphs are meant to be \ glyphs, but I use code page 932 for cmd.exe):

[Screenshot: program output with highlighted UTF-8 bytes]

Community