
I’m trying to write a console application that can accept filename arguments and want it to be able to handle Unicode filenames. The problem is that I cannot figure out how to test it.

How can you pass Unicode arguments to a console app?

I tried creating a Unicode batch file that calls the program, passing it some Unicode characters, but it doesn’t work; the command-prompt can’t launch the program at all because it gets tripped up on the null-characters in its filename. I tried changing the code page to 65001 and Alt-typing a Unicode character at the command-line, but that didn’t work either.

Below is a sample program. I’m trying to find a way to get the following output:

C:\> unicodeargtest Foobar
46, 0, 6f, 0


// UnicodeArgTest.cpp
#define UNICODE
#include <tchar.h>
#include <stdio.h>
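int wmain (int argc, wchar_t**argv) {
    if (argc < 2) return 1;  // no argument supplied
    // Print the first four elements of the first argument in hex
    printf("%x, %x, %x, %x\n", argv[1][0], argv[1][1], argv[1][2], argv[1][3]);
    return 0;
}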
Synetech
  • Does cmd.exe support Unicode? I think it doesn't (due to compatibility with old DOS programs). Maybe that's the source of the problem, so try using MS PowerShell. – Raxillan Mar 26 '12 at 06:19
  • @Raxillan: It definitely does. CMD.EXE has no compatibility with old DOS programs; it's fully 32-bit (or 64-bit). It does have some compatibility with batch files, but DOS batch files are ASCII text-based and ASCII is a subset of Unicode. – MSalters Mar 26 '12 at 08:39
  • @MSalters In Windows 7 only, or in previous versions too? So do you think typing "chcp 65001" and inserting some Unicode symbol into the command line should work? As I understand it, the author did this, with no result. – Raxillan Mar 26 '12 at 08:52
  • @Raxillan: No, `CHCP` is not the right approach. `65001` is the UTF-8 codepage, but Windows uses UTF-16 for Unicode. `CHCP` only works for double-byte character sets, and UTF-8 may take as many as 4 bytes. This is the same from NT 3.1 to 8. – MSalters Mar 26 '12 at 09:07
  • @MSalters So, what is the right way for using UTF-8? – Raxillan Mar 26 '12 at 09:25
  • @Raxillan: There isn't. CMD.EXE supports Unicode, but not UTF-8. Your alternative, MS PowerShell uses .Net Strings, which also are UTF-16. – MSalters Mar 26 '12 at 09:31
  • @MSalters Thanks, it's now clear to me. I have to stop this mini-chat. – Raxillan Mar 26 '12 at 09:42

2 Answers

Oh blerg! It happened again. I come from an assembler background, so occasionally some C++ stuff trips me up. One thing that I keep forgetting is how in C++, the compiler takes the liberty of automatically compensating for type sizes when computing indexes, pointers, and such.

For example:

DWORD dwa[4] = {1,2,3,4};
//dwa[2] references the third DWORD in the array (i.e., it starts at the ninth BYTE),
//NOT the third BYTE in the array

or

struct EGS {
    char  str[5];
    int   num;
};
EGS   eg = {0};
EGS* peg = &eg;
peg++;
//peg is incremented by a whole EGS’ worth of bytes, NOT just 1
//for EGS, that is typically 12 bytes: 5 for str, 3 of padding so that num is 4-byte aligned, and 4 for num
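
The scaling can be checked directly by converting the pointers to byte pointers and subtracting them. This is only a minimal sketch: `std::uint32_t` stands in for `DWORD` so that it builds without `windows.h`, and the helper names `dwordIndexByteOffset` and `egsStride` are made up for illustration. The exact size of `EGS` depends on the compiler's alignment rules (12 is typical), so the stride is best compared against `sizeof(EGS)` rather than a hard-coded number.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct EGS {
    char str[5];
    int  num;
};

// Byte offset of element `index` in an array of four 32-bit values.
// `index` must be in the range 0..4 (4 is the one-past-the-end position).
std::size_t dwordIndexByteOffset(std::size_t index) {
    assert(index <= 4);
    std::uint32_t dwa[4] = {1, 2, 3, 4};
    const unsigned char* base = reinterpret_cast<const unsigned char*>(dwa);
    const unsigned char* elem = reinterpret_cast<const unsigned char*>(dwa + index);
    return static_cast<std::size_t>(elem - base);
}

// Number of bytes that a single peg++ moves an EGS pointer.
std::size_t egsStride() {
    EGS pair[2] = {};
    EGS* peg = &pair[0];
    const unsigned char* before = reinterpret_cast<const unsigned char*>(peg);
    peg++;  // the compiler scales this by sizeof(EGS)
    const unsigned char* after = reinterpret_cast<const unsigned char*>(peg);
    return static_cast<std::size_t>(after - before);
}
```

Here `dwordIndexByteOffset(2)` returns 8 (two whole 32-bit elements), and `egsStride()` returns `sizeof(EGS)`.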

In this case, because the arguments are being interpreted as wide (2-byte) characters, argv[1][1] isn’t the null high byte of the first character; it is the second wide character of the argument.

Using the program as is and passing a Unicode character, I get this:

C:\>unicodeargtest ‽‽‽‽
203d, 203d, 203d, 203d

I simply pasted the interrobangs into the command-prompt. In my normal command-prompt mode (using Raster Fonts and code-page 437), they display as ? instead of ‽, but it still gives the same results.


By casting the argument pointer to char* or BYTE*, like so:

printf("%x, %x, %x, %x\n",
    ((BYTE*)(argv[1]))[0], ((BYTE*)(argv[1]))[1],
    ((BYTE*)(argv[1]))[2], ((BYTE*)(argv[1]))[3]
);

I get the expected results:

C:\>unicodeargtest ‽‽‽‽
3d, 20, 3d, 20

C:\>unicodeargtest Foobar
46, 0, 6f, 0
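
The byte order in that output follows from how a little-endian machine stores UTF-16: each 16-bit code unit is laid out low byte first. The sketch below reproduces the same numbers portably by splitting each code unit by hand instead of casting pointers. It rests on two assumptions: `char16_t` stands in for the 16-bit `wchar_t` used on Windows, and `utf16leBytes` is a hypothetical helper name, not part of any library.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split each UTF-16 code unit into two bytes, low byte first, which is the
// order a little-endian machine uses when storing a 16-bit value in memory.
std::vector<unsigned> utf16leBytes(const std::u16string& text) {
    std::vector<unsigned> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(unit & 0xFFu);         // low byte
        bytes.push_back((unit >> 8) & 0xFFu);  // high byte
    }
    assert(bytes.size() == text.size() * 2);
    return bytes;
}
```

With this helper, `utf16leBytes(u"Fo")` yields 0x46, 0x00, 0x6f, 0x00 and `utf16leBytes(u"\u203D")` yields 0x3d, 0x20, matching the leading bytes of the two runs above.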

Pasting Unicode characters works, but using a batch file still doesn’t. A Unicode one still has the problem with the program’s filename being interpreted incorrectly due to the null-characters, and saving it as UTF-8 causes it to not run at all.

Synetech
  • For the record, that's not a C++ feature. C behaves exactly the same. – Harry Johnston Mar 26 '12 at 07:03
  • That's hardly a revelation! Every language I know does that. – David Heffernan Mar 26 '12 at 07:16
  • Batch files should work in Unicode. Try using Notepad "Save as Unicode", that should get you the right format. (with Byte Order Mark, to be precise) – MSalters Mar 26 '12 at 08:43
  • @MSalters, have you tried it? I did and like I said, it didn’t work; the program name has null-characters: `'■u' is not recognized as an internal or external command, operable program or batch file.` (The character before the `u` is a null.) – Synetech Mar 26 '12 at 16:56
  • Eh, no, had that in some corner of my mind. You're right - doesn't work as I thought. CMD.EXE supports Unicode in its `TYPE` command and others, but not everywhere. So even though it can show the contents of a batch file correctly, it can't run it. :\ – MSalters Mar 26 '12 at 22:20

Drag-and-drop should do the trick. In Explorer, drag the file whose name you want to pass as an argument onto the test executable. (You might first want to change the executable so that it waits before exiting.)

Harry Johnston