[Edit/Disclaimer]: Comments pointed out that I have to clarify which encoding the user uses. I will update the question accordingly.

I have a customer from China who recently reported an issue with filenames on Windows. The software works with most Chinese characters, but it seems they have found one file that fails.

Unfortunately, they are not able to send me the filename, as neither zipping the file nor transmitting it through other media seems to preserve the filename.

What is the easiest way (e.g. through Python) to generate a filename on Windows that is covered by the NTFS file system encoding but not by UTF-8?

HelloWorld
  • Please [edit] your question to provide a [mcve]. – JosefZ Oct 30 '21 at 19:06
  • Unfortunately, there is no minimal reproducible example. The question is quite clear imo – HelloWorld Oct 30 '21 at 19:07
  • The question is utterly unclear. Please give an example of at least one failing character… – JosefZ Oct 30 '21 at 19:09
  • UTF-8 supports all Unicode code points. Windows uses UTF-16 internally, which also supports all Unicode code points. Perhaps the customer has a font that doesn't support the character, or the character has not yet been assigned a code point in Unicode? – Mark Tolonen Oct 30 '21 at 19:10
  • @JosefZ I acknowledge your comment, but disagree, sorry! :-( I would give an example of a character if I knew which one it is. If I knew that, I would be able to create the file myself and figure out what's going on – HelloWorld Oct 30 '21 at 19:13
  • @Mark: Thanks a lot for your comment and the clarification! I have to think about that, but it points me in the right direction – HelloWorld Oct 30 '21 at 19:14
  • Perhaps the customer is using a Chinese encoding such as `gbk`, `gb2312` or `gb18030` and not UTF-8. In this case, even a *picture* of the character would help. I agree with @JosefZ the question is not answerable as is. – Mark Tolonen Oct 30 '21 at 19:14
  • Interesting, I wasn't aware of these encodings. I will update my question according to the answer I get from this. Thanks! – HelloWorld Oct 30 '21 at 19:15
  • An example of the X-Y problem: [1] The core issue is that _"software works with most Chinese characters, but it seems he has found one file that fails"_. No other information is provided. [2] Then you ask _"What is the easiest way ... to generate a filename on Windows that is covered by the NTFS file system encoding but not UTF8?"_. Either you have not given us the full story, or you are making an unwarranted assumption about the failure, because there could be many causes. You need to provide more information about your customer's actual problem rather than focusing on your "solution". – skomisa Oct 31 '21 at 01:22
  • @skomisa I understand your objection and I am aware of the X-Y problem, but an X-Y question is not inherently wrong. In some situations giving the full context is simply not possible. Sometimes I go with an X problem to reach Y. And I believe that a software engineer should be able to ask a question like "Which Unicode code point isn't covered by NTFS?", as it is a very precise question. If the answer is "that doesn't exist", that's a perfect answer and I know how to go on from here. Regarding clarification, I agree, I could have phrased the question better – HelloWorld Oct 31 '21 at 15:53
  • @HelloWorld [1] OK, understood, but to avoid an X-Y problem I think it would have been better to post only your final paragraph as your entire question. [2] That said, I suspect you are going down the wrong path by focusing on NTFS rather than your source and target environments. So instead I'd suggest focusing on the code page(s), locale(s) and Windows version(s) being used in your environment, and that of the user creating the problematic file. Then update your question accordingly. [3] Can you get an image of the file name? Not ideal obviously, but better than nothing. – skomisa Nov 01 '21 at 20:11
  • @skomisa Well put, I need to reflect on that. thanks for sharing! – HelloWorld Nov 02 '21 at 00:58
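
Since the filename does not survive zipping or transfer, one option raised by the comments above (an example character, or even a picture of it) can be approximated in plain ASCII. The following is a minimal sketch, assuming the customer can run Python 3 on their machine; the helper names `describe_name` and `dump_directory` are hypothetical and not part of the original question.

```python
import os


def describe_name(name):
    """Return the name as ASCII-only text: one 'U+XXXX' token per code point."""
    return " ".join(f"U+{ord(ch):04X}" for ch in name)


def dump_directory(path="."):
    """Print an ASCII-safe description of every entry in a directory."""
    for entry in os.listdir(path):
        # ascii() escapes every non-ASCII character, so the output can be
        # pasted into an e-mail or chat without being re-encoded on the way.
        print(ascii(entry), "->", describe_name(entry))


if __name__ == "__main__":
    dump_directory(".")
```

The printed code points can then be used to recreate the exact filename on another machine.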

1 Answer

Unicode strings are stored as a series of bytes. An encoding is the set of rules an operating system uses to turn those bytes back into the characters you see on screen.
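
As a small illustration of this point, here is a Python sketch showing that the same character becomes different bytes under different encodings, and that decoding with the wrong rules does not give the character back. The helper name `encode_all` and the sample character are only illustrative; `gbk` is included because it was suggested in the comments on the question.

```python
def encode_all(text, encodings=("utf-8", "utf-16-le", "gbk")):
    """Return a dict mapping each encoding name to the encoded bytes."""
    return {enc: text.encode(enc) for enc in encodings}


encoded = encode_all("中")
for enc, raw in encoded.items():
    # Same character, three different byte sequences.
    print(f"{enc:10} {raw.hex()}")

# Interpreting the GBK bytes with a Western code page yields two
# unrelated characters instead of the original one.
garbled = encoded["gbk"].decode("cp1252")
print(garbled)
```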

Given that Windows uses a variation of Unicode, and you say you have a character that's not in Unicode, it follows that there is simply no way to represent that character.

Imagine if Unicode only contained the digits 0-9, and you asked someone how to encode the letter A. There's no answer to this, because only 0-9 are defined.

You could make up a new Unicode code point for your character, but operating systems won't know what to do with it unless you also make your own font files.

I doubt that's what you want to do, but it's an option. Could your customer rename the file before sending it to you?

Evert
  • Thanks so much for the insights! But if I'm not mistaken that is not correct. In your example you define 0 to 9, but Unicode has certain unassigned codepoints. So the example would be `0-4` and `7-9`. – HelloWorld Oct 30 '21 at 19:11
  • So in my case I am looking for a code point that seems to be valid for NTFS but is not assigned as a valid Unicode code point (or, respectively, is not covered by the UTF-8 encoding) – HelloWorld Oct 30 '21 at 19:12
  • @HelloWorld No, NTFS uses UTF-16, which supports all Unicode code points (even unassigned). – Mark Tolonen Oct 30 '21 at 19:16
  • Makes sense. Understood, thanks a lot! – HelloWorld Oct 30 '21 at 19:17
  • @HelloWorld Of course, there are some characters that cannot be used in a filename, such as `\<>:`, and filenames could be limited to the current code page if the *application* is not Unicode-aware. See [Naming Files...](https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file). – Mark Tolonen Oct 30 '21 at 19:22
  • Ha, just found that link too. Thanks for the clarification! – HelloWorld Oct 30 '21 at 19:31
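
One edge case is worth adding to the comments above. NTFS stores names as sequences of 16-bit units, and Windows does not check that surrogates are properly paired, so a name containing a lone surrogate such as U+D800 can exist on disk even though strict UTF-8 refuses to encode it. Separately, as the last comments note, an application that is not Unicode-aware is limited to the current code page. Below is a minimal sketch of both checks, assuming Python 3. The helper names, the sample filenames and the `gbk` default are illustrative and not taken from the question.

```python
def strict_utf8_ok(name):
    """Return True if the name can be encoded as strict (standard) UTF-8."""
    try:
        name.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def outside_code_page(name, codec="gbk"):
    """Return the characters of name that the given legacy codec cannot encode."""
    bad = []
    for ch in name:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


lone = "report_\ud800.txt"          # contains an unpaired high surrogate
print(strict_utf8_ok(lone))         # strict UTF-8 rejects the lone surrogate
print(strict_utf8_ok("报告.txt"))    # ordinary Chinese characters are fine

# Characters that a non-Unicode application using GBK could not handle.
print([f"U+{ord(c):04X}" for c in outside_code_page("报告_\U0001F600.txt")])
```

On Windows, Python passes such a name through to the file system (its file system error handler there is `surrogatepass`), so `open(lone, "w")` is one way to produce a test file. On other platforms the same call will typically raise an error.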