I'm currently learning the art of Unicode programming and applying it to a personal project. I soon realized how difficult it is to get right, and even to tell whether you got it right: if the tool you check with is wrong, you can misjudge the results of your work.

My small goal in this exercise is to understand what I should pass to mkdir versus what is good for File::Path::make_path. In other words: what do they expect? Will they handle the encoding depending on the locale, or should I do it for them?

I wrote the following script, which takes arguments from @ARGV and for each of them creates the directory $_, using both functions and both the encoded and decoded forms.

#!/usr/bin/perl

use warnings;
use strict;
use utf8;
use v5.16;

use Encode;
use Encode::Locale;

use File::Path qw/make_path/;
use File::Spec;

# Everything under the './tree' directory
mkdir 'tree';
mkdir File::Spec->catdir('tree', $_)
    for ('mkdir', 'mkdir_enc', 'make_path', 'make_path_enc');

foreach (map decode(locale => $_) => @ARGV) {
    mkdir File::Spec->catdir('tree', 'mkdir', $_);
    mkdir encode(locale_fs => File::Spec->catdir('tree', 'mkdir_enc', $_));

    make_path(File::Spec->catdir('tree', 'make_path', $_));
    make_path(encode(locale_fs => File::Spec->catdir('tree', 'make_path_enc', $_)));
}

I executed the script as follows:

./unicode_mkdir.pl a→b←c

What I would expect is:

  • Either tree/mkdir [x]or tree/mkdir_enc contain directories named gibberish;
  • Either tree/make_path [x]or tree/make_path_enc contain directories named gibberish;

To my great surprise, I found that all versions work properly. I verified it with find:

$ find tree
tree
tree/mkdir_enc
tree/mkdir_enc/a→b←c
tree/mkdir
tree/mkdir/a→b←c
tree/make_path_enc
tree/make_path_enc/a→b←c
tree/make_path
tree/make_path/a→b←c

I realized that the tree command gets it wrong (a quite common disease), but at least I could see that the results are all the same:

$ tree tree
tree
├── make_path
│   └── a\342\206\222b\342\206\220c
├── make_path_enc
│   └── a\342\206\222b\342\206\220c
├── mkdir
│   └── a\342\206\222b\342\206\220c
└── mkdir_enc
    └── a\342\206\222b\342\206\220c

8 directories, 0 files

An ls -R command seems to confirm it.

$ ls -R tree
tree:
make_path  make_path_enc  mkdir  mkdir_enc

tree/make_path:
a→b←c

tree/make_path/a→b←c:

tree/make_path_enc:
a→b←c

tree/make_path_enc/a→b←c:

tree/mkdir:
a→b←c

tree/mkdir/a→b←c:

tree/mkdir_enc:
a→b←c

tree/mkdir_enc/a→b←c:

So my questions are:

  1. Am I doing it right code-wise ('course not)?

  2. Am I doing it right filesystem-wise?

  3. How can mkdir and make_path figure out and fix the wrong one?

  4. Or maybe I was just "reverse-lucky" (the kind of luck that keeps you from noticing your error, because in your case everything happens to work)? In that case, how can I test it effectively?

Any hint?

Dacav
  • Common file systems in UNIX don't have a file name encoding, i.e. they treat the file name as octets and not characters. Interpretation as characters is in your case done by the terminal and depends on the locale. – Steffen Ullrich Mar 26 '16 at 09:31
  • See also [Creating filenames with unicode characters](http://stackoverflow.com/questions/31371257/creating-filenames-with-unicode-characters) and [Could File::Find::Rule be patched to automatically handle filename character encoding/decoding?](http://stackoverflow.com/questions/31383690/could-filefindrule-be-patched-to-automatically-handle-filename-character-enc) – Håkon Hægland Mar 26 '16 at 09:33
  • @SteffenUllrich, I guess they work because my locale specifies UTF8 both for perl (`encode('locale')`) and in find/ls. Still I expected some ugly thing, as the result of `encode('utf-8', encode('utf-8', ...))` – Dacav Mar 26 '16 at 09:39
  • @HåkonHægland, so you would say that `encode` is the good thing to do, and I do agree on this. But it is not clear from the documentation (of both `mkdir` and `File::Path`) if they would internally do an encoding for me or not. The `File::Find::Rule` question is indeed very related! – Dacav Mar 26 '16 at 09:51
  • On a UNIX system you might as well omit all encode and decode, and still get the same result. The Unicode characters you type are encoded according to the locale, and then it's just octets. (`echo a→b←c | wc -c`). - But *doing it right* depends on your requirements: Is the code just for your system? Do you need portability across Windows/Mac/Unix and perhaps VMS? - Seeing the result of `echo a→b←c` on my terminal and due to other, weightier reasons, I would avoid filenames that aren't made up from US-ASCII. :-) – laune Mar 26 '16 at 10:04
  • @Dacav Yes, I think `encode()` should be used before passing argument to `mkdir` or `File::Path::make_path()`. The reason why it works without encoding first is described in `@ikegami`'s answer [here](http://stackoverflow.com/a/31371612/2173773) – Håkon Hægland Mar 26 '16 at 14:16

1 Answer

  3. How can mkdir and make_path figure out and fix the wrong one?

Perl strings have a "UTF-8 flag" that indicates whether the "characters" they contain are Unicode characters, vs. octets (eight-bit bytes). You can use the utf8::is_utf8 function (see http://perldoc.perl.org/utf8.html) to check if the UTF-8 flag is set for a given string; or you can use Dump from the Devel::Peek module, which prints out all the guts of a scalar, including the list of flags that are set.
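As a rough illustration of that check, here is a minimal sketch. The sample octets stand in for what @ARGV might deliver under a UTF-8 locale; that input, and the use of a hard-coded 'UTF-8' encoding instead of the locale, are assumptions made only to keep the example self-contained.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Devel::Peek ();

# Assumed input: the UTF-8 octets for "a→b", as a UTF-8 locale would supply them.
my $octets = "a\xE2\x86\x92b";
# Decoding turns the octets into a character string (a, U+2192, b).
my $chars  = decode('UTF-8', $octets);

printf "octets: length=%d is_utf8=%d\n",
    length $octets, utf8::is_utf8($octets) ? 1 : 0;
printf "chars:  length=%d is_utf8=%d\n",
    length $chars, utf8::is_utf8($chars) ? 1 : 0;

# Dump writes the scalar's internals to STDERR; look for UTF8 in the FLAGS line.
Devel::Peek::Dump($chars);

# Encoding the character string again gives back the original octets.
my $again = encode('UTF-8', $chars);
print "round trip ok\n" if $again eq $octets;
```

The octet string should report five "characters" with the flag off, and the decoded string three with the flag on.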

So mkdir and make_path don't need to do anything too crazy: a string with the UTF-8 flag set is already stored internally as UTF-8, and those are the octets that reach the system call. In a UTF-8 locale, that gives the same result as calling encode yourself.

(Unfortunately, the UTF-8 flag thing has a lot of quirks, and not all functions honor it; for example, encode doesn't care whether its argument has that flag set, it just trusts that you wouldn't be calling it on a string unless the string was supposed to be interpreted as a sequence of Unicode characters. But if you use modern, Unicode-aware libraries, and use utf8, and just do everything Unicode-ishly except when specifically interacting with byte-oriented external systems (which you use Encode::encode and Encode::decode for), you should be fine.)

  1. Am I doing it right code-wise ('course not)?
  2. Am I doing it right filesystem-wise?

Yes, except I think you should pay more attention to error cases. What if your input can't be represented in the locale character set? What if it can, but the result isn't a valid filename in your operating system or filesystem?

To address this, you should make two or three changes:

  • You should provide an explicit third argument to Encode::encode to specify how it should handle non-encodable characters. (The default behavior is to replace them with a replacement character, such as ? for US-ASCII; that's probably not what you want.)
  • You should examine the return-value of mkdir.
  • You may want to use the error option to make_path, and examine the resulting arrayref; or, alternatively, you may want to wrap make_path in an eval block.
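The three changes above might look roughly like the following sketch. The helper name fs_bytes is hypothetical, the hard-coded 'UTF-8' stands in for the locale_fs encoding used in the question (so that the example needs only core modules), and the temporary directory is just a safe place to experiment.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: encode a character string for the filesystem and die
# on unencodable characters instead of silently substituting them.
# LEAVE_SRC keeps encode from modifying $name in place.
sub fs_bytes {
    my ($name, $encoding) = @_;
    return encode($encoding, $name,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $base = tempdir(CLEANUP => 1);
my $name = "a\x{2192}b\x{2190}c";          # a→b←c as characters

# Assumption: 'UTF-8' plays the role of locale_fs here.
my $leaf = fs_bytes($name, 'UTF-8');

# mkdir returns false on failure and sets $!.
my $dir = File::Spec->catdir($base, $leaf);
mkdir $dir or die "mkdir failed for $dir: $!";

# make_path collects problems in an arrayref when given the error option.
my $nested = File::Spec->catdir($base, 'nested', $leaf);
make_path($nested, { error => \my $err });
if ($err && @$err) {
    for my $diag (@$err) {
        my ($path, $message) = %$diag;
        warn "make_path problem",
            ($path eq '' ? '' : " for $path"), ": $message\n";
    }
}

# An encoding that cannot represent the arrows makes fs_bytes die.
my $ascii = eval { fs_bytes($name, 'ascii') };
print "ascii encoding refused: $@" unless defined $ascii;
```

Wrapping make_path in eval instead, as the last bullet suggests, is the alternative when a single fatal error is enough and per-path diagnostics aren't needed.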
ruakh