Why does boost::filesystem::canonical() require the target path to exist?

Question

The documentation for boost::filesystem::canonical(const path& p) states:

Overview: Converts p, which must exist, to an absolute path that has no symbolic link, dot, or dot-dot elements.
...
Remarks: !exists(p) is an error.

The consequence of this is that if p identifies a symbolic link whose target does not exist, the function fails with file not found and does not return a path.

This seems overly restrictive to me: just because the target of the link doesn't exist, I see no reason why the function can't resolve the path of that non-existent target. (In comparison, absolute() imposes no such restriction.)
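A minimal sketch of the behaviour described above. It assumes C++17 `std::filesystem` rather than `boost::filesystem`, because the two share the relevant semantics here. The helper `probe_broken_link` and the file names are made up for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Outcome of calling canonical() and absolute() on a dangling symlink.
struct BrokenLinkResult {
    bool canonical_failed;  // true if canonical() reported an error
    bool absolute_failed;   // true if absolute() reported an error
};

// Hypothetical helper: creates a symlink inside `dir` whose target does not
// exist, calls both functions on it, then removes the link again.
BrokenLinkResult probe_broken_link(const fs::path& dir) {
    assert(fs::is_directory(dir));
    const fs::path link = dir / "dangling_link";
    fs::remove(link);                            // clear leftovers from an earlier run
    fs::create_symlink("no_such_target", link);  // the target is never created

    std::error_code ec_canonical;
    std::error_code ec_absolute;
    const fs::path resolved = fs::canonical(link, ec_canonical);
    const fs::path made_absolute = fs::absolute(link, ec_absolute);
    (void)resolved;       // only the error codes matter here
    (void)made_absolute;

    fs::remove(link);
    return {static_cast<bool>(ec_canonical), static_cast<bool>(ec_absolute)};
}
```

On a POSIX system I would expect `canonical_failed` to be true and `absolute_failed` to be false, which is the asymmetry the question is about.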

~~(Clearly, if a symbolic link within the path is broken, the target path can't be resolved.)~~ Even if a symbolic link within the path is broken, the hypothetical target path can be formulated from the resolvable part of the path plus the unresolvable remainder.

So, is there a legitimate justification for this restriction?

And even if there is, is there not also justification for the creation of a variant of the function that does not have this restriction? (Without such a variant, obtaining the path requires error-prone manual replication of 99% of what canonical() already does.)

I appreciate that the semantic subtleties that exist between stat() and lstat() apply equally to this case - which is precisely why I think a variant of the function is equally justified.

NB: This question is equally applicable to the std::experimental::filesystem library (n4100), which is based on boost::filesystem.

EDIT:

After @Jonathan Wakely's very knowledgeable answer below, I'm still left with the essence of my original questions, which I'll reframe slightly:

1. Is there an underlying technical or logical reason why boost::filesystem::canonical() requires the target to exist? By that I mean, does the non-existence of the target somehow make it impossible to resolve the path to canonical form?
2. If not, is there any technical or logical reason not to propose a variation of the function that differs only from the existing form in that it does not require the target to exist?
3. In the transformation (as I understand to be the case) of boost::filesystem into the proposed N4100 std::experimental::filesystem, has this restriction on canonical() been adopted after due consideration, or is it just 'falling through' from the Boost definition?

EDIT 2:

I notice that Boost 1.60 now provides the function weakly_canonical(): "Returns p with symlinks resolved and the result normalized. Returns: A path composed of the result of calling the canonical() function on a path composed of the leading elements of p that exist, if any, followed by the elements of p that do not exist, if any."
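A small sketch of what that looks like in use. It assumes C++17 `std::filesystem::weakly_canonical`, which has the same documented contract as the Boost function. The wrapper `resolve_existing_prefix` is a made-up name, and returning an empty path on error is my own choice for the illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical wrapper: canonicalises the leading part of `p` that exists and
// appends the non-existent remainder. Returns an empty path on error.
fs::path resolve_existing_prefix(const fs::path& p) {
    assert(!p.empty());
    std::error_code ec;
    fs::path result = fs::weakly_canonical(p, ec);
    return ec ? fs::path{} : result;
}
```

For a path such as `dir / "missing" / "file.txt"`, where only `dir` exists, this should yield `canonical(dir) / "missing" / "file.txt"`, whereas `canonical()` on the same path reports an error.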

EDIT 3:

More discussion of this in relation to std::filesystem.

Coming late, but... if `canonical()` gave the path to that non-existing symlink target as a result, and a second later that target *gets* created *as a symlink to a different location*, the path returned by `canonical()` earlier would break the contract of `canonical()`, no? (Just asking.) — DevSolar, Feb 18 '16 at 14:23
Yes, except that _any_ result returned by `canonical()` is potentially instantly out of date. For example the same can also be said if `canonical()` returns the path to an _existing_ file and any part of the path is then changed or deleted. In fact, since it's implemented as iterative resolution of each path component, `canonical()` is vulnerable to file system changes _while it is in progress_. — Jeremy, Feb 18 '16 at 14:37

score 10 · Answer 1 · answered Apr 07 '17 at 16:11

Try weakly_canonical(). It does not require the path to exist (on Mac, at least).

apiashko

Yes. See Edit 2 to the question. – Jeremy Apr 10 '17 at 07:32

Jonathan Wakely · Answer 2 · 2015-07-10T14:29:34.700

6

Basically because it's a wrapper for realpath which has the same requirement.

You could ask the same question of realpath, but I think the answer is this: if you're trying to find the real, physical file or directory that a pathname refers to, and that pathname is a broken symlink, then there is no answer. It doesn't refer to a real file or directory, so you want an error.

The OP's comment below questions my claim that filesystem::canonical and realpath implement the same operation, but the definitions in N4100 and POSIX seem almost identical to me, compare:

The realpath() function shall derive, from the pathname pointed to by file_name, an absolute pathname that resolves to the same directory entry, whose resolution does not involve '.', '..', or symbolic links.

and:

Converts p, which must exist, to an absolute path that has no symbolic link, ".", or ".." elements.

In both cases the requirements are:

- No symbolic links: if it returned a path whose last component is a symbolic link, that requirement would not be met.
- The canonical path refers to something that exists: this is explicit in N4100, and implicit in POSIX in that it points to some directory entry (i.e. something that exists) and the directory entry is not a symbolic link (because of the first requirement).

As to why those should be the requirements, the note in N4100 is helpful:

[Note: Canonical pathnames allow security checking of a path (e.g. does this path live in /home/goodguy or /home/badguy?) —end note]

As I already said above, if it returns successfully even when the path is a symlink that doesn't actually point to anything, then you need to do extra work to check if it resolves to a real file or not, making the intended use case less convenient.

And even if there is, is there not also justification for the creation of a variant of the function that does not have this restriction? (Without such a variant, obtaining the path requires error-prone manual replication of 99% of what canonical() already does.)

Arguably that variant would be less commonly useful, and so should not be the default, but if you need it then it's not difficult to do:

// like canonical() but allows the last component of p to be a broken symlink
filesystem::path
resolve_most_symlinks(filesystem::path const& p, filesystem::path const& base = filesystem::current_path())
{
  if (is_symlink(p) && !exists(p))
    return canonical(absolute(p, base).remove_filename()) / p.filename();
  return canonical(p);
}
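For comparison, here is a sketch of a variant that returns the location a dangling link points at, rather than the link's own path. It assumes C++17 `std::filesystem`, including `weakly_canonical`, which did not exist when this answer was written. The name `canonical_or_dangling_target` is hypothetical. It follows only one dangling link in the final component and is not hardened against concurrent file system changes.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Hypothetical helper, not part of Boost or the standard library.
// If `p` is a symlink whose target does not exist, returns the path that
// target would have; otherwise behaves like canonical(p).
fs::path canonical_or_dangling_target(const fs::path& p) {
    assert(!p.empty());
    if (fs::is_symlink(p) && !fs::exists(p)) {
        // Canonicalise the directory that holds the link itself.
        const fs::path parent = fs::canonical(fs::absolute(p).parent_path());
        // A relative link target is interpreted relative to the link's
        // directory; an absolute target replaces `parent` in operator/.
        const fs::path target = parent / fs::read_symlink(p);
        // Resolve whatever part of the target exists and keep the rest.
        return fs::weakly_canonical(target);
    }
    return fs::canonical(p);
}
```

If the link's target is itself another dangling symlink, this sketch stops there. Handling that case would need a loop with a limit on the number of links followed.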

edited Jul 10 '15 at 14:29

answered Jul 10 '15 at 10:20

Jonathan Wakely

Boost (version 1.55, at least) doesn't implement it in terms of realpath(), even under posix. – Jeremy Jul 10 '15 at 10:55
And the concept of a canonical path is not the same as the path of a real, physical file or directory. – Jeremy Jul 10 '15 at 10:58
@Jeremy, thanks for the info about Boost, I assumed it would use `realpath` (that's what I did for GCC's N4100 implementation, and the only other POSIX-based N4100 implementation I know of does). Does Boost use another OS function or do it by hand? – Jonathan Wakely Jul 10 '15 at 14:00
As for your second comment, N4100 defines `canonical` as "An absolute path that has no elements that are symbolic links, and no dot or dot-dot elements" and POSIX defines `realpath` as deriving "an absolute pathname that resolves to the same directory entry, whose resolution does not involve '.', '..', or symbolic links" ... so whatever semantics you want to give to "canonical path" or "real, physical file or directory" doesn't change the fact that `canonical` and `realpath` are defined to do the same thing (whether that's to give a canonical path, or path to a real file, or something else). – Jonathan Wakely Jul 10 '15 at 14:03
Also whatever your definition of "canonical path" is, the Filesystem TS is quite clear what its definition is, see 4.2 [fs.def.canonical-path] "An absolute path that has no elements that are symbolic links, and no dot or dot-dot elements." So clearly the operation you want cannot be called `canonical` because it could return something containing a symlink as the last element, which would not be a canonical path as defined by the TS. – Jonathan Wakely Jul 10 '15 at 16:35
No, it would return the path that the symlink in the last element resolves to - which, since it doesn't exist, can't possibly be a symlink itself, and so would perfectly satisfy the definition of a canonical path. – Jeremy Jul 10 '15 at 20:38
Your example doesn't provide the required information anyway. If `p` is `/foo/bar` and `bar` is a symlink to the nonexistent path `/baz/qux`, your example will return, erm.. `/foo/bar`. Surely we would need to do something like `return absolute(read_symlink(p), canonical(p.remove_filename()));` - but even that's not going to give us what we need because the path returned by `read_symlink()` might itself contain valid symlinks that need to be canonicalised - i.e. it needs to replicate 99% of what `canonical()` already does. – Jeremy Jul 10 '15 at 21:06
Ignoring the questionable approach to security checking represented by the goodguy/badguy use case, determining whether `p` resolves to a real file or not is a separate concern already addressed by `exists(p)`. – Jeremy Jul 10 '15 at 22:10
Ah, I misunderstood what your desired behaviour is. I still maintain that the behaviour is designed to mirror that of `realpath` which is defined by POSIX to require the file to exist. – Jonathan Wakely Jul 11 '15 at 11:42
