17

ISO C requires that hosted implementations call a function named main. If the program receives arguments, they are received as an array of char* pointers, the second argument in main's definition int main(int argc, char* argv[]).

ISO C also requires that the strings pointed to by the argv array be modifiable.

But can the elements of argv alias one another? In other words, can there exist i, j such that

  • 0 <= i && i < argc
  • 0 <= j && j < argc
  • i != j
  • 0 < strlen(argv[i])
  • strlen(argv[i]) <= strlen(argv[j])
  • argv[i] aliases argv[j]

at program start-up? If so, a write through argv[i][0] would also be seen through the aliasing string argv[j].
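To make the condition above concrete, here is a minimal sketch of how a program could test for it at startup. The helper names `strings_overlap` and `argv_has_alias` are my own, not anything from the Standard. The sketch deliberately uses only `==` on pointers, since relational comparison of pointers into different objects is undefined; the price is a quadratic scan over the bytes of each pair of strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the strings s and t (terminators included) share at least
 * one byte of storage, 0 otherwise. Only == is applied to the pointers,
 * because <, <=, > and >= are undefined for pointers into different
 * objects. */
int strings_overlap(const char *s, const char *t)
{
    size_t slen = strlen(s) + 1;   /* count the terminating null too */
    size_t tlen = strlen(t) + 1;

    for (size_t a = 0; a < slen; a++)
        for (size_t b = 0; b < tlen; b++)
            if (s + a == t + b)
                return 1;
    return 0;
}

/* Hypothetical helper: returns 1 if any two of argv[0..argc-1] overlap. */
int argv_has_alias(int argc, char *argv[])
{
    assert(argc >= 0 && argv != NULL);
    for (int i = 0; i < argc; i++)
        for (int j = i + 1; j < argc; j++)
            if (strings_overlap(argv[i], argv[j]))
                return 1;
    return 0;
}
```

Calling `argv_has_alias(argc, argv)` first thing in `main` would report both the case where two elements hold the same pointer and the case where a shorter argument is stored as the tail of a longer one.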

The relevant clauses of the ISO C Standard are below, but do not allow me to conclusively answer the titular question.

§ 5.1.2.2.1 Program startup

The function called at program startup is named main. The implementation declares no prototype for this function. It shall be defined with a return type of int and with no parameters:

int main(void) { /* ... */ }

or with two parameters (referred to here as argc and argv, though any names may be used, as they are local to the function in which they are declared):

int main(int argc, char *argv[]) { /* ... */ }

or equivalent; 10) or in some other implementation-defined manner.

If they are declared, the parameters to the main function shall obey the following constraints:

  • The value of argc shall be nonnegative.
  • argv[argc] shall be a null pointer.
  • If the value of argc is greater than zero, the array members argv[0] through argv[argc-1] inclusive shall contain pointers to strings, which are given implementation-defined values by the host environment prior to program startup. The intent is to supply to the program information determined prior to program startup from elsewhere in the hosted environment. If the host environment is not capable of supplying strings with letters in both uppercase and lowercase, the implementation shall ensure that the strings are received in lowercase.
  • If the value of argc is greater than zero, the string pointed to by argv[0] represents the program name; argv[0][0] shall be the null character if the program name is not available from the host environment. If the value of argc is greater than one, the strings pointed to by argv[1] through argv[argc-1] represent the program parameters.
  • The parameters argc and argv and the strings pointed to by the argv array shall be modifiable by the program, and retain their last-stored values between program startup and program termination.

By my reading, the answer to the titular question is "yes", since nowhere is it explicitly forbidden and nowhere does the standard urge or require the use of char* restrict*-qualified argv, but the answer might turn on the interpretation of "and retain their last-stored values between program startup and program termination.".

The practical import of this question is that if the answer to it is indeed "yes", a portable program that wishes to modify the strings in argv must first perform (the equivalent of) POSIX strdup() on them for safety.
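For illustration, the defensive copy could look roughly like the following, using only ISO C library functions (`malloc` and `memcpy`) in place of `strdup()`. The function name `copy_argv` is made up for this sketch, and the error handling is just one possible choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated, null-terminated vector
 * whose strings are private copies of argv[0..argc-1], so that writing to
 * one copy cannot affect another. Returns NULL on allocation failure. */
char **copy_argv(int argc, char *argv[])
{
    assert(argc >= 0 && argv != NULL);

    char **copy = malloc(((size_t)argc + 1) * sizeof *copy);
    if (copy == NULL)
        return NULL;

    for (int i = 0; i < argc; i++) {
        size_t n = strlen(argv[i]) + 1;   /* include the terminator */
        copy[i] = malloc(n);
        if (copy[i] == NULL) {
            while (i-- > 0)               /* release what was copied so far */
                free(copy[i]);
            free(copy);
            return NULL;
        }
        memcpy(copy[i], argv[i], n);
    }
    copy[argc] = NULL;                    /* mirror argv[argc] == NULL */
    return copy;
}
```

A program would then modify `copy[i]` instead of `argv[i]`, at the cost of one allocation per argument.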

Iwillnotexist Idonotexist
  • I disagree. The strings pointed to by `argv[i]` are modifiable, so with aliasing, modifying one would implicitly modify the other argument. That would defy the whole idea of allowing the parameters to be modified (e.g. with `strtok`). Also, that `argv` and `argc` are modifiable does not say much; they are normal local variables. The reason not to change the signature to `char * restrict *argv` is most likely legacy. Otoh the last part of "… or equivalent;10) **or in some other implementation-defined manner**." in 5.1.2.2.1p1 could be read as allowing an implementation to enforce a stricter declarator. – too honest for this site Jun 10 '18 at 01:09
  • 1
    As you mention `strdup`, I'd recommend to check POSIX, too as that seems to be your environment. `strdup` is not part of the C standard. – too honest for this site Jun 10 '18 at 01:10
  • 1
    If this kind of thing was super-important to my program I would write a test in order to check. Use the POSIX `execl` to pass an array of argument pointers that alias, then see what happens in the program. – Zan Lynx Jun 10 '18 at 01:17
  • @Olaf Your point on `strdup()` is well-taken; I’ve edited to restrict ourselves to ISO, although if POSIX has explicit guarantees I am interested as well. But it would have been very easy to require pairwise non-aliasing to ensure that `strtok()` would work on arguments, never mind requiring `char* restrict *`. The WG did not. Is that intentional (to allow for space-saving), a lack of clarity, or a defect? – Iwillnotexist Idonotexist Jun 10 '18 at 01:36
  • 2
    @ZanLynx: Let's assume the test shows separate strings are used, i.e. no aliasing. What does that prove? – too honest for this site Jun 10 '18 at 01:37
  • `argv[i]` being modifiable means that they should not alias each other, irrespective of the usage of `restrict`. One consequence of the word "modifiable" is that changing `argv[i]` in a well-formed program (and in the absence of undefined behaviour) should not have an effect of changing `argv[j]` if `i != j`. – Peter Jun 10 '18 at 01:39
  • @IwillnotexistIdonotexist: I agree there should be a clear statement. Would not be the first defect report for the standard not being fixed in multiple versions (see `volatile`). It would definitively be the next resource to look for. Anyway, all I can offer is Occam's Razor here, which would support my position. Not only about how arguments are passed to a program, but also I'd assume aliasing would break many programs. That's as much as I will contribute. – too honest for this site Jun 10 '18 at 01:43
  • 1
    @Peter: While I have the same position, it is indeed just an indicator, not a guarantee. Modifiable does not imply non-aliasing. I'd be quite surprised if there was aliasing, though. – too honest for this site Jun 10 '18 at 01:45
  • I'm fairly sure there is no `i` such that `0 <= i && i < argc` and `argv[i] == NULL`; I think your fourth and fifth restrictions are redundant. I know `argv[argc]` is `NULL` but I don't think anything before that can be. – Daniel H Jun 10 '18 at 02:25
  • @Olaf - For an implementer of a compiler/library, there is plenty of down side and little upside in implementing such aliasing, unless they prioritise point scoring in debates with language lawyers above all. (1) It takes more work during startup to set things up (e.g find repeated command line arguments, set elements of the `argv` array equal to each other) (2) users of such implementations who modify their command line arguments would be displeased if changing `argv[2][3]` changed `argv[4][3]`, and even more displeased with a language-lawyer response to a bug report saying that is permitted – Peter Jun 10 '18 at 02:26
  • @Peter No doubt that exec() is harder to implement *with* than *without* aliasing. But you could imagine an embedded system and RTOS where it would be easier to directly pass an execve()-like function’s arguments to main(). The Standard also allows all kinds of esoteric choices no modern implementation would make as a sop to obscure machines of the past. Is this one of them? Occam’s razor suggests it’s an oversight, but if so it’s a defect that should be corrected. – Iwillnotexist Idonotexist Jun 10 '18 at 02:35
  • @Olaf: It proves that on each system you run that test on, that the arguments don't alias. – Zan Lynx Jun 10 '18 at 03:29
  • @ZanLynx: Not even that! It just shows for the current run (i.e. for the given arguments, maybe that specific binary, etc.) it doesn't alias. You cannot prove this by induction, only by deduction from the specs. Which is exactly the question here. – too honest for this site Jun 10 '18 at 11:53
  • @DanielH: Your assumption is correct. `argv[argc] == NULL` actually terminates the list. There is actually no need to use `argc` at all when processing the arguments, just checking for the null-pointer works as well. – too honest for this site Jun 10 '18 at 11:56
  • @IwillnotexistIdonotexist: Bare-metal embedded systems are typically freestanding environments, hence the whole paragraph does not apply by the standard - unless the specification of the environment states it does. In this case, it should specify this surprising behaviour. Nevertheless, on such systems there are easier ways to pass arguments to `main` if needed at all (typically not or just a single integer from startup). – too honest for this site Jun 10 '18 at 12:00
  • @Olaf One more usecase I can think of is that at least on Linux, the args and environment are passed in (up to) a quarter of the main thread’s stack. If aliasing is permitted, a benevolent C runtime might choose to run an argument/environment “string compactor” before handing control to `main()` in order to maximize available stack space. – Iwillnotexist Idonotexist Jun 10 '18 at 19:41
  • @IwillnotexistIdonotexist: When discussing the standard, i.e. language-lawyering as we do here, assuming a specific implementation or details like using a stack is not a good idea. Plus I don't see how this would be relevant only to Linux. It could be true for anything. Nevertheless, as I wrote above, I'd expect this to break a lot of existing programs. Plus there is not much benefit to expect from this, as there are rarely that many **identical** arguments. Plus either the startup time would become much longer for comparing, or additional RAM would be required for a map/etc. – too honest for this site Jun 10 '18 at 19:46
  • 1
    @Peter As an embedded dev, I'm very well aware of all this. Including the not programming-related points. Nevertheless it's a refreshing interesting question compared to what's normally new in the C tag the last years. I leave it to you to draw your conclusions what this tells us about the quality of the average C question :-\ – too honest for this site Jun 10 '18 at 23:28
  • **strict-aliasing** "Strict aliasing is an assumption, made by the C or C++ compiler, that de-referencing pointers to objects of different types will never refer to the same memory location (i.e. they will not alias each other)." The question is not about "strict-aliasing" – curiousguy Jun 15 '18 at 05:41
  • @curiousguy Well, it definitely wasn't [aliasing] or [antialiasing] by their tag-wiki definition; [strict-aliasing] seemed to be the closest approximation to what I was getting at. What would you propose? You have the rep to edit the tags if you want. – Iwillnotexist Idonotexist Jun 15 '18 at 05:48
  • I don't feel like changing a well defined tag. (Other tags need some cleanup or merging though.) – curiousguy Jun 15 '18 at 11:54
  • 1
    @curiousguy I changed to [alias], since that was a tiny bit better. – Iwillnotexist Idonotexist Jun 15 '18 at 11:55

3 Answers

10

By my reading, the answer to the titular question is "yes", since nowhere is it explicitly forbidden and nowhere does the standard urge or require the use of restrict-qualified argv, but the answer might turn on the interpretation of "and retain their last-stored values between program startup and program termination.".

I concur that the standard does not explicitly forbid elements of the argument vector from being aliases of each other. I don't think the modifiability and value-retention provisions contradict that position, but they do suggest to me that the committee did not consider the possibility of aliasing.

The practical import of this question is that if the answer to it is indeed "yes", a portable program that wishes to modify the strings in argv must first perform (the equivalent of) POSIX strdup() on them for safety.

Indeed, that's exactly why I think the committee didn't even consider the possibility. If they had done then surely they would have at least included a footnote to that same effect, or else explicitly specified that the argument strings are all distinct.

I'm inclined to think that this detail escaped the committee's attention because in practice, implementations indeed do provide distinct strings, and because it is rare, moreover, for programs to modify their argument strings (though modifying argv itself is somewhat more common). If the committee agreed to issue an official interpretation in this area, then I would not be surprised for them to come down against the possibility of aliasing.

Until and unless such an interpretation is issued, however, you are right that strict conformance does not permit you to rely a priori on argv elements not being aliased.

too honest for this site
John Bollinger
  • 1
    I think the statement that strings which are modified will retain their modified value is sufficient to guarantee that the strings are not aliased, neither to other arguments nor to environment variables nor to hidden mutable state. – rici Jun 10 '18 at 03:26
  • 1
    @rici, whereas I agree, as presented in this answer, that the committee did not intend to permit the argument strings to be aliased to each other (or to environment variables), the requirement that they "retain their last-stored value" does not speak to that. Unless you start by assuming the conclusion, of course. Aliasing would just provide additional ways to store a value. – John Bollinger Jun 10 '18 at 11:55
  • @JohnBollinger: Saying that an object "retains its last stored value" implies that its value won't change in response to actions that don't involve that object either directly or via pointer derived from it. If an object is aliased to anything whose value changes by other means, that object would cease to retain its last-stored value. – supercat Jun 12 '18 at 18:36
  • But if two `argv` elements are equal, @supercat, then using either of them to access the pointed-to argument string is indeed an action involving a pointer derived from that object (by the environment). For the value-retention provision to prohibit aliasing on those grounds requires a threshold assumption of non-aliasing, so nothing is proven that way. – John Bollinger Jun 12 '18 at 20:11
  • @JohnBollinger: The abstract machine doesn't care about how the environment comes up with addresses. What matters is whether any pointer that is derived *with the abstract machine* is used to modify argv[x][y], and in your scenario the aliased pointer would not be derived within the abstract machine. – supercat Jun 12 '18 at 20:22
  • @supercat, the assertion that the abstract machine only needs to care about aliasing through pointers that are derived within depends on the assumption that none of the pointers provided to it by the environment are aliased to each other, which, again, is the conclusion you are trying to reach. – John Bollinger Jun 12 '18 at 21:03
  • @JohnBollinger: A statement that an object will hold its last stored value can only be meaningful if one knows of all other objects to which it might be aliased, and that might be written in its lifetime. – supercat Jun 12 '18 at 22:24
  • 1
    I don't see that at all, @supercat. I can and do interpret the statement exactly at face value, recognizing the distinction between pointer values and the objects to which they point. That tells me the argument strings will change only as a result of program actions and according to C semantics, not arbitrarily, and it therefore gives me a whole universe of program actions that I can be confident will not change them. That it leaves unspecified whether certain other actions will change them does not make the provision meaningless. – John Bollinger Jun 13 '18 at 00:36
6

The way it works on common *nix platforms (including Linux and Mac OS, presumably FreeBSD too) is that argv is an array of pointers into a single memory area containing the argument strings one after another (separated only by the null terminator). Using execl() does not change this--even if the caller passes the same pointer multiple times, the source string is copied multiple times, with no special behavior for identical (i.e. aliased) pointers (an uncommon case with no great benefit to optimize).
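If you want to see whether a given system actually uses the packed layout described above, a rough check is to test whether each argument starts immediately after the previous one's terminator. The function name `argv_is_contiguous` is invented for this sketch. It only uses `==`, and the pointer it forms is at most one past the end of the previous string, which is valid to compute and compare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if every argv[i + 1] begins exactly one
 * byte past the terminator of argv[i], i.e. the strings look like they sit
 * back to back in one memory area; returns 0 otherwise. */
int argv_is_contiguous(int argc, char *argv[])
{
    assert(argc >= 0 && argv != NULL);
    for (int i = 0; i + 1 < argc; i++)
        if (argv[i + 1] != argv[i] + strlen(argv[i]) + 1)
            return 0;
    return 1;
}
```

A result of 1 is only an observation about one run on one system, not something the C standard promises.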

However, C does not require this implementation. The truly paranoid may want to copy every string before modifying it, perhaps skipping the copies if memory is limited and a loop over argv shows that none of the pointers actually alias (at least among those the program intends to modify). This seems overly paranoid unless you are developing flight software or the like.

John Zwinck
  • 4
    "flight software and the like" would not be modifying command line arguments to begin with - the effort/cost to provide assurance evidence would exceed, by orders of magnitude, the benefits of allowing developers to take such shortcuts. Even if (hypothetically) such things were allowed, assurance of the toolchain (and code it emits or uses) is equally as important as assurance of the actual software being built - there would be cross-checking of assumptions made in system design against behaviour of the toolchain. The compiler and its startup code is not treated as a black box. – Peter Jun 10 '18 at 02:41
  • 1
    @Peter: I'm not sure which part of the scenario you consider "taking a shortcut." – John Zwinck Jun 10 '18 at 02:47
  • I suspect that the 'short cut' is the part where the environment which runs a program is allowed to 'reduce memory usage' by having common strings in the argument list share the same storage, so that the argument strings are not independent of each other. – Jonathan Leffler Jun 10 '18 at 03:52
  • Mission-critical software will never use command line junk or stdio.h. – Lundin Jun 11 '18 at 09:46
  • 1
    @Lundin: I am not convinced that mission-critical programs would not use `argv`. And I never mentioned anything related to `stdio.h`. – John Zwinck Jun 11 '18 at 11:50
  • @JohnZwinck Mission-critical software doesn't allow any non-deterministic aspects. That rules out the whole of Linux and similar OS. You won't have any situations where a user launches a program through command line, nor will you have any situations where a program launches another program. That whole idea revolves around having a hosted RAM-based PC system which is just a big no-go. You'd rather have dedicated hardware and de-centralize by having several microcontroller boards communicating over some bus, not through command lines nor IPC. – Lundin Jun 21 '18 at 09:55
  • @Lundin: SpaceX use Linux on their rockets. – John Zwinck Jun 23 '18 at 09:28
2

As a data point, I have compiled and run the following programs on several systems. (Disclaimer: these programs are intended to provide a data point, but as we'll see, they do not end up answering the question as stated.)

p1.c:

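#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char test[] = "test";
    /* Pass the same pointer twice. The terminating null pointer must be
       cast because execl() is variadic. */
    execl("./p2", "p2", test, test, (char *)NULL);
    perror("execl");    /* reached only if execl() fails */
    return 1;
}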

p2.c:

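#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    if (argc < 2)
        return 1;               /* nothing to modify */
    for (i = 1; i < argc; i++)  /* arguments as received */
        printf("%s ", argv[i]);
    printf("\n");
    argv[1][0] = 'b';           /* also changes argv[2] only if they alias */
    for (i = 1; i < argc; i++)  /* arguments after the write */
        printf("%s ", argv[i]);
    printf("\n");
    return 0;
}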

Every place I've tried it (under MacOS and several flavors of Unix and Linux) it has printed

test test 
best test 

Since the second line was never "best best", this proves that, on the tested systems, by the time the second program is run, the strings are no longer aliased.

Of course, this test does not prove that strings in argv can never be aliased, under any circumstances, under any system out there. I think all it proves is that, unsurprisingly, each of the tested operating systems recopies the argument list at least once between the time p1 calls execl and the time that p2 is actually invoked. In other words, the argument vector constructed by the invoking program is not used directly in the called program, and in the process of copying it, it is (again not surprisingly) "normalized", meaning that the effects of any aliasing are lost.

(I say this is not surprising because if you think about the way the exec family of system calls actually work, and the way process memory is laid out under Unix-like systems, there's no way that the invoking program's argument list could be used directly; it has to be copied, at least once, into the address space of the new, exec'ed process. Furthermore, any obvious and straightforward method of copying the argument list is always and automatically going to "normalize" it in this way; the kernel would have to do significant, extra, totally unnecessary work in order to detect and preserve any aliasing.)
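To illustrate why a straightforward copy loses the aliasing, here is a toy model of such a copy in ordinary user-space C. It is not how any particular kernel implements exec, and the name `pack_args` is invented for this sketch. It walks the source vector, copies each string it finds into one fresh block, and never looks at whether two source pointers were equal.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of exec-style argument copying (an assumption for
 * illustration, not real kernel code). Each string named by src[0..n-1]
 * is copied, in order, into one freshly allocated block that also holds
 * the new null-terminated pointer vector. Returns NULL on allocation
 * failure; the caller releases everything with a single free(). */
char **pack_args(int n, char *const src[])
{
    assert(n >= 0 && src != NULL);

    size_t bytes = 0;
    for (int i = 0; i < n; i++)
        bytes += strlen(src[i]) + 1;

    size_t vec_size = ((size_t)n + 1) * sizeof(char *);
    char **vec = malloc(vec_size + bytes);
    if (vec == NULL)
        return NULL;

    char *dst = (char *)vec + vec_size;   /* string area follows the vector */
    for (int i = 0; i < n; i++) {
        size_t len = strlen(src[i]) + 1;
        memcpy(dst, src[i], len);         /* equal source pointers are simply
                                             copied twice */
        vec[i] = dst;
        dst += len;
    }
    vec[n] = NULL;
    return vec;
}
```

Feeding it a vector in which two entries point at the same string yields two separate copies, which matches the "best test" output seen above.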

Just in case it matters, I modified the first program in this way:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char test[] = "test";
    char *argv[] = {"p2", test, test, NULL};
    execv("./p2", argv);
    perror("execv");    /* reached only if execv() fails */
    return 1;
}

The results were unchanged.


With all of this said, I agree that this issue does seem like an oversight or buglet in the standards. I'm not aware of any clause guaranteeing that the strings pointed to by argv are distinct, meaning that a paranoidly-written program probably can't depend on such a guarantee, no matter how likely it is that (as this answer demonstrates) any reasonable implementation is likely to do it that way.

Steve Summit