For example, consider the call:

>routine -h -s name -t "name also" -u 'name as well'

Would this return 8 arguments or more? Is there a defined standard as to how they are parsed? Where would this be located?

NOTE: I am not interested in code to do this, but the rules or standards that apply. I do not consider reading source code somewhere as documentation of the standard, which I assume must reside somewhere.

Jiminion
  • http://www.gnu.org/software/libc/manual/html_node/Getopt.html – Jun 04 '15 at 14:45
  • Assuming `>` is your prompt, if you type enter, I think the `'` won't be matched with the `"` and so the shell might raise an error – Eregrith Jun 04 '15 at 14:46
  • I'm interested in the content of argv[] itself, not how to handle them. – Jiminion Jun 04 '15 at 14:47
  • possible duplicate of [Parse string into argv/argc](http://stackoverflow.com/questions/1706551/parse-string-into-argv-argc) – LPs Jun 04 '15 at 14:47
  • Why not simply print out the contents of `argv`? E.g. `for (int a = 0; a < argc; ++a) printf("argv[%d] = %s\n", a, argv[a]);` – Some programmer dude Jun 04 '15 at 14:48
  • Parsing command line arguments is handling them. If you are not interested in that, what is your question about? – n. m. could be an AI Jun 04 '15 at 15:04
  • Assuming a normal shell, the program will never see the arguments; the shell will reject the command because of the mismatched `"` vs. `'`. – Keith Thompson Jun 04 '15 at 15:13
  • @JoachimPileborg Because that would just indicate a specific system. I am interested in knowing if a standard exists. (It sounds like I am conflating the shell's parsing rules with argument parsing; the arguments are simply shell-parsed entities shoveled into argv[].) – Jiminion Jun 04 '15 at 15:14
  • @KeithThompson The example wasn't meant to be taken ultra-literally. Does the shell accept single and double-quote delimiters? Does it accept other delimiters as well? – Jiminion Jun 04 '15 at 15:16
  • If the mismatched `"` and `'` in your example was a typo, I suggest you edit the question to correct it. How the shell converts a command line into an `argv` array is a different question than parsing the `argv` array once the program has started. – Keith Thompson Jun 04 '15 at 15:22
  • If your question is about how a standard shell interprets a command line in order to provide argc and argv to a utility, then the question has nothing to do with `C` and the title and tag are misleading. Please clarify. (If you believe that it is the utility or the exec call which does that work, then your model is incorrect.) – rici Jun 04 '15 at 16:55
  • @Jiminion: Well, I did my best to answer the implicit question. Hope it helps. – rici Jun 04 '15 at 17:14
  • Your question is still unclear. When you ask how the arguments are "parsed", I would normally assume that you're talking about something that, for example, associates the `name` argument with the preceding `-s` option. I think you're trying to ask how the command line string is split up into the elements of the array referenced by `argv`. Ask that. – Keith Thompson Jun 04 '15 at 17:23

3 Answers

The shell in use is responsible for parsing the command line and invoking exec*() appropriately. See the documentation for the specific shell in question to learn about its rules, and see its source code to see how it parses the command line.

Ignacio Vazquez-Abrams

A standard command shell is a programming language whose actions are (mostly) invoking "utilities", which are executable programs. The job the shell performs is to set up the standard environment in order to invoke the utility, which includes:

  • Figuring out which executable corresponds to the utility to be invoked;

  • Assigning stdin, stdout and stderr file descriptors for the utility to appropriate streams;

  • Creating the argv argument vector and passing it to the invoked utility;

  • Setting up the utility's environ global, which the utility can access through the getenv standard library function.

Like any programming language, the shell has values, literals, variables, and control flow. It has a syntax (and a very idiosyncratic lexical analysis algorithm). It also has other primitives which are particularly designed for its task.

As an example, /usr/bin and "this is not a sentence" are literal values in the shell language. The quotes around the second of these are not part of the value; they are part of the language's syntax for literal strings. (The shell language allows many literal strings to be written without quotes, and also includes a complicated expression language so that not all double-quoted strings are literals, but in the simple case a quoted string is not conceptually different from a quoted string in C.)

The basic syntax and semantics are standardized by Posix. Many commonly-used shell languages mostly conform to this standard. Almost all provide extensions; some (if not most) are not completely compatible with even the base standard unless specific options are enabled (bash, for example, has a Posix mode that can be enabled by invoking it with the --posix command-line option). However, the basic principles are generally obeyed, and reading the Shell Command Language chapter of the Posix standard will provide a good overview. It includes a complete grammar.

In general, the procedure is the following:

  • The shell breaks the command line into "words".
  • Some words are "expanded", possibly being replaced by zero or more words.
  • Some words are interpreted as file-descriptor redirections; others as environment variable assignments.
  • If the result is a specific shell syntax, it is executed. Otherwise, the first word is interpreted as the name of a shell function, a builtin command, or an external utility.
  • If the command resolves to an external utility, the words from the command line (other than the ones already used as redirections and assignments) are placed into an argv vector, and the utility is invoked.

It's a lot more complicated than that, but that's the basic model.

Invoking the utility is performed using one of the exec* family of standard library functions, which takes as arguments:

  • The path to an executable
  • A zero-terminated vector of pointers to strings, which will be the argv vector
  • A zero-terminated vector of pointers to strings of the form name=value, which will be the environ global.

The exec call then invokes the external utility. It copies the argument vector and environ list into the utility's address space, but does not otherwise modify or validate the values other than checking that the total size of the two lists does not exceed some system limit.


The rest of this answer pertains to how the utility itself (might or should) parse the argument vector it receives.

There is no standard for interpreting command-line arguments, but there are guidelines and there are standard (and not-so-standard) library routines which impose a kind of de facto standard, which defines what users (might) expect.

To start with, the Posix Utility Syntax Guidelines are (mostly) implemented by the Posix standard getopt function. These guidelines suggest that optional arguments (those with - flags) precede all positional arguments.

However, not all Posix utilities conform to these suggestions, and it is common to find utilities which "permute" arguments, allowing options to follow positional arguments. This mechanism is (mostly) implemented by the GNU version of getopt. In addition, GNU defines (and suggests the use of) the getopt_long function, which allows multicharacter options initiated with --.

In all cases, how optional flag arguments are parsed depends on whether the option is defined as taking an argument or not. So

-s1 word

could be parsed as:

  • If -s takes an argument:
    • option -s with argument "1"
    • positional argument "word"
  • If -s does not take an argument and -1 is a valid flag not taking an argument:
    • option -s
    • option -1
    • positional argument "word"
  • If -s does not take an argument and -1 does take an argument:
    • option -s
    • option -1 with argument "word"

In addition to the above, there are also commands which accept "long options" started with a single dash (and thus do not allow short options to be condensed into a single word). This is the style used by Tcl, and is followed by many GUI commands. This style can be parsed with the GNU function getopt_long_only, which is documented alongside getopt_long in the GNU C Library manual.

rici
  • @Jiminion: That's correct. The shell will split the command into 8 words, and that's what you will find in `argv`. If you want to know how the shell splits commands into words, that's documented by Posix (and by the various shells, which may have additional features). If you don't use a shell (say, you call some `exec*` function), then *you* supply `argv` and there is no modification whatsoever. – rici Jun 04 '15 at 16:37

POSIX defines the getopt() function and the getopts command (typically built into the shell) to parse command-line arguments.

The standard only allows for single-letter option names, so it would not support your example:

routine -h -s1 name -s2 "name also" -s3 "name as well"

NOTE: In your question, you have "name as well' at the end of your command line. This would be rejected by the shell before your routine even sees its arguments, because of the mismatched quotation marks. I'll assume that was just a typo.

It's common for commands to support extended option syntax. GNU tools, for example, commonly support long names for options, introduced by -- rather than -, in addition to the standard single-letter options. The GNU version of the getopt function is documented in the GNU C Library manual.

Keith Thompson
  • No. Not interested in the parsing of the arguments. Just the content of argv[]. I think the answer saying it depends on exec() is most correct, but the documentation on exec is a bit vague. – Jiminion Jun 04 '15 at 15:45
  • @Jiminion: Then please update your question. You have the word "parse" in the title. If you're only interested in the contents of `argv`, you need to say so in the question. Clarifying your intent in comments is not sufficient; the question itself needs to be clear. – Keith Thompson Jun 04 '15 at 15:53