A standard command shell is a programming language whose actions are (mostly) invoking "utilities", which are executable programs. The shell's job is to set up the standard environment for the utility and then invoke it, which includes:
- Figuring out which executable corresponds to the utility to be invoked;
- Assigning stdin, stdout and stderr file descriptors for the utility to appropriate streams;
- Creating the argv argument vector and passing it to the invoked utility;
- Setting up the utility's environ global, which the utility can access through the getenv standard library function.
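As a rough illustration of those four steps, here is a minimal Python sketch. The helper name run_like_a_shell and the variable DEMO_VAR are made up for this example, and the "utility" is just a small Python script that reports what it received; a real shell does considerably more.

```python
import os
import shutil
import subprocess
import sys

def run_like_a_shell(name, args, extra_env, stdin_text):
    """Hypothetical helper mimicking the four setup steps for one utility."""
    # Resolve the utility name to an executable (PATH search, or the
    # name itself if it already contains a directory component).
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(name)
    # Build the environment the utility will see through getenv.
    env = dict(os.environ, **extra_env)
    # Wire up stdin/stdout/stderr and pass the argument vector.
    return subprocess.run(
        [path, *args],
        input=stdin_text,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
        text=True,
    )

# The "utility" is a tiny script that prints its arguments, one
# environment variable, and whatever arrives on its standard input.
REPORT = (
    "import os, sys\n"
    "print(sys.argv[1:])\n"
    "print(os.getenv('DEMO_VAR'))\n"
    "print(sys.stdin.read())\n"
)
result = run_like_a_shell(
    sys.executable,
    ["-c", REPORT, "a b", "c"],
    {"DEMO_VAR": "shell-set"},
    "from stdin",
)
print(result.stdout)
```

The child sees exactly the argument strings, environment entry, and input stream that the parent arranged for it.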
Like any programming language, the shell has values, literals, variables, and control flow. It has a syntax (and a very idiosyncratic lexical analysis algorithm). It also has other primitives which are particularly designed for its task.
As an example, /usr/bin and "this is not a sentence" are literal values in the shell language. The quotes around the second of these are not part of the value; they are part of the language's syntax for literal strings. (The shell language allows many literal strings to be written without quotes, and also includes a complicated expression language so that not all double-quoted strings are literals, but in the simple case a quoted string is not conceptually different from a quoted string in C.)
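Python's shlex module implements a simplified, POSIX-like version of the shell's word splitting, so it can be used to see that the quotes are syntax rather than part of the value. This is only an approximation of a real shell (no expansions are performed), and the command name ls is just a placeholder.

```python
import shlex

# Split a command line the way a POSIX-like shell would split it into words.
words = shlex.split('ls /usr/bin "this is not a sentence"')

for word in words:
    # repr() makes it obvious where each value starts and ends.
    print(repr(word))
```

The quoted text arrives as a single word, with the quotation marks removed and the embedded spaces preserved.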
The basic syntax and semantics are standardized by POSIX. Many commonly-used shell languages mostly conform to this standard. Almost all provide extensions; some (if not most) are not completely compatible with even the base standard unless specific options are enabled (for example, by invoking bash with the --posix command line argument). However, the basic principles are generally obeyed, and reading the POSIX link above will provide a good overview. It includes a complete grammar.
In general, the procedure is the following:
- The shell breaks the command line into "words".
- Some words are "expanded", possibly being replaced by zero or more words.
- Some words are interpreted as file-descriptor redirections; others as environment variable assignments.
- If the result is a specific shell syntax, it is executed. Otherwise, the first word is interpreted as either the name of a shell function, a builtin command, or an external utility.
- If the command resolves to an external utility, the words from the command line (other than the ones already used as redirections and assignments) are placed into an argv vector, and the utility is invoked.
It's a lot more complicated than that, but that's the basic model.
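A toy version of that model, ignoring expansion entirely, might look like the following Python sketch. The function plan_simple_command and the small operator set are inventions for this example. It also assumes redirection operators are written as separate words and cannot tell a quoted ">" from a real operator, which a real shell can.

```python
import shlex

# Only a few redirection operators, and only when written as separate words.
REDIRECT_OPERATORS = {"<", ">", ">>", "2>"}

def plan_simple_command(line):
    """Split a line into (assignments, redirections, argv). Toy model only."""
    words = shlex.split(line)          # break the line into words
    assignments, redirections, argv = {}, [], []
    i = 0
    while i < len(words):
        word = words[i]
        if word in REDIRECT_OPERATORS and i + 1 < len(words):
            # An operator consumes the following word as its target.
            redirections.append((word, words[i + 1]))
            i += 2
            continue
        name, sep, value = word.partition("=")
        if not argv and sep and name.isidentifier():
            # name=value before the command name is a variable assignment.
            assignments[name] = value
        else:
            # Everything else ends up in the argument vector.
            argv.append(word)
        i += 1
    return assignments, redirections, argv

plan = plan_simple_command('LANG=C sort -r "my file.txt" > out.txt')
print(plan)
```

Only the words left in argv would be handed to the utility; the assignment and the redirection are consumed by the shell itself.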
Invoking the utility is performed using one of the exec* family of standard library functions, which takes as arguments:
- The path to an executable
- A zero-terminated vector of pointers to strings, which will be the argv vector
- A zero-terminated vector of pointers to strings of the form name=value, which will be the environ global.
The exec call then invokes the external utility. It copies the argument vector and environ list into the utility's address space, but does not otherwise modify or validate the values, other than checking that the total size of the two lists does not exceed some system limit.
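To see that exec passes the strings through untouched, here is a small sketch using Python's os.fork and os.execve, which wrap the corresponding system calls on POSIX systems. The helper run_with_execve and the variable GREETING are hypothetical names for this example, and the invoked "utility" is simply another Python interpreter that prints what it was given.

```python
import os
import sys

def run_with_execve(path, argv, env):
    """Fork, exec `path` in the child, and return (wait status, stdout bytes)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: make the pipe its stdout, then replace the process image.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.execve(path, argv, env)
        finally:
            # Only reached if execve failed.
            os._exit(127)
    # Parent: collect whatever the utility wrote, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as stream:
        output = stream.read()
    _, status = os.waitpid(pid, 0)
    return status, output

CHILD = "import os, sys; print(sys.argv[1:], os.environ.get('GREETING'))"
status, output = run_with_execve(
    sys.executable,
    [sys.executable, "-c", CHILD, "*", "$HOME"],
    dict(os.environ, GREETING="hello"),
)
print(output.decode())
```

The arguments * and $HOME reach the child literally, because expanding them is the shell's business and no shell is involved here.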
The rest of this answer pertains to how the utility itself (might or should) parse the argument vector it receives.
There is no standard for interpreting command-line arguments, but there are guidelines and there are standard (and not-so-standard) library routines which impose a kind of de facto standard, which defines what users (might) expect.
To start with, the POSIX guidelines are (mostly) implemented by the POSIX standard getopt function. These guidelines suggest that optional arguments (those with - flags) precede all positional arguments.
However, not all POSIX utilities conform to these suggestions, and it is common to find utilities which "permute" arguments, allowing options to follow positional arguments. This mechanism is (mostly) implemented by the GNU version of getopt. In addition, GNU defines (and suggests the use of) the getopt_long function, which allows multi-character options introduced by --.
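The difference between the two conventions can be tried out with Python's getopt module, which is modelled on the C functions but is not the same code. Its getopt function stops at the first positional argument, while gnu_getopt permutes. The option letters and file names below are arbitrary choices for this example.

```python
import getopt

args = ["-a", "file1", "-b", "file2"]

# POSIX style: option scanning stops at the first non-option argument.
posix_opts, posix_rest = getopt.getopt(args, "ab")

# GNU style: options may follow positional arguments.
gnu_opts, gnu_rest = getopt.gnu_getopt(args, "ab")

# Long options introduced by "--"; a trailing "=" means "takes an argument".
long_opts, long_rest = getopt.gnu_getopt(
    ["--verbose", "input.txt", "--output=out.txt"],
    "",
    ["verbose", "output="],
)

print(posix_opts, posix_rest)
print(gnu_opts, gnu_rest)
print(long_opts, long_rest)
```

Note that gnu_getopt falls back to the non-permuting behaviour if the POSIXLY_CORRECT environment variable is set, so the second result assumes it is not.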
In all cases, how optional flag arguments are parsed depends on whether the option is defined as taking an argument or not. So -s1 word could be parsed as:
- If -s takes an argument:
  - option -s with argument "1"
  - positional argument "word"
- If -s does not take an argument and -1 is a valid flag not taking an argument:
  - option -s
  - option -1
  - positional argument "word"
- If -s does not take an argument and -1 does take an argument:
  - option -s
  - option -1 with argument "word"
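The three readings of -s1 word can be reproduced with Python's getopt module (modelled on the C getopt), simply by changing the option specification string. As in C, a colon after a letter means that option takes an argument.

```python
import getopt

args = ["-s1", "word"]

# -s takes an argument: "1" is the argument of -s.
case_a = getopt.getopt(args, "s:")

# Neither -s nor -1 takes an argument: two flags condensed into one word.
case_b = getopt.getopt(args, "s1")

# -s takes no argument, -1 does: "word" becomes the argument of -1.
case_c = getopt.getopt(args, "s1:")

for label, (opts, rest) in (("a", case_a), ("b", case_b), ("c", case_c)):
    print(label, opts, rest)
```

The same argv therefore yields three different parses, which is why the utility's own option definitions, not the shell, decide what the words mean.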
In addition to the above, there are also commands which accept "long options" starting with a single dash (and thus do not allow short options to be condensed into a single word). This is the style used by Tcl, and is followed by many GUI commands. This style can be parsed with the GNU function getopt_long_only (see previous link).
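Python has no direct binding for getopt_long_only, but its argparse module accepts option names that begin with a single dash, which is enough to illustrate the style. The option names -verbose and -geometry and the program name demo are invented for this sketch.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
# Long option names introduced by a single dash.
parser.add_argument("-verbose", action="store_true")
parser.add_argument("-geometry")
parser.add_argument("files", nargs="*")

ns = parser.parse_args(["-verbose", "-geometry", "80x24", "notes.txt"])
print(ns.verbose, ns.geometry, ns.files)
```

Here -verbose is read as one option rather than as a cluster of single-letter flags, which is the trade-off described above.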