17

I am trying to understand how grep works.

When I say grep "hello" *.*, does grep get 2 arguments — (1) string to be searched i.e. "hello" and (2) path *.*? Or does the shell convert *.* into something that grep can understand?

Where can I get the source code of grep? I came across this GNU grep link. One of the README files says it's different from Unix grep. How so?

I want to look at the source of the FreeBSD version of grep and also the Linux version (if they are different).

Jonathan Leffler
hari
  • 1
    As a reference here is the UNIX man page for grep: http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?grep Here is the FreeBSD version: http://www.freebsd.org/cgi/man.cgi?query=grep and here is the Linux version: http://linux.die.net/man/1/grep – Devin M Aug 21 '11 at 07:11
  • 1
    A great place to browse old UNIX source code is http://www.tuhs.org – luser droog Aug 21 '11 at 07:49
  • @luser droog: Thanks for the link, it's amazing :) – hari Aug 22 '11 at 01:56
  • 1
    @hari: If you're brave enough, take a look at the C compiler in version 3; it's truly frightening. Pointers are declared `a[]` rather than `*a`! – luser droog Aug 22 '11 at 04:23
  • @luser droog: I can surely try! where can I look at that source? – hari Aug 22 '11 at 04:42
  • @hari: I misremembered; it's from version 2. http://minnie.tuhs.org/cgi-bin/utree.pl?file=V2/c/ncc.c – luser droog Aug 22 '11 at 05:08
  • You need to see this: "where the Grep came from" https://www.youtube.com/watch?v=NTfOnGZUZDk – Faisal Naseer May 11 '20 at 16:20

4 Answers

22

The power of grep is the magic of automata theory. The name stands for Global Regular Expression Print, after the ed command g/re/p. It works by constructing an automaton (a very simple "virtual machine", not Turing-complete) from the regular expression; it then "executes" the automaton against the input stream.

The automaton is a graph or network of nodes, called states. The transition between states is determined by the input character under scrutiny. Repetition operators like + and * work by having transitions that loop back to an earlier state. Character classes like [a-z] are represented by a fan: one start node with branches for each character out to the "spokes"; usually the spokes have a special "epsilon transition" to a single final state, so the fan can be linked up with the next sub-automaton built from the regular expression (the search string). The epsilon transitions allow a change of state without moving forward in the string being searched.

Edit: It appears I didn't read the question very closely.

When you type a command line, it is first pre-processed by the shell. The shell performs alias substitutions and filename globbing. After substituting aliases (they're like macros), the shell chops up the command line into a list of arguments (whitespace-delimited). This argument list is passed to the main() function of the executable command program as an integer count (often called argc) and a pointer to a NULL-terminated ((void *)0) array of nul-terminated ('\0') char arrays (often called argv).

Individual commands make use of their arguments however they wish. Many Unix programs will print a friendly help message if given the -h argument (since it begins with a minus sign, it's called an option), though not all do: GNU grep, for example, uses -h to suppress file name prefixes in its output. GNU software will generally also accept a "long-form" option --help.

Since there are a great many differences between different versions of Unix programs, the most reliable way to discover the exact syntax that a program requires is to ask the program itself. If that doesn't tell you what you need (or it's too cryptic to understand), you should next check the local manpage (man grep). And for GNU software you can often get even more information from info grep.

luser droog
  • 2
    Just a nit-pick: GREP is not an abbreviation for General Regular Expression Parser. It's a contraction of the vi/ex-mode command `:g/re/p` which stands for *global/regular expression/print*. – Jens Aug 23 '11 at 09:22
  • 2
    @Jens: `grep` predates `vi/ex` by a bit; it was the `ed` command `g/re/p` that it simulates, which was translated into `:g/re/p` in `vi` mode (and is also `g/re/p` in `ex`). But the gist is correct. – Jonathan Leffler Oct 23 '12 at 14:10
13

The shell does the globbing (conversion from the * form to file names). You can see this with a simple C program:

#include <stdio.h>

int main(int argc, char **argv) {
    for(int i=1; i<argc; i++) {
        printf("%s\n", argv[i]);
    }
    return 0;
}

And then run it like this:

./print_args *

You'll see it prints out what matched, not * literally. If you invoke it like this:

./print_args '*'

You'll see it gets a literal *.

icktoofay
  • Thanks for the answer. In case of `grep`, what does it get as second argument? list of files? or one file at a time? – hari Aug 21 '11 at 16:52
  • @hari: Think about `grep` as that simple C program. If you give `grep` a `*`, it will end up with a bunch of filenames as extra arguments. If you quote the `*` like `'*'`, though, then the shell won't do globbing and `grep` will only get the single `*` argument (which will probably fail unless you have a file named exactly `*` in your current directory). – icktoofay Aug 21 '11 at 20:15
  • Wow, so if that particular dir has 1000 files, `*` would give it 1000 arguments? (each as a single file)? – hari Aug 21 '11 at 21:51
  • @hari: Yes; try it. If you have a bunch of files you need to operate on like that, you may be better off using `find` and `xargs`. – icktoofay Aug 21 '11 at 21:58
  • 4
    Technically, `echo *` is more than sufficient to demonstrate the globbing. – GreyCat Aug 21 '11 at 22:43
6

The shell expands the '*.*' into a list of file names and passes the expanded list of file names to the program such as grep. The grep program itself does not do expansion of file names.

So, in answer to your question: grep does not get just 2 arguments (unless the pattern happens to match exactly one file); the shell converts '*.*' into a list of file names that grep can understand.

GNU grep is different from Unix grep in supporting extra options, such as -w and -B and -A.

It looks to me like FreeBSD uses the GNU version of grep:

Jonathan Leffler
3

How grep sees the wildcard argument depends on your shell. The standard Bourne shell has an option (-f) to disable file name globbing (see the man pages).

You can turn this option on in a script with

set -f
Jonathan Leffler
PeterMmm
  • 1
    Yes, but that's rarely used. All (?) Unix/Linux shells do wildcard expansion by default. And if you disable globbing and type `grep "hello" *.*`, `grep` will see `*.*` and treat it as a file name (and probably fail unless you happen to have a file with that name). – Keith Thompson Aug 21 '11 at 07:36