Trouble trimming whitespace when piping grep to awk

Question

I am trying to write a simple wrapper for grep in order to put its output in a more readable format. This includes putting the matched string (which occurs after the second colon) on a new line, and trimming any leading whitespace/tabs from the matched string.

So instead of doing the following:

$ grep -rnIH --color=always "grape" .

./apple.config:1:   Did you know that grapes are tasty?

I would like to be able to get this:

$ grep -rnIH --color=always "grape" . | other-command

./apple.config:1:   
Did you know that grapes are tasty?

I have tried many different methods to try to do this, including using sed, awk itself, substitution, perl etc. One important thing to keep in mind is that I want to trim leading space from $3, but that $3 may not actually contain the entire matched string (for example, if the matched string contains a url with ":" characters).

So far I have gotten to the point that I have the following.

$ grep -rnIH --color=always "grape" . | \
      awk -F ":" '{gsub(/^[ \t]+/, "", $3); out=""; for(i=4;i<=NF;i++){out=out$i}; print $1":"$2"\n"$3out}'

./apple.config:1:   
    Did you know that grapes are tasty?

The gsub is intended to trim whitespace/tabs from the start of whatever occurs right after the second colon. Then the for loop is intended to build a variable made up of anything else in the matched string that may have gotten split by the field separator ":".

I greatly appreciate any help in getting the leading whitespace to be trimmed properly.

`other-command` could be `sed 's/:[[:blank:]]*/\n/2'` -- probably requires GNU sed for the "2" flag — glenn jackman, Nov 27 '15 at 19:38
@glennjackman - close! Using `2` as a flag works in at least FreeBSD's sed. The GNUism would be using `\n` in the replacement string. Make this `sed $'s/:[[:blank:]]*/\\\n/2'`, and it might in fact be portable! — ghoti, Aug 29 '17 at 05:20

fedorqui · Answer 1 · 2015-11-27T19:20:10.643

3

To me it looks like you want to match a line and, in that case, show it like

file:line_number
line with the match

For this, you can directly use awk:

awk -v OFS=":" '/pattern/ {print FILENAME, NR;  print}' files*

FILENAME stands for the file you are reading.
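NR stands for the number of records read so far. It keeps counting across all input files, so use FNR if you want the line number within the current file.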
OFS stands for Output Field Separator, so that when you say print a, b the separator is :.

To remove leading and trailing spaces, you can use gsub(/(^ *| *$)/,""). This pattern matches spaces only; write [ \t] in place of the space to cover tabs as well. All together it looks like:

awk -v OFS=":" '/and/ {print FILENAME, NR;  gsub(/(^ *| *$)/,""); print}' files*

See an example:

$ tail a b
==> a <==
hello
this is some test
         and i am done now

==> b <==
and here i am
done

Now let's try to match lines containing "and":

$ awk -v OFS=":" '/and/ {print FILENAME, NR;  gsub(/(^ *| *$)/,""); print}' a b
a:3
and i am done now
b:4
and here i am
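If you want the line numbers to restart for each file, as they do with grep -n, and want tabs trimmed as well, the command can be adapted along these lines. This is only a sketch: the function name match_report and the throwaway sample files are assumptions made for illustration.

```shell
# Hypothetical helper: print "file:line", then the matching line with
# leading/trailing spaces and tabs removed.
# Usage: match_report PATTERN FILE...
match_report() {
    local pattern=$1
    shift
    # Note: awk interprets backslash escapes in values passed with -v.
    awk -v OFS=":" -v pat="$pattern" '
        $0 ~ pat {
            print FILENAME, FNR                 # FNR restarts at 1 for every file
            gsub(/^[ \t]+|[ \t]+$/, "")         # trim spaces and tabs at both ends
            print
        }' "$@"
}

# Demo with two throwaway files shaped like "a" and "b" above
tmpdir=$(mktemp -d)
printf 'hello\nthis is some test\n \t and i am done now\n' > "$tmpdir/a"
printf 'and here i am\ndone\n' > "$tmpdir/b"
(cd "$tmpdir" && match_report and a b)
# prints:
#   a:3
#   and i am done now
#   b:1
#   and here i am
rm -rf "$tmpdir"
```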

edited Nov 27 '15 at 19:20

answered Nov 27 '15 at 19:14

fedorqui

Just totally misread the question, and thought you had misread it instead. Sorry! – miken32 Nov 27 '15 at 19:35
The only issue with this solution is that the original was colouring the output, and this does not. Otherwise, it is a fine solution. I'm not sure if a `grep --color=always -B1 grape` (or whatever word is being searched for, 'and' in the answer) as a post-processor would fix things appropriately. Possibly not; `grep` tends to put separators out between blocks of text. (For example, `printf "%s\n" a b c b d b | grep -B1 --color=always b` outputs lines containing just `--` (two dashes). I suppose a post-post-processor: `grep -v '^--$'` would deal with that, but it is getting a bit icky.) – Jonathan Leffler Nov 27 '15 at 22:33
@JonathanLeffler for such cases, there is the magic `--no-group-separator` option that prevents having those `--` in between matches. So yes, your suggestion is great! `my solution | grep --no-group-separator --color=always -B1 grape` should make it. – fedorqui Nov 27 '15 at 22:39
I prefer how your answer handles the leading whitespace trimming before adding the color escape sequences (if piping to grep as @JonathanLeffler suggests). If I was wrapping this up in a bash function as I did below, what would you suggest to keep awk from processing subdirectories thinking they are files? I can specify only files by piping find into awk, but that would create the double piping you pointed out was not ideal in my solution. Good learning experience, thanks. – user1764386 Nov 30 '15 at 21:28
@user1764386 that's a very good question and I don't have a solution for it right now. You may say `awk '...' *` and awk will show some errors when it is handed a directory, so you can discard them by redirecting stderr to /dev/null with `awk '...' * 2>/dev/null`. – fedorqui Dec 01 '15 at 10:08
@user1764386 so I just asked a question about this: [How to skip a directory in awk?](http://stackoverflow.com/q/34018063/1983854) – fedorqui Dec 01 '15 at 10:28
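
One possible way to keep directories away from awk without hiding its error messages is to let find pass it regular files only. The following is a rough sketch, not a tested replacement for grep -r: the function name search_files is made up for illustration, and unlike grep -I it makes no attempt to skip binary files.

```shell
# Hypothetical wrapper: run the awk report on regular files only.
# $1: pattern (an awk regular expression)
# $2: directory to search (defaults to the current directory)
search_files() {
    # "-type f" filters out directories; "{} +" passes many files to one awk run.
    find "${2:-.}" -type f -exec awk -v OFS=":" -v pat="$1" '
        $0 ~ pat {
            print FILENAME, FNR             # per-file line number
            gsub(/^[ \t]+|[ \t]+$/, "")     # trim spaces and tabs at both ends
            print
        }' {} +
}
```

It would be called as `search_files grape .`; the order in which files are reported is whatever order find produces.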

user1764386 · Accepted Answer · 2015-11-28T23:54:41.163

I ended up using a combination of grep, awk, and sed to solve my problem and produce the desired output format. I wanted to keep the coloured output that grep provides when the "--color=always" option is used, which initially steered me away from using awk to perform the file contents matching.

The tricky bit was that the coloured grep output was producing the color codes in unexpected locations. It was therefore not possible to trim the leading whitespace from a line that in fact began with a colour code. The second tricky part was that I needed to ensure that matched strings containing the awk field separator (":" in my case) were reproduced properly.
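
To illustrate the first problem, here is a small sketch that builds by hand a line shaped like coloured grep output and checks what the third colon-separated field starts with. The exact escape sequences are an assumption based on what GNU grep emits with its default colours; other grep versions or a custom GREP_COLORS setting may differ. The function name first_char_of_field3 is hypothetical.

```shell
esc=$(printf '\033')          # the ESC byte that starts every colour code
reset="${esc}[m${esc}[K"      # assumed "back to default colour" sequence

# Hand-built stand-in for one coloured grep line: file, ':', line number, ':', text
line="${esc}[35m${esc}[K./apple.config${reset}${esc}[36m${esc}[K:${reset}${esc}[32m${esc}[K1${reset}${esc}[36m${esc}[K:${reset}   Did you know that ${esc}[01;31m${esc}[Kgrape${reset}s are tasty?"

# Report whether field 3 begins with whitespace, an ESC byte, or something else
first_char_of_field3() {
    printf '%s\n' "$1" | awk -F ":" '{
        c = substr($3, 1, 1)
        if (c == " " || c == "\t") print "whitespace"
        else if (c == "\033")      print "escape"
        else                       print "other"
    }'
}

first_char_of_field3 "$line"                        # prints "escape"
first_char_of_field3 "./apple.config:1:   Did you"  # prints "whitespace"
```

Because the field starts with the reset sequence rather than with a space, a pattern anchored with ^[ \t]+ never matches on the coloured line.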

I made the following bash wrapper function finds() in order to recursively search file contents in a directory quickly.

#--------------------------------------------------------------#
# Search for files whose contents contain a given string.      #
#                                                              #
# Param1: Substring to recursively search for in file contents.#
# Param2: Directory in which to search for files. [optional].  #
# Return: 0 on success, 1 on failure.                          #
#--------------------------------------------------------------#
finds() {
    # Error if:
    # - Zero or more than two arguments were provided.
    # - The first argument contains an empty string.
    if [[ ( $# -eq 0  ) || ( $# -gt 2  ) || ( -z "$1" ) ]]
    then
        echo "About: Search for files whose contents contain a given string."
        echo "Usage: $FUNCNAME string [path-to-dir]"
        echo "* string     : string to recursively search for in file contents"
        echo "* path-to-dir: directory in which to search files. [OPTIONAL]"

        return 1 # Failure
    fi

    # (r)ecursively search, show line (n)umbers.
    # (I)gnore binaries, s(H)ow filenames.
    local grep_flags="-rnIH"

    # Search from the given directory, or from the current one if none was given.
    local rootdir="${2:-.}"

    # The default color code, with brackets
    # escaped by backslashes.
    def_color="\[m\[K"

    grep $grep_flags --color=always -- "$1" "$rootdir" |
    awk '
    BEGIN {
        FS = ":"
    }
    {
        print $1":"$2
        out = $3
        for(i=4; i<=NF; i++) {
            out=out":"$i
        }
        print out
    }' |
    sed -e "s/${def_color}[[:space:]]*/${def_color}/"

    return 0 # Success
}

grep is used to recursively look for matching strings in the contents of those files contained in the specified directory.
awk is used to print "filename:linenumber", then build a variable holding the remaining fields, rejoined with the field separator character ":". This allows us to recombine the rest of the matched string, in case it was divided by the initial split (e.g. urls containing "http://").
sed is used to trim any leading whitespace/tabs from the output lines. Here it matches the default color code (followed by a variable amount of space) and replaces it with itself (without the trailing space).
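
For comparison, the awk and sed steps can also be folded into a single awk filter. This is a sketch of a different technique from the function above, not the code I actually use: it locates the second colon with index() instead of looping over the fields, and it sets aside any colour codes found at the start of the text before trimming the blanks behind them. The name reformat_matches is hypothetical, and the pattern for colour codes assumes sequences of the form ESC [ digits/semicolons letter.

```shell
# Hypothetical filter: reads "file:line:text" lines on stdin (optionally coloured)
# and prints "file:line", then the text with its leading blanks removed.
reformat_matches() {
    awk '{
        p1 = index($0, ":")                        # first colon
        tail = substr($0, p1 + 1)
        p2 = index(tail, ":")                      # second colon, relative to tail
        if (p1 == 0 || p2 == 0) { print; next }    # not a match line: pass through

        head = substr($0, 1, p1 + p2 - 1)          # "file:line", second colon dropped
        text = substr(tail, p2 + 1)                # everything after it, colons included

        # Set aside any colour codes sitting in front of the text ...
        lead = ""
        if (match(text, /^(\033\[[0-9;]*[A-Za-z])+/)) {
            lead = substr(text, 1, RLENGTH)
            text = substr(text, RLENGTH + 1)
        }
        sub(/^[ \t]+/, "", text)                   # ... then trim the blanks behind them

        print head
        print lead text
    }'
}

printf '%s\n' './cheese1.txt:9:      cheese at http://cheeseisgood.com' | reformat_matches
# prints:
#   ./cheese1.txt:9
#   cheese at http://cheeseisgood.com
```

It would sit at the end of the pipeline as `grep -rnIH --color=always "grape" . | reformat_matches`.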

Setting the correct value of def_color

I am unable to display the correct value of def_color in the above codebox (the \[m\[K shown above in the code is not correct). To get the correct ANSI escape sequence to use for this variable:

Redirect the output of grep --color=always to a text file.
Copy the default colour code (the sequence that resets the colour after each highlighted item) from that file and paste it as the value of def_color in the finds() function above.
Add a "\" escape character before each bracket.

Code to write colored grep output to a text file:

$ cd orange_test/
$ cat orange1.txt
I like to eat oranges.
$ grep -r --color=always "orange" . > ./grep_out.txt
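
As an alternative to copying the bytes out of a file, the value can be built in the shell. The sketch below assumes that the default colour code is ESC[m followed by ESC[K, which is what GNU grep appears to emit with its default colours; check it against your own grep_out.txt before relying on it. The names def_color_raw and trim_after_color are hypothetical.

```shell
esc=$(printf '\033')                  # literal ESC byte
def_color="${esc}\\[m${esc}\\[K"      # for the sed pattern: brackets escaped
def_color_raw="${esc}[m${esc}[K"      # for the sed replacement: plain text

# Same substitution as in finds(): drop blanks that follow the colour code
trim_after_color() {
    sed -e "s/${def_color}[[:space:]]*/${def_color_raw}/"
}

# Stand-in for one coloured line as it leaves the awk step
sample="${def_color_raw}   Did you know that grapes are tasty?"
printf '%s\n' "$sample" | trim_after_color | cat -v
# prints: ^[[m^[[KDid you know that grapes are tasty?
```

Keeping the pattern and the replacement in separate variables avoids relying on how sed treats a backslash-escaped bracket in the replacement text.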

Using the function

The following shows the output produced by the function. Note that you can also specify a directory path in the second parameter.

cheese_test/cheese1.txt

I like to eat cheese.

    Do you all like cheese?

   I like
when the cheese is
on my pizza.

you can find out more about
      cheese at http://cheeseisgood.com

cheesestick

Nice to have a documented answer. However, I don't think having `grep | awk | sed` is very optimal. In general, whenever you have so many pipes, scratch your head and think if `awk` can handle it alone. As I said in my answer, `awk` alone can provide you all the output you are looking for in a more robust way. And if you need the colours, check what Jonathan Leffler suggested below. — fedorqui, Nov 30 '15 at 10:08
