search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

Question

I need to search for a string (let's use 4320101), print the 20 lines above it, and then keep printing the lines after it until the string </eventUpdate> is found.

For example:

Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line

I just want the following result outputted to a file:

16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>

There are multiple examples of these groups of text in a file that I want.

I tried using this below:

cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt

But it only ever shows for 20 lines either side of the string.

Notes:
- the line number the text is on is inconsistent so I cannot go off these, hence why I am using -A 20.
- ideally, when it searches after the string, it should stop when it finds </eventUpdate> and then carry on searching for the next match.

Summary: find 4320101, output the 20 lines above 4320101 (or back to one line of white space), and then output all lines below 4320101 up to </eventUpdate>.

Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.

`-A` is the number of lines grep will print `A`fter the matched line. If you don't want any lines after the matched line, why are you asking for 20? Also I don't understand your comment about line numbers being inconsistent, but if you want 20 lines `B`efore, use `-B 20` — rici, May 22 '13 at 15:23
I see. This should work: cat filename | grep 4320101 -A 100 -B 20 , but it returns: : No such file or directory — Zippyduda, May 22 '13 at 15:53
I realise the way -A and -B works is that it wants value filename so it thinks eventUpdate is a file. However I want to just search up to eventUpdate (giving it 100 lines to find it below 4320101) — Zippyduda, May 22 '13 at 15:59
There isn't a standard tool to do this job. I have a 70-line (non-minimal) Perl script that does lines before and after; it would have to be modified to handle 'after until alternative pattern', but that is a lot easier than 'from specified pattern before 4320101'. I also have a shell script version that uses `ed` (and `sed` and `sort` and `sed` again) on each file; that would be easiest to adapt. Contact me (see my profile) if you want these; they're not fully worked solutions to your problem, but might be useful stepping stones to providing it. – Jonathan Leffler, May 22 '13 at 16:53
What do you want to happen if 2 ranges overlap, e.g. if 2 occurrences of 4320101 lie within 20 lines of each other? Print all the lines once? Print the lines that lie within both ranges twice? — Ed Morton, May 23 '13 at 12:39
It only appears once within 20 lines above itself (that time being itself) and up to eventUpdate, which can never be 21 lines above the next occurrence. – Zippyduda, May 24 '13 at 08:03
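
For reference, the grep behaviour discussed in these comments can be sketched as follows. The sample file and the small context sizes (1 and 2 instead of 20 and 100) are made up for illustration; the point is that grep takes one pattern and then the file name, and that -A always prints a fixed number of lines rather than stopping at a second pattern.

```shell
# Hypothetical sample data standing in for the real file
sample=$(mktemp)
printf '%s\n' 'noise' '16 Apr 2013 00:14:15' 'id="4320101"' \
    '</eventUpdate>' 'noise' 'more noise' > "$sample"

# -B N prints N lines before each match, -A N prints N lines after it.
# The trailing context is a fixed count; it does not stop at </eventUpdate>.
out=$(grep -B 1 -A 2 4320101 "$sample")
printf '%s\n' "$out"
rm -f "$sample"
```

With this input the trailing context runs one line past </eventUpdate>, which is why the answers below use awk or sed instead.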

score 1 · Answer 1 · answered May 22 '13 at 16:30

Here is an ugly awk solution :)

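awk 'BEGIN { last = 0 }
# Remember the most recent blank line or line containing "Random"
length($0) == 0 || /Random/ { last = NR }
/4320101/ {
    flag = 1
    if (NR - last > 21) last = NR - 21     # look back at most 20 lines
    if (last + 1 <= NR - 1) {
        fflush()                           # keep awk and sed output in order
        system("sed -n \"" (last+1) "," (NR-1) "p\" \"" FILENAME "\"")
    }
}
flag == 1 { print }
/eventUpdate/ { flag = 0 }' filename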

Basically, it keeps track of the last blank line, or the last line containing the Random pattern, in the variable last. When 4320101 is found, it prints from 20 lines back or from last, whichever is nearer, through a system sed command, and sets the flag. The flag causes the following lines to be printed until eventUpdate is found. I have not tested it, but it should work.

qwwqwwq · Answer 2 · 2013-05-22T17:35:23.550

Look-behind in sed/awk is always tricky. This self-contained awk script keeps the last 20 lines stored. When it gets to 4320101 it prints those stored lines, going back only as far as the nearest blank or undesired line. At that point it switches into printall mode and prints all lines until eventUpdate is encountered; it prints that line too, then goes back to looking for the next match.

awk '
# Keep the current line plus the 20 lines before it in last[0..20]
function store(line,    i) {
    for (i = 1; i <= 20; i++)
        last[i-1] = last[i]
    last[20] = line
}
# Print the stored lines that follow the nearest blank or "Random" line
function purge(    i, stop) {
    stop = -1
    for (i = 19; i >= 0; i--) {
        if (length(last[i]) == 0 || last[i] ~ /Random/) {
            stop = i
            break
        }
    }
    for (i = stop + 1; i <= 20; i++)
        print last[i]
}
{
    store($0)
    if (/4320101/) {
        purge()
        printall = 1
        next
    }
    if (printall == 1) {
        print
        if (/eventUpdate/)
            printall = 0
    }
}' test

jaypal singh · Answer 3 · 2013-05-28T01:00:22.127

1

You can try something like this -

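awk '{
    a[NR] = $0                      # keep the whole file in memory
}

END {
    for (i = 1; i <= NR; i++) {
        if (a[i] ~ /4320101/) {
            # print up to 20 lines before the match, then the match itself
            for (j = (i > 20 ? i - 20 : 1); j <= i; j++)
                print a[j]
            # keep printing until the closing tag has been printed
            for (j = i + 1; j <= NR && a[j-1] !~ /<\/eventUpdate>/; j++)
                print a[j]
        }
    }
}' file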

edited May 28 '13 at 01:00

answered May 22 '13 at 17:37

score 1 · Answer 4 · answered May 22 '13 at 21:10

Let's see if I understand your requirements:

You have two strings, which I'll call KEY and LIMIT. And you want to print:

1. At most 20 lines before a line containing KEY, but stopping if there is a blank line.
2. All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)

The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit key. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.

So let's do it in awk:

#file: extract.awk

# Initialize the circular buffer
BEGIN          { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0  { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) { for (i = count < 20 ? 0 : count - 20; i < count; ++i)
                   print buf[i % 20];
                 count = 0;
               }
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT)
               { print; next; }
# Otherwise, save the line
               { buf[count++ % 20] = $0; }

In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:

awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME

Notes:

I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and there is nowhere in the requirements that regexen are even desired. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
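
A small sketch of the index() versus /regex/ difference described in the first note. The two input lines and the search text a.b are made up for illustration; KEY is passed with -v as in the command above.

```shell
# Two made-up input lines; only the first contains the literal text "a.b"
regex_hits=$(printf '%s\n' 'a.b' 'axb' |
    awk '/a.b/ { n++ } END { print n + 0 }')
literal_hits=$(printf '%s\n' 'a.b' 'axb' |
    awk -v KEY='a.b' 'index($0, KEY) { n++ } END { print n + 0 }')

# In the regex the dot matches any character, so both lines count;
# index() compares plain text, so only the first line counts.
echo "regex: $regex_hits, literal: $literal_hits"
```

This matters for a limit string such as </eventUpdate>, where the regex form needs the slash escaped and the index() form does not.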

potong · Accepted Answer · 2013-05-24T09:31:16.893

1

This might work for you (GNU sed):

sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320102/!D;:c;n;/<\/eventUpdate>/!bc' file

EDIT:

:a;s/\n/&/20;tb;$!{N;ba}; this keeps a window of 20 lines in the pattern space (PS)
:b;/4320102/!D; this moves the above window through the file until the pattern 4320102 is found.
:c;n;/<\/eventUpdate>/!bc the 20 line window is printed and any subsequent line until the pattern <\/eventUpdate> is found.
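
To watch those three stages on a small input, here is a sketch with the window shrunk from 20 newlines to 2. The sample lines are made up, and GNU sed is assumed, as in the answer.

```shell
# Made-up sample: three filler lines, one wanted block, one trailing filler line
input=$(printf '%s\n' 'n1' 'n2' 'n3' '16 Apr 2013 00:14:15' \
    'id="4320102"' '</eventUpdate>' 'n4')

# Same script as above with s/\n/&/2 instead of s/\n/&/20 (GNU sed assumed)
out=$(printf '%s\n' "$input" |
    sed ':a;s/\n/&/2;tb;$!{N;ba};:b;/4320102/!D;:c;n;/<\/eventUpdate>/!bc')
printf '%s\n' "$out"
```

The window has a fixed size, so the two lines before the match are printed even though one of them is filler (n3 here); the script does not stop early at a blank or unwanted line.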

edited May 24 '13 at 09:31

answered May 22 '13 at 21:27

This worked perfectly. I just altered it to check 3 lines, read the user's input with read ID (in this case 4320102), and then use /'$ID'. I have to ask though, can you break down what all of that does exactly? – Zippyduda May 24 '13 at 09:18

Ed Morton · Answer 6 · 2013-05-24T11:30:36.013

0

The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:

awk '
NR==FNR {
    if ($0 ~ /\<4320101\>/) {
        for (i=NR-20;i<NR;i++)
            range[i]
        inRange = 1
    }
    if (inRange) {
        range[NR]
    }
    if ($0 ~ /<\/eventUpdate>/) {
        inRange = 0
    }
    next
}
FNR in range
' file file

edited May 24 '13 at 11:30

answered May 22 '13 at 18:45

Using this I got: awk: cmd. line:9: (FILENAME=test FNR=482) fatal: attempt to use scalar `inRange' as array – Zippyduda May 24 '13 at 08:58
fixed. awk did point you to the line where the error was and tell you what the error was. – Ed Morton May 24 '13 at 11:31
