I need to slice several TB of log data, and would prefer the speed of the command line. I'll split the file up into chunks before processing, but need to remove some sections.

Here's an example of the format:

uuJ oPz eeOO    109 66  8
uuJ oPz eeOO    48  0   221
uuJ oPz eeOO    9   674 3
kf iiiTti oP    88  909 19
mxmx lo uUui    2   9   771
mxmx lo uUui    577 765 27878456

The gaps between the first 3 alphanumeric strings are spaces. Everything after that is tabs. Lines are separated with \n.

I want to keep only the last line in each group.

If there's only 1 line in a group, it should be kept.

Here's the expected output:

uuJ oPz eeOO    9   674 3
kf iiiTti oP    88  909 19
mxmx lo uUui    577 765 27878456

How can I do this with sed, awk, xargs and friends, or should I just use something higher level like Python?

HappyTimeGopher

3 Answers

awk -F '\t' '
  NR==1   {key=$1}                # initialize the key from the first record
  $1!=key {print line; key=$1}    # key changed: print the last line of the previous group
          {line=$0}               # remember the current line
  END     {print line}            # flush the last line of the final group
' file_in > file_out
glenn jackman
  • With this I get an identical copy of the infile. Note that lines can't be compared like for like as they all contain different numbers after the strings. The numbers associated with the last item in the group are to be kept. – HappyTimeGopher May 14 '12 at 14:55
  • @tripleee Yep the last line is cut off as well. But the format is still incorrect - see the expected output section of the question. – HappyTimeGopher May 14 '12 at 14:58
  • I removed my comment because I thought I was mistaken; but it seems it's correct after all: the last line of output is missing. – tripleee May 14 '12 at 15:02
  • This and the other solution compare the first tab-delimited field. If you don't have tabs in your input after all, you would see the entire input file minus the last line. Are you sure your problem description is correct? – tripleee May 14 '12 at 15:03
  • Thanks @tripleee - I copied and pasted the SO version of my example (which swapped tabs for spaces) for testing. – HappyTimeGopher May 14 '12 at 15:07
  • @HappyTimeGopher, I forgot to print the line at the end. updated. – glenn jackman May 14 '12 at 15:10
  • @glennjackman: `NR==FNR || $1!=key {print line; key=$1}` will work for empty file too. – Prince John Wesley May 14 '12 at 15:17
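
As the last comment points out, the script prints a blank line when the input file is empty, because END runs regardless. A guarded variant (a sketch, not part of the answer above) avoids this:

awk -F '\t' '
  NR==1   {key=$1}
  $1!=key {print line; key=$1}
          {line=$0}
  END     {if (NR) print line}    # print nothing if no records were read
' file_in > file_out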

Try this:

awk 'BEGIN{FS="\t"}                                              # fields are tab-separated
    {if($1!=prevKey) {if (NR > 1) {print lastLine}; prevKey=$1}  # key changed: print the previous group's last line
     lastLine=$0}                                                # remember the current line
    END{print lastLine}'                                         # flush the last group

It saves the last line and prints it only when it notices that the key has changed.
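
A usage sketch (the file names are placeholders): with the program text saved in a file such as keep_last.awk, the script reads a file argument or stdin, so it also works at the end of a pipeline:

    awk -f keep_last.awk file_in > file_out
    some_producer | awk -f keep_last.awk > file_out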

Michał Kosmulski
  • This doesn't produce the expected output as in the question. The numbers in each line are different, and a direct comparison with previous lines will always fail. – HappyTimeGopher May 14 '12 at 15:00
  • Yep, me too. Works perfectly now I've fixed the test file :) – HappyTimeGopher May 14 '12 at 15:07
  • There's a slight bug if the key field can be empty; you can perhaps synthesize something out of this one and @glenn jackman's solution if that's a problem for your scenario. – tripleee May 14 '12 at 15:10
  • @tripleee I changed the condition so it works with empty key, too (checking for first line instead of empty key). – Michał Kosmulski May 14 '12 at 19:33

This might work for you:

 sed ':a;$!N;/^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/;ta;P;D' file
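
A rough reading of how this works (\s and \S are GNU extensions, so GNU sed is assumed; the same script spread across lines with comments):

 sed '
 # loop label
 :a
 # unless on the last line, append the next line to the pattern space
 $!N
 # if both lines start with the same three space-separated fields,
 # replace the pair with just the second line (the first line is dropped)
 /^\(\S*\s\S*\s\S*\)[^\n]*\n\1/s//\1/
 # if the substitution succeeded, branch back to :a to absorb another line
 ta
 # otherwise print up to the first newline ...
 P
 # ... delete it, and restart the cycle with what remains
 D
 ' file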
potong