I'm trying to build a log file summarisation tool for an application that writes many near-duplicate entries, which differ only in a suffix indicating the point of execution.

Here's a genericized version: a text file (infile_grocery.txt) with these contents:

milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

What I'm hoping to get is:

milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple

The command line I've currently cooked up is:

sed -rn "{H;x;s|^(.+) fruit ([^\n]+)\n(.*)\1 fruit (.+)$|\1 fruit \2, \4|;x}; ${x;s/^\n//;p}" infile_grocery.txt

But the results I'm getting are:

milk skim fruit apple banana, mango, strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

I'm discarding input somehow. Any gurus with a better idea how to structure this?

mthespian
    How are duplicate lines identified? –  Aug 13 '12 at 12:29
  • The suffix is the words after the first three. Also, the join should take place only when lines with the same prefix are consecutive, and the suffixes should be joined with a comma. Right? – nshy Aug 13 '12 at 12:36
  • Not necessarily words after the first three. The divider is the word "fruit". Anything before it needs to match in consecutive lines for them to be eligible for modification. Anything after it should be joined using a comma to the end of the previous line. – mthespian Aug 13 '12 at 17:29
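
The rule in the last comment (split at the word "fruit", merge only consecutive lines with an identical prefix, join the suffixes with a comma) can be sketched in awk. This is only an illustration, not one of the posted answers: the function name merge_by_divider is hypothetical, and it assumes the divider always appears as a standalone word with a single space on each side.

```shell
# Hypothetical helper: reads lines on stdin and merges consecutive lines
# that share the same text before the standalone word "fruit".
merge_by_divider() {
    awk '
    {
        # Position of the divider; 0 means this line has no divider.
        pos = index($0, " fruit ")
        prefix = (pos ? substr($0, 1, pos - 1) : "")
        if (pos && had_divider && prefix == prev) {
            # Same prefix as the previous line: append only the suffix.
            printf ", %s", substr($0, pos + 7)
        } else {
            # New prefix: close the previous output line and print this one whole.
            printf "%s%s", (NR > 1 ? "\n" : ""), $0
        }
        had_divider = (pos > 0)
        prev = prefix
    }
    END { if (NR) print "" }'
}

# Sample input taken from the question.
merge_by_divider <<'EOF'
milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple
EOF
```

Matching " fruit " with its surrounding spaces means a word such as grapefruit is not mistaken for the divider, which a plain split on the string fruit would do.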

3 Answers

4

This is an awk solution:

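awk -F fruit '
# Same prefix as the previous line: append only the suffix.
$1 == x {
    printf ",%s", $2
    next
}
# New prefix: end the previous output line (if any) and start a new one.
{
    x = $1
    printf "%s%s", (NR > 1 ? "\n" : ""), $0
}
END {
    if (NR) print ""
}' input.txt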

Output

milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple
kev
0
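opref=""
nline=""
while IFS= read -r line; do
  # Prefix: everything up to and including "fruit"; item: whatever follows it.
  pref=$(printf '%s\n' "$line" | sed 's/\(.*fruit\).*/\1/')
  item=$(printf '%s\n' "$line" | sed 's/.*fruit[[:space:]]*\(.*\)/\1/')
  if [ "$opref" = "$pref" ]; then
    nline="$nline, $item"
  else
    [ -n "$nline" ] && printf '%s\n' "$nline"
    nline=$line
  fi
  opref=$pref
done < input_file
# Flush the last accumulated line, which the loop itself never prints.
[ -n "$nline" ] && printf '%s\n' "$nline"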
perreal
0

This might work for you (GNU sed):

sed ':a;$!N;s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/;ta;P;D' file

Explanation:

  • :a defines a label to loop back to.
  • $!N appends a newline followed by the next input line to the pattern space, except on the last line.
  • s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/ captures everything up to the newline as back reference 1 (\1). Within that, everything from the start of the line up to and including the word fruit is captured as back reference 2 (\2). After the newline, the second line must begin with the same text as \2; whatever follows it is captured as back reference 3 (\3). The whole match is replaced by \1, a comma, and \3 (which begins with the space that followed fruit).
  • ta branches back to the label :a if the substitution succeeded.
  • P, reached only when the substitution failed, prints the pattern space up to and including the first newline.
  • D then deletes the pattern space up to and including the first newline and, if anything remains, restarts the cycle without reading new input.
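
The steps listed above can also be laid out as a multi-line script with one command per line. This is a sketch that assumes GNU sed, as the answer does; the wrapper function name join_fruit is hypothetical and exists only so the script can be fed sample input.

```shell
# Hypothetical wrapper around the one-liner, one sed command per line.
join_fruit() {
    sed '
    :a
    # Append the next input line unless this is the last one.
    $!N
    # If both lines share the same text up to "fruit", fold the second suffix in.
    s/^\(\(.*fruit\).*\)\n\2\(.*\)/\1,\3/
    # Substitution succeeded: loop to try the following line as well.
    ta
    # Otherwise print the first line and restart with what remains.
    P
    D
    ' "$@"
}

# Two mergeable lines followed by one with a different prefix.
printf '%s\n' \
    'milk skim fruit apple banana' \
    'milk skim fruit orange' \
    'milk whole fruit pineapple' | join_fruit
```

With this input the first two lines are joined into one and the third is printed unchanged.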
potong
  • Thanks muchly! Any chance you'd feel up to providing a walkthrough of the command to help me understand a better approach for my next problem? – mthespian Aug 13 '12 at 19:12
  • On reflection, the first two `.*`s in the substitution regexp should be replaced by `[^\n]*`. This would be more efficient because it prevents the regexp engine from double backtracking. – potong Aug 15 '12 at 06:17
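
The refinement suggested in the last comment might look like the following. This is a sketch rather than the answerer's own final command: it assumes GNU sed, which reads \n inside a bracket expression as a newline, and the function name join_fruit_tight is hypothetical.

```shell
# Hypothetical wrapper: same loop as the answer, but the first two .* are
# replaced by [^\n]* so those parts of the match cannot cross a newline.
join_fruit_tight() {
    sed ':a;$!N;s/^\(\([^\n]*fruit\)[^\n]*\)\n\2\(.*\)/\1,\3/;ta;P;D' "$@"
}

# A subset of the sample input from the question.
join_fruit_tight <<'EOF'
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
EOF
```

On input like the sample, the merged result should be the same as with the original expression; the change only narrows what the regexp engine has to try.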