
I need to split a large file into smaller chunks based on the last occurrence of a pattern in the file, using a shell script. For example:

Sample.txt (the file is sorted on the third field, which is the field the pattern is searched in):

NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
SOUTHWEST|3456|01134|GDFSG|EWRER|

Searching for pattern 1 = "00003" must produce an output file sample_00003.txt containing:

NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|

Searching for pattern 2 = "00112" must produce an output file sample_00112.txt containing:

EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|

I used

awk -F'|' -v pattern='00003' '$3 ~ pattern' big_file > smallfile

and grep commands, but they were very time-consuming since the file is 300+ MB in size.

  • What do you mean by "the last occurrence of the pattern"? – codeforester Jan 26 '17 at 01:07
  • Last time the pattern matches in the file. I.e. pattern "00003" matches up to the 5th line of the sample.txt file in its third field, so the process should split everything up to that 5th line into a separate file. – Katchy Jan 26 '17 at 04:13
  • 1
    In the future, please use the `{}` format tool at the top left of the edit box on highlighted text to format as code/data/output. Good luck. – shellter Jan 26 '17 at 04:23

2 Answers


Not sure if you'll find a faster tool than awk, but here's a variant that fixes your own attempt and also speeds things up a little by using string matching rather than regex matching.

It processes lookup values in a loop, and outputs everything from where the previous iteration left off through the last occurrence of the value at hand to a file named smallfile<n>, where <n> is an index starting with 1.

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 sample values to match, plus dummy value
  chunkFile="smallfile$(( ++ndx ))"
  # awk writes rows fromRow through the last occurrence of val in $3 to
  # chunkFile, and prints the number of the first unprocessed row to stdout,
  # which becomes the next iteration's fromRow.
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }
    { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
  ' big_file)
done

Note that the dummy value | ensures that any rows remaining after the last true match are saved to a chunk file too (since | is the field separator, no field can ever equal it).
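To sanity-check the loop, here is a self-contained run on the question's sample data, recreated in a scratch big_file (using POSIX arithmetic instead of the bash-only ++ndx, so it also works in plain sh):

```shell
# Recreate the sample data (without the stray <br/> markup).
cat > big_file <<'EOF'
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
SOUTHWEST|3456|01134|GDFSG|EWRER|
EOF

ndx=0; fromRow=1
for val in '00003' '00112' '|'; do  # 2 values to match, plus dummy value
  ndx=$((ndx + 1))
  chunkFile="smallfile$ndx"
  fromRow=$(awk -F'|' -v fromRow="$fromRow" -v outFile="$chunkFile" -v val="$val" '
    NR < fromRow { next }
    { if ($3 != val) { if (p) { print NR; exit } } else { p=1 } } { print > outFile }
  ' big_file)
done

wc -l smallfile1 smallfile2 smallfile3
# smallfile1: rows 1-5, smallfile2: rows 6-9, smallfile3: row 10
```

The first chunk ends at the last "00003" row, the second at the last "00112" row, and the dummy | sweeps the trailing SOUTHWEST row into a third file.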


By contrast, moving all the logic into a single awk script should be much faster, because big_file then only has to be read once:

awk -F'|' -v vals='00003|00112' '
  BEGIN { split(vals, val); outFile="smallfile" ++ndx }
  { 
    if ($3 != val[ndx]) { 
      if (p) { p=0; close(outFile); outFile="smallfile" ++ndx } 
    } else { 
      p=1 
    } 
    print > outFile
  }
' big_file
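If the lookup values aren't known up front, they can be derived from the sorted third field itself and fed to the same script. This is a hypothetical variant, and note it behaves slightly differently: because every distinct value is listed, the file is split at every change of the third field, so the leading non-matching rows no longer get lumped into the first chunk.

```shell
# Recreate the sample data in a scratch file.
cat > big_file <<'EOF'
NORTH EAST|0004|00001|Fost|Weaather|
NORTH EAST|0004|00001|Fost|Weaather|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
SOUTH|0003|00003|Haet|Summer|
EAST|0007|00016|uytr|kert|
EAST|0007|00016|uytr|kert|
WEST|0002|00112|WERT|fersg|
WEST|0002|00112|WERT|fersg|
SOUTHWEST|3456|01134|GDFSG|EWRER|
EOF

# Distinct third-field values, in file order (the file is sorted on $3):
vals=$(cut -d'|' -f3 big_file | uniq | paste -sd'|' -)  # 00001|00003|00016|00112|01134

awk -F'|' -v vals="$vals" '
  BEGIN { split(vals, val); outFile="smallfile" ++ndx }
  {
    if ($3 != val[ndx]) {
      if (p) { p=0; close(outFile); outFile="smallfile" ++ndx }
    } else {
      p=1
    }
    print > outFile
  }
' big_file

wc -l smallfile1 smallfile2 smallfile3 smallfile4 smallfile5
```

On the sample data this yields five chunks of 2, 3, 2, 2, and 1 lines, one per distinct value of the third field.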
mklement0

You can try with Perl:

 perl -ne '/00003/ && print' big_file > small_file

and compare its timing with other solutions...

EDIT

Limiting my answer to the tools you haven't already tried... you can also use:

sed -n '/00003/p' big_file > small_file

But I tend to believe perl will be faster. Again... I'd suggest you measure the elapsed time of the different solutions on your own.
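One caveat worth keeping in mind while timing: a bare regex such as /00003/ matches anywhere on the line, not just in the third field, so the perl/sed/grep one-liners can over-match relative to an exact field comparison. A minimal sketch with a made-up record (demo.txt and its contents are hypothetical):

```shell
# Hypothetical record where 00003 occurs in the 2nd field, not the 3rd:
printf 'FOO|00003|99999|x|y|\n' > demo.txt

grep -c '00003' demo.txt                     # regex matches anywhere: count is 1
awk -F'|' '$3 == "00003"' demo.txt | wc -l   # exact field test: count is 0
```

Whether that difference matters depends on the data; here the values only ever appear in field 3, so the regex variants happen to give the same result.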

mauro
  • @mklement0: I guess you tested the performance of these "flawed attempts" before commenting... – mauro Jan 27 '17 at 04:27
  • I think you misunderstood, so let me try to explain it differently: The OP described a problem and also included an _attempt_ at a solution (the `awk` command). That attempt is _technically_ flawed, but more importantly, it is _fundamentally conceptually flawed_ - even if fixed, the attempt would not solve the problem. Your answer contains _technically_ correct commands that are the equivalent of the conceptually flawed attempt and therefore do not solve the OP's problem. – mklement0 Jan 27 '17 at 12:24