Extracting pattern which does not necessarily repeat

Question

I am working with ANSI 835 plain text files and want to capture all data in segments which start with “BPR” and end with “TRN”, including those markers. Each file is a single line; within that line the segment may occur more than once, but does not always. I am running the process on multiple files at a time, and ideally I would be able to record the name of the file in which each segment occurs. Here is what I have so far, based on an answer to another question:

#!/bin/sed -nf
/BPR.*TRN/ {
   s/.*\(BPR.*TRN\).*/\1/p
   d
 }
 /from/ {
     : next
     N
     /BPR/ {
        s/^[^\n]*\(BPR.*TRN\)[^\n]*/\1/p
        d
      }
      $! b next
}

I run all files I have through this and write the results to a file which looks like this:

BPR*I*393.46*C*ACH*CCP*01*011900445*DA*0000009046*1066033492**01*071923909*DA*7234692932*20150120~TRN
BPR*I*1611.07*C*ACH*CCP*01*031100209*DA*0000009108*1066033492**01*071923909*DA*7234692932*20150122~TRN
BPR*I*1415.25*C*CHK************20150108~TRN
BPR*H*0*C*NON************20150113~TRN
BPR*I*127.13*C*CHK************20150114~TRN
BPR*I*22431.28*C*ACH*CCP*01*071000152*DA*99643*1361236610**01*071923909*DA*7234692932*20150112~TRN
BPR*I*182.62*C*ACH*CCP*01*071000152*DA*99643*1361236610**01*071923909*DA*7234692932*20150115~TRN

Ideally each line would be prepended with the file name like this:

IDI.Aetna.011415.64539531.rmt:BPR*I*393.46*C*ACH*CCP*01*011900445*DA*0000009046*1066033492**01*071923909*DA*7234692932*20150120~TRN
IDI.BCBSIL.010915.6434438.rmt:BPR*I*1611.07*C*ACH*CCP*01*031100209*DA*0000009108*1066033492**01*071923909*DA*7234692932*20150122~TRN
IDI.CIGNA.010215.64058847.rmt:BPR*I*1415.25*C*CHK************20150108~TRN
IDI.GLDRULE.011715.646719.rmt:BPR*H*0*C*NON************20150113~TRN
IDI.MCREIN.011915.6471442.rmt:BPR*I*127.13*C*CHK************20150114~TRN
IDI.UHC.011915.64714417.rmt:BPR*I*22431.28*C*ACH*CCP*01*071000152*DA*99643*1361236610**01*071923909*DA*7234692932*20150112~TRN
IDI.UHC.011915.64714417.rmt:BPR*I*182.62*C*ACH*CCP*01*071000152*DA*99643*1361236610**01*071923909*DA*7234692932*20150115~TRN

The last two lines would be an example of a file where the segment pattern repeats.

Again, prepending each line with the file name is ideal. What I really need is to be able to process a given single-line file which has the “BPR…TRN” segment repeating and write all segments in that file to my output file.

Would you show some sample input? In particular, the question states "A given file is a single line" yet your sample code goes to lengths to remove newline characters. Also, the sample code looks for lines containing `from`, yet your description makes no mention of why `from` is important. Some sample input might help clarify. – John1024, Jan 22 '15 at 00:59
sed is 100% the wrong tool for this job, so throw that sed script away (most of the constructs it uses became obsolete in the mid-1970s when awk was invented) and start again by posting some sample input and expected output. – Ed Morton, Jan 22 '15 at 01:21
Can you use COBOL for this? I think that language is popular for this problem domain. – John Zwinck, Jan 22 '15 at 01:58

Danny Daglas · Answer 1 · 2015-01-22T02:15:26.833

Try with awk:

awk '
    /BPR/ { sub(".*BPR","BPR") }
    /TRN/ { sub("TRN.*","TRN") }
    /BPR/,/TRN/ { print FILENAME ":" $0 }
' *.rmt

edited Jan 22 '15 at 02:15

answered Jan 22 '15 at 02:03


awk does prepend the file name. It still doesn't write any segments past the first. A given file is a single line with no CRLF. Additional comments above. – rcfrank Jan 22 '15 at 14:32
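
For a single-line file, one possible approach is to loop over the record and peel off each segment in turn rather than relying on a range pattern, since greedy `sub()` calls leave at most one segment per line. The following is only a sketch under some assumptions: the function name `extract_segments` is made up for illustration, the literal strings "BPR" and "TRN" are assumed not to appear elsewhere in the data (for example inside a payer name), and each "BPR" is assumed to be followed by a "TRN" on the same line.

```shell
#!/bin/sh
# Hypothetical helper (name chosen for illustration): print every BPR...TRN
# segment found in the given files, one per output line, prefixed by file name.
extract_segments() {
    awk '
    {
        rest = $0
        # Repeatedly find the next "BPR", then the first "TRN" after it.
        while ((start = index(rest, "BPR")) > 0) {
            rest = substr(rest, start)      # drop everything before "BPR"
            stop = index(rest, "TRN")       # first "TRN" at or after that point
            if (stop == 0)
                break                       # unterminated segment: stop here
            # stop + 2 is the position of the final "N" in "TRN"
            print FILENAME ":" substr(rest, 1, stop + 2)
            rest = substr(rest, stop + 3)   # continue after this segment
        }
    }' "$@"
}

# Example invocation (assumes the .rmt files are in the current directory):
# extract_segments *.rmt > segments.txt
```

Because `index()` does plain string matching, the shortest span from each "BPR" to the next "TRN" is taken, which is what keeps repeated segments separate. If false matches turn out to be a problem, searching for "BPR*" and "~TRN" instead may be more robust, assuming "*" and "~" are the element and segment delimiters as in the output shown in the question.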
