
I have a common enough problem: using a PowerShell regex to read multi-line records. I've read the threads asking similar questions but can't quite get the solutions to work in my case.

My file consists of multi-line records of variable length. The records I am interested in start with 01 or 02 followed by V or M. A record ends when another record begins or when a batch record starting with '50' is found. The first three characters of each line identify the record.

i.e. 01V (start of record, content follows here) 01

I'm trying to read the individual records with a regex by identifying the start and the end.

What I have at the moment is based off this answer: Match everything between two words in Powershell

#Read the file as a single string
$FilePath = "blaablaablaa"
$TestFile = get-content $FilePath | Out-String 

#(?<= Assert that this matches before the current position (lookbehind)
# 0(1|2)(V|M) if the record is 01V or 01M or 02V or 02M 
# ) End assertion 
# .+? Match any number of characters, few as possible
# (?= Until a record starting with 70 is found  
# ) End of look ahead
$regex = [regex] '(?is)(?<=0(1|2)(V|M)).+?(?=70)'
echo $TestFile |  select-string -Pattern $regex 

The above works on single-line strings if I remove the pipe to Out-String, but with the Out-String pipe it returns the entire file. I'm guessing I'm not handling the \n characters correctly.

Any advice? The input file looks roughly like this:

00 date
01Mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01 01 01 01=0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01=5xxxxxxxxxxxxxxxxxxxxxxxxxxx
01Mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01 01 01 01=0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01=9xxxxxxxxxxxxxxxxxxxxxxxxxxx
50 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxx
01Vxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$B 0xxxxxxxxxxxxxxxxxxxx
01$0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$5xxxxxxxxxxxxxxxxxxxxxxxxxxx
50 xxxxxxxxxxxx BatchTotal
90 xxxxxxxxxxxx FILETotal

The required output splits the file into individual records, which are delimited by a '50' line, a '90' line, or the start of another record. For example, this is the final record:

01Vxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$B 0xxxxxxxxxxxxxxxxxxxx
01$0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$5xxxxxxxxxxxxxxxxxxxxxxxxxxx

user3046742
  • Could you provide an example along with expected output? – Avinash Raj Jan 14 '15 at 12:21
  • Your description and example don't match up. There's a `01Mxxx` line that you describe as wanting to match but exclude from your desired example. You speak of lines starting with `50` ending the match, your code's comment says it'll search until it finds `70`, yet your example ends when it encounters `71`. Also, the example doesn't contain `01^` as the 2nd line; it goes to `01$` immediately. – asontu Jan 14 '15 at 12:48
  • Sorry, I edited the example to be more clear. Basically if a line starts with 01V or 01M, I want to capture everything until another line starting with 01V or 01M or 50 or 90 is encountered. – user3046742 Jan 14 '15 at 13:05

2 Answers


Assuming (by your description) you also want to match the part from 01M until the next 01M, and then that one separately until the 50, this would do the trick:

(?mis)^0[12][VM](?:[^\n]|\n(?!0[12][VM]|50|90))+

Explanation: after matching 0, then 1 or 2, then V or M, the part in the (?:...) is this:

[^\n]|\n(?!0[12][VM]|50|90)

Which means:

match any character that isn't a new-line

OR

a newline that is not followed (?!...) by either the beginning of a new record or 50 or 90.

online Regex101 demo
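
For use from PowerShell rather than Regex101, here is a minimal sketch of how this pattern might be applied. It uses only the inline options `(?mis)`, since .NET does not accept `g` as an inline option and `[regex]::Matches` already returns every match. The sample lines are shortened stand-ins for the data in the question, joined with LF so the example does not depend on the script file's own line endings. If the real file uses CRLF, each match would end in a trailing carriage return, which the `-replace` step strips.

```powershell
# Shortened sample data (an assumption for illustration), joined with LF.
$text = @(
    '00 date'
    '01Mxxxx'
    '01 01 01 01=0xxxx'
    '01=5xxxx'
    '01Mxxxx'
    '01 01 01 01=0xxxx'
    '01=9xxxx'
    '50 xxxx xxxx'
    '01Vxxxx'
    '01$1 xxxx'
    '01$5xxxx'
    '50 xxxx BatchTotal'
    '90 xxxx FILETotal'
) -join "`n"

# Start at a 01/02 + V/M line, then keep consuming characters and any
# line break that is NOT followed by a new record start, 50 or 90.
$pattern = '(?mis)^0[12][VM](?:[^\n]|\n(?!0[12][VM]|50|90))+'

# Collect the matched text; drop a trailing CR in case the input was CRLF.
$records = [regex]::Matches($text, $pattern) |
    ForEach-Object { $_.Value -replace '\r\z', '' }

$records.Count    # 3
$records[-1]      # the 01V record, three lines
```

With a real file, `$text` would come from `Get-Content $FilePath -Raw` instead of the inline array.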

asontu
  • That works perfectly in the Regex101 demo. But when I use it in my PowerShell script, it only returns single lines like "01Mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" instead of the full multi-line record. If I pipe my file through Out-String, it returns the entire file again. It must be something about the \n characters throwing the regex off. – user3046742 Jan 14 '15 at 14:07
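
One likely explanation for both symptoms, offered as a sketch rather than a confirmed diagnosis of the script in question: without `-Raw`, `Get-Content` emits one string per line, so `Select-String` tests each line on its own and can never see a multi-line record. When it is given one big string instead, its default output shows the whole matching input (the `Line` property), not just the matched text, which sits in `Matches`. The short sample strings below are assumptions standing in for the real file.

```powershell
# What Get-Content returns without -Raw: an array with one string per line.
$lines = '00 date', '01Maaa', '01=0bbb', '50 total'

# Roughly what Get-Content -Raw returns for an LF file: one single string.
$text = $lines -join "`n"

$pattern = '(?m)^0[12][VM](?:[^\n]|\n(?!0[12][VM]|50|90))+'

# Case 1: each array element is matched separately, so a match can
# never span more than one line.
$perLine = $lines | Select-String -Pattern $pattern |
    ForEach-Object { $_.Matches[0].Value }

# Case 2: one multi-line string. The result's Line property is the
# entire input, which is what gets displayed by default.
$hit = $text | Select-String -Pattern $pattern -AllMatches
$wholeInput = $hit.Line

# The matched text itself is in the Matches property.
$records = $hit.Matches | ForEach-Object { $_.Value }

$perLine    # 01Maaa
$records    # 01Maaa and 01=0bbb, as one two-line string
```

Calling `[regex]::Matches` on the raw text avoids the distinction altogether, because it returns only the matched text.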

Using your test data:

@'
00 date
01Mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01 01 01 01=0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01=5xxxxxxxxxxxxxxxxxxxxxxxxxxx
01Mxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01 01 01 01=0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01=9xxxxxxxxxxxxxxxxxxxxxxxxxxx
50 xxxxxxxxxxxxx xxxxxxxxxxxxxxxxx
01Vxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$B 0xxxxxxxxxxxxxxxxxxxx
01$0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$5xxxxxxxxxxxxxxxxxxxxxxxxxxx
50 xxxxxxxxxxxx BatchTotal
90 xxxxxxxxxxxx FILETotal
'@ | set-content testfile.txt


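# -Raw (PowerShell 3.0+) reads the whole file as one string
$Text = Get-Content ./testfile.txt -Raw

# The line break inside the here-string is part of the pattern:
# capture from 01M/01V up to a line break followed by 50 or 90
$regex = @'
(?ms)(01(?:M|V).+?)
(?:5|9)0.+?
'@


$Records = 
[regex]::Matches($Text,$regex) |
foreach {$_.groups[1].value}

# Show the last record
$Records[-1]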

01Vxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$A xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$B 0xxxxxxxxxxxxxxxxxxxx
01$0xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
01$5xxxxxxxxxxxxxxxxxxxxxxxxxxx
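
Note that the pattern above stops only at a line starting with 50 or 90, so the two consecutive 01M records in the sample come back as a single match. A minimal sketch of a variant that also stops at the start of the next record is shown below. It replaces the literal line break with a lookahead on `\r?\n`, so it should tolerate both LF and CRLF files. It assumes every record is followed by a terminating line, as in the sample, and the shortened data is illustrative only.

```powershell
# Shortened sample data (an assumption), joined with CRLF on purpose
# to show that the terminating line break is not part of the match.
$text = @(
    '00 date'
    '01Mxxxx'
    '01=5xxxx'
    '01Mxxxx'
    '01=9xxxx'
    '50 xxxx xxxx'
    '01Vxxxx'
    '01$1 xxxx'
    '01$5xxxx'
    '50 xxxx BatchTotal'
    '90 xxxx FILETotal'
) -join "`r`n"

# From a 01/02 + V/M line, lazily take everything up to (but not
# including) a line break followed by a new record start, 50 or 90.
$pattern = '(?ms)^0[12][VM].*?(?=\r?\n(?:0[12][VM]|50|90))'

$records = [regex]::Matches($text, $pattern) | ForEach-Object { $_.Value }

$records.Count    # 3
$records[-1]      # the 01V record
```

Line breaks inside a record are kept as they appear in the file; only the break before the terminating line is excluded.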
mjolinor