
I have a file in the format shown below, and I'm trying to count how many complete occurrences of that format it contains.

Complete format: by a complete format I mean a block that starts with a line like PATTERN 123456789.000 10.10.10.10 1 10 0 -, is followed by any text, and ends with a line like PATTERN 9876543210.000 1.1.1.1 1 10 9 any_string2

Scenario1: "2 complete Formats"

PATTERN 123456789.000 10.10.10.10 1 10 0 -    #Data-set1 starts
can be anything here of any number of lines
and not concerned 
PATTERN 123456789.000 10.10.10.10 1 10 9 any_string1   ##Data-set1 ends
PATTERN 9876543210.000 1.1.1.1 1 10 0 -         #Data-set2 starts
can be anything here of any number of lines
and not concerned 
PATTERN 9876543210.000 1.1.1.1 1 10 9 any_string2     #Data-set2 ends

Scenario2: 1 Complete format and 1 error/incorrect format

PATTERN 123456789.000 10.10.10.10 1 10 0 -    #Data-set1 starts
can be anything here of any number of lines
and not concerned 
PATTERN 123456789.000 10.10.10.10 1 10 9 any_string1   ##Data-set1 ends
#Missing that begin line PATTERN for Data-set2
can be anything here of any number of lines
and not concerned 
PATTERN 9876543210.000 1.1.1.1 1 10 9 any_string2     #Data-set2 ends

Scenario3: No complete format (neither data-set was created successfully)

PATTERN 123456789.000 10.10.10.10 1 10 0 -    #Data-set1 starts
can be anything here of any number of lines
and not concerned 
#Missing end PATTERN line for Data-set1
#Missing that begin line PATTERN for Data-set2
can be anything here of any number of lines
and not concerned 
PATTERN 9876543210.000 1.1.1.1 1 10 9 any_string2     #Data-set2 ends

I tried this regex, but it only works when a single data-set is present in the file.

PATTERN\s+\d+.\d+\s+\d+.\d+.\d+.\d+\s+\d+\s+\d+\s+\d+\s+-(.*)PATTERN\s+\d+.\d+\s+\d+.\d+.\d+.\d+\s+\d+\s+\d+\s+\d+\s+.*
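One likely reason the attempt above stops working with several data-sets: under re.DOTALL the greedy (.*) runs on to the last PATTERN line in the file, so everything collapses into a single match. Below is a minimal sketch of a variant that uses a lazy body, escaped dots, and back-references. It assumes that the start and end lines of one data-set carry the same timestamp and IP (as in the scenario listings), that a start line ends in "0 -", and that an end line has "9" in that position. The names SP, NUM, IP, BLOCK and count_complete are made up for illustration.

```python
import re

SP = r"[ \t]+"                  # separator: spaces/tabs only, never a newline
NUM = r"\d+\.\d+"               # timestamp such as 123456789.000 (dot escaped)
IP = r"\d+\.\d+\.\d+\.\d+"      # dotted IPv4-style address (dots escaped)

BLOCK = re.compile(
    # start line: PATTERN <ts> <ip> <n> <n> 0 -
    rf"^PATTERN{SP}(?P<ts>{NUM}){SP}(?P<ip>{IP}){SP}\d+{SP}\d+{SP}0{SP}-[^\n]*\n"
    # body: lazy, so it stops at the first end line that fits
    r"(?P<body>.*?)"
    # end line: same <ts> and <ip> (assumption), then <n> <n> 9 <any_string>
    rf"^PATTERN{SP}(?P=ts){SP}(?P=ip){SP}\d+{SP}\d+{SP}9{SP}\S+[^\n]*$",
    re.MULTILINE | re.DOTALL,
)


def count_complete(text):
    """Count start/end pairs that agree on timestamp and IP."""
    return sum(1 for _ in BLOCK.finditer(text))


if __name__ == "__main__":
    sample = "\n".join([
        "PATTERN 123456789.000 10.10.10.10 1 10 0 -",
        "can be anything here of any number of lines",
        "PATTERN 123456789.000 10.10.10.10 1 10 9 any_string1",
        "PATTERN 9876543210.000 1.1.1.1 1 10 0 -",
        "and not concerned",
        "PATTERN 9876543210.000 1.1.1.1 1 10 9 any_string2",
    ])
    print(count_complete(sample))
```

This only counts complete data-sets. A start line with no matching end line is silently skipped, so it cannot report how many data-sets are broken.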


Code Generated from regex101.com

import re

regex = r"PATTERN\s+\d+.\d+\s+\d+.\d+.\d+.\d+\s+\d+\s+\d+\s+\d+\s+-(.*)PATTERN\s+\d+.\d+\s+\d+.\d+.\d+.\d+\s+\d+\s+\d+\s+\d+\s+.*"

test_str = ("PATTERN 123456789.000 10.10.10.10 1 10 0 -    \n"
    "a-12345 any random stuffs here\n"
    "c-34444 again any randomstuffs here\n"
    "PATTERN 123456789.000 10.10.10.10 1 10 9 any_string1      ")

matches = re.finditer(regex, test_str, re.MULTILINE | re.DOTALL)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
StackGuru
  • Show how you're using the regexp in the Python script. – Barmar Sep 06 '19 at 07:17
  • You have to escape the `.` as `\.`. The unescaped version will still match, but `.` matches ANY character, not just a literal dot! – csabinho Sep 06 '19 at 07:18
  • I tried that pattern which only matches one format. Above pasted the code generated from regex101.com – StackGuru Sep 06 '19 at 07:26
  • You can't use a regexp to search for things that *don't* match the pattern, unless you can define an alternate pattern that matches the errors. – Barmar Sep 06 '19 at 08:21
  • Check out this gist mate. It is not regex solution, but it is what you need (refactoring needed of course): https://gist.github.com/greedyf00x/171c91b3a32fb5278f0e5f2f49d67377 – user5214530 Sep 06 '19 at 08:38
  • I thought I could optimize it by using a regex in this fashion. So can't it be achieved? – StackGuru Sep 06 '19 at 14:23
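
Following the comments above, the incomplete data-sets can be counted too if the file is scanned line by line instead of with one big regex. This is only a rough sketch, not the code from the linked gist. It assumes that a start line ends in "0 -", that an end line has "9" in the same column, and that the start and end of one data-set share the same timestamp and IP. It also assumes that an unclosed start and an orphan end each count as one error. The names LINE, classify and count_blocks are hypothetical.

```python
import re

# One PATTERN line: timestamp, IP, two numbers, a flag (0 or 9), then a tail.
LINE = re.compile(
    r"^PATTERN[ \t]+(\d+\.\d+)[ \t]+(\d+\.\d+\.\d+\.\d+)"
    r"[ \t]+\d+[ \t]+\d+[ \t]+(\d+)[ \t]+(\S+)"
)


def classify(line):
    """Return ("start" | "end", (timestamp, ip)) or None for any other line."""
    m = LINE.match(line)
    if m is None:
        return None
    ts, ip, flag, tail = m.groups()
    if flag == "0" and tail == "-":
        return "start", (ts, ip)
    if flag == "9":
        return "end", (ts, ip)
    return None


def count_blocks(lines):
    """Return (complete, broken) counts for an iterable of lines."""
    complete = broken = 0
    open_key = None                    # (ts, ip) of the currently open start
    for line in lines:
        kind = classify(line)
        if kind is None:
            continue                   # body text, not a PATTERN marker
        tag, key = kind
        if tag == "start":
            if open_key is not None:
                broken += 1            # previous start was never closed
            open_key = key
        else:
            if open_key == key:
                complete += 1          # matching start/end pair
            else:
                if open_key is not None:
                    broken += 1        # open start with no matching end
                broken += 1            # end line with no matching start
            open_key = None
    if open_key is not None:
        broken += 1                    # file ended inside an open data-set
    return complete, broken
```

A file object can be passed directly, for example `with open(path) as fh: print(count_blocks(fh))`, where `path` stands for whatever file is being checked.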

0 Answers