
I have a text file loaded as a Java String. The text has the structure shown below. I need to parse each section that starts with the word "Clause". There are three clauses in this example, so after parsing I should get three strings, each starting with "Clause" and continuing up to, but not including, the next clause. The following regex gives me something like that, but it has several flaws. First, it includes the word "Clause" from the next section. It also leaves out the last clause. Worst of all, each iteration repeats all the clauses:

for(int i = 0; i < clauseCount - 1; i++) {
    String p2 = "(Clause(.*)Clause)";
    Pattern pattern2 = Pattern.compile(p2, Pattern.DOTALL);
    Matcher matcher2 = pattern2.matcher(extractedText);
    if(matcher2.find()){
         System.out.println("Matched: " + matcher2.group());
    }
}

Here is the sample text with three clauses. There are multiple files, though, and the number of clauses differs from file to file. Could you please help? I'd appreciate your feedback.

Title goes here

there is some text here:

Clause 1. In the following:

here is some text as well. The text that follows may include the name clause one or more times in the text here.

Clause 2. more text here (The text that follows may also include the name clause one or more times inside.):

(1) some text here;

(2) some text here;

(3) some text here;

Clause 3. text for new clause here. The text that follows may or may not include the name clause one or more times inside.:

(1) some text here;

(2) some text here;

(3) more some text here;

(4) some text here;

(5) and numbered text can go on;

(6) and may refer to other numbers like so: (3) and (4).

Notified on (some date here)

(and here is a signature)

Javad
  • I don't understand what your output is. Please add what you are expecting to get from your example. – Oleg Oct 17 '17 at 02:08
  • Take a look at this to iterate your matches https://stackoverflow.com/questions/16817031/how-to-iterate-over-regex-expression – Martin Spamer Oct 17 '17 at 02:09
  • Please be aware that `Matcher.group()` returns the entire match rather than any single capturing group, so it includes the word *Clause* at the beginning. – Alex Oct 17 '17 at 02:30

1 Answer


One way to match from the start of a clause to the beginning of the next clause, while not consuming the start of that next clause, is to use a negative lookahead. Consider matching with the following pattern:

Clause\s*[0-9]+\.((?!Clause\s+[0-9]+\.).)*

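This says to match Clause, a number, and a dot, followed by anything, one character at a time, so long as what immediately follows is not Clause followed by a number and a dot. Compile it with Pattern.DOTALL so that the dot also matches line breaks, since a clause usually spans several lines.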

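import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Clause 1. Stuff is a Clause here\nClause 2. More Clause stuff is here.";
String pattern = "Clause\\s*[0-9]+\\.((?!Clause\\s+[0-9]+\\.).)*";
// DOTALL lets '.' match line breaks, so one clause can span several lines.
Pattern r = Pattern.compile(pattern, Pattern.DOTALL);
Matcher m = r.matcher(input);

// Each find() call advances to the next clause; group(0) is the whole match.
while (m.find()) {
    System.out.println("Found value: " + m.group(0));
}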

Output:

Found value: Clause 1. Stuff is a Clause here
Found value: Clause 2. More Clause stuff is here.
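
If you need the clauses as separate strings rather than printed lines, the same loop can fill a list. The following is only a sketch: the class name ClauseSplitter and the helper splitClauses are made up for illustration, and the sample text is a shortened version of the one in the question. Each match keeps the whitespace that precedes the next clause header, so the sketch trims it. Note also that the last clause runs to the end of the input, so any trailing text (such as the "Notified on" line in the question) would stay attached to it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClauseSplitter {
    // Same pattern as above: a clause header, then any character that
    // does not begin the next "Clause <number>." header.
    private static final Pattern CLAUSE = Pattern.compile(
            "Clause\\s*[0-9]+\\.((?!Clause\\s+[0-9]+\\.).)*", Pattern.DOTALL);

    // Hypothetical helper: returns each clause as its own trimmed string.
    static List<String> splitClauses(String text) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(text);
        while (m.find()) {
            clauses.add(m.group().trim());
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the sample text in the question.
        String text = "Title goes here\n\n"
                + "Clause 1. In the following:\n\nthe word Clause may appear here.\n\n"
                + "Clause 2. more text here:\n\n(1) some text here;\n\n"
                + "Clause 3. text for new clause here:\n\n(1) some text here;\n";
        List<String> clauses = splitClauses(text);
        System.out.println(clauses.size() + " clauses found");
        for (String clause : clauses) {
            System.out.println("---\n" + clause);
        }
    }
}
```

Because the list is built from however many matches find() produces, there is no need to know the clause count in advance.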

Demo here:

Rextester
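
As a variation, and not what the pattern above does, you could use a lazy quantifier that stops at a positive lookahead for the next header or the end of the input. This is a sketch under the same assumptions as the answer (headers look like "Clause", whitespace, digits, a dot); the class name LazyClauseSplitter and the method splitClauses are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyClauseSplitter {
    // Clause header, then as little text as possible, stopping right before
    // the next "Clause <number>." header or at the end of the input (\z).
    private static final Pattern CLAUSE = Pattern.compile(
            "Clause\\s+[0-9]+\\..*?(?=Clause\\s+[0-9]+\\.|\\z)", Pattern.DOTALL);

    // Hypothetical helper: collects every clause as a trimmed string.
    static List<String> splitClauses(String text) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(text);
        while (m.find()) {
            clauses.add(m.group().trim());
        }
        return clauses;
    }

    public static void main(String[] args) {
        String input = "Clause 1. Stuff is a Clause here\nClause 2. More Clause stuff is here.";
        for (String clause : splitClauses(input)) {
            System.out.println("Found value: " + clause);
        }
    }
}
```

Both forms rely on the lookahead to leave the next header unconsumed, so the following find() call can start on it.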

Tim Biegeleisen
  • Hi Tim, I appreciate your help. Your answer is very close, but it didn't work just yet. It is entirely my fault, because I neglected to mention that the word 'clause' may also be repeated within the text of each clause, as in this example: Clause 1. some text here followed by the word "clause" and more text follows and then "clause" again in the same section. Then another section starts with the word "Clause". I have updated my example in the original post for a better explanation. Could you please reconsider this case and advise? – Javad Oct 17 '17 at 02:42
  • Tim, this worked like a charm! Many thanks for your help!! – Javad Oct 17 '17 at 03:03