I have the contents of a text file in a Java String. The text has the structure shown below. I need to extract each section that starts with the word "Clause". There are three clauses in this example, so after parsing I should get three strings, each starting with "Clause" and continuing up to, but not including, the next clause. The following regex gives me something like that, but it has several flaws. First, it includes the word "Clause" from the next section. Second, it leaves out the last clause. Worst of all, every iteration returns all the clauses again:
    for (int i = 0; i < clauseCount - 1; i++) {
        String p2 = "(Clause(.*)Clause)";
        Pattern pattern2 = Pattern.compile(p2, Pattern.DOTALL);
        Matcher matcher2 = pattern2.matcher(extractedText);
        if (matcher2.find()) {
            System.out.println("Matched: " + matcher2.group());
        }
    }
Here is the sample text with three clauses. There are multiple files, though, and the number of clauses differs from file to file. Could you please help? I'd appreciate your feedback.
Title goes here
there is some text here:
Clause 1. In the following:
here is some text as well. The text that follows may include the name clause one or more times in the text here.
Clause 2. more text here (The text that follows may also include the name clause one or more times inside.):
(1) some text here;
(2) some text here;
(3) some text here;
Clause 3. text for new clause here. The text that follows may or may not include the name clause one or more times inside.:
(1) some text here;
(2) some text here;
(3) more some text here;
(4) some text here;
(5) and numbered text can go on;
(6) and may refer to other numbers like so: (3) and (4).
Notified on (some date here)
(and here is a signature)
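
For reference, here is a minimal sketch of one possible approach, not a definitive solution. The greedy `.*` combined with `DOTALL` runs from the first "Clause" to the last one, and creating a new `Matcher` inside the loop restarts the search from the beginning every time, which would explain the repeated output. The sketch instead uses a lazy quantifier with a lookahead and calls `find()` repeatedly on a single matcher, so no clause count is needed. It assumes that every clause heading has the form "Clause <number>." and sits at the start of a line, as in the sample; the names `ClauseSplitter` and `splitClauses` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ClauseSplitter {

    // Assumption: a heading looks like "Clause <number>." at the start of a line.
    // The lazy ".*?" stops as soon as the lookahead sees the next heading or the
    // end of the input (\z), so the next heading is not consumed.
    private static final Pattern CLAUSE = Pattern.compile(
            "^Clause\\s+\\d+\\..*?(?=^Clause\\s+\\d+\\.|\\z)",
            Pattern.DOTALL | Pattern.MULTILINE);

    // Returns one string per clause, in document order.
    static List<String> splitClauses(String text) {
        List<String> clauses = new ArrayList<>();
        Matcher matcher = CLAUSE.matcher(text); // one matcher for the whole text
        while (matcher.find()) {                // each call resumes after the previous match
            clauses.add(matcher.group().trim());
        }
        return clauses;
    }

    public static void main(String[] args) {
        String extractedText = "Title goes here\n"
                + "Clause 1. In the following:\n"
                + "text that mentions a clause in passing.\n"
                + "Clause 2. more text here:\n"
                + "(1) some text here;\n"
                + "Clause 3. text for new clause here:\n"
                + "(1) some text here;\n";
        for (String clause : splitClauses(extractedText)) {
            System.out.println("Matched: " + clause);
            System.out.println("-----");
        }
    }
}
```

Because the last clause runs to the end of the input, it would also pick up trailing lines such as "Notified on ..." and the signature; if those should be excluded, the lookahead would need another alternative for them. A body line that happens to begin with "Clause <number>." would also be treated as a new heading under this assumption.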