I am reading a Wikipedia XML file, in which I have to delete anything between double curly braces. For example, given the following string:

String text = "{{Use dmy dates|date=November 2012}} {{Infobox musical artist <!-- See Wikipedia:WikiProject_Musicians --> | name
= Russ Conway | image = | caption = Russ Conway, pictured on the front of his 1959 [[Extended play|EP]] ''More Party Pops''. | image_size = | background = non_vocal_instrumentalist | birth_name = Trevor Herbert Stanford | alias = | birth_date = {{birth date|1925|09|2|df=y}} | birth_place = [[Bristol]], [[England]], UK | death_date = {{death date and age|2000|11|16|1925|09|02|df=y}} | death_place = [[Eastbourne]], [[Sussex]], England, UK | origin = | instrument = [[Piano]] | genre = | occupation = [[Musician]] | years_active = | label = EMI (Columbia), Pye, MusicMedia, Churchill | associated_acts = | website = | notable_instruments = }}";

It should be replaced with an empty string. Note that the example spans multiple lines and contains nested {{...}} blocks.

I am using the following code:

Pattern p1 = Pattern.compile(".*\\({\\{.+\\}\\}).*", Pattern.DOTALL);
Matcher m1 = p1.matcher(text);

while(m1.find()){

String text1 = text.replaceAll(m1.group(1), "");
}

I am new to regex. Can you please tell me what I am doing wrong?

nhahtdh
Angad
  • You should try to find a proper parser. Java's regex is not equipped for undefined levels of nesting. What you're doing wrong is that `.+` is greedy and will match from after the first `{{` to before the last `}}`. – Jerry Oct 03 '13 at 11:36
  • You have all the tags but the programming language you're using. – devnull Oct 03 '13 at 11:38
  • @Jerry That's what i want, delete anything which is between the first '{{' and the last '}}' – Angad Oct 03 '13 at 11:45
  • @user2823318 That means that your variable text will become empty? The first characters in that string are `{{` and the last are `}}`... Well, not empty, but only `{{}}`. – Jerry Oct 03 '13 at 11:46
  • *Regex to delete parts of an XML file* - It's a bad idea. You need a proper XML parser for such stuff. – Rahul Oct 03 '13 at 11:51
  • @Jerry Yes. In fact, I don't want the curly braces either. Just an empty "" – Angad Oct 03 '13 at 11:51
  • @his I gave an example text. The question relates to XML markup in general. – Angad Oct 03 '13 at 12:34
  • So it has nothing to do with XML at all. It is a _very_ bad idea to show us something completely different from the stuff that causes your problems. What you show is some text with non-recursive double curly brace blocks in it, which could probably be handled by regex. Using regex with XML would simply be a terrible idea. – Hauke Ingmar Schmidt Oct 03 '13 at 13:51
  • Oops, they are nesting, so regex doesn't work here either. – Hauke Ingmar Schmidt Oct 03 '13 at 14:00

1 Answer

This is not generally possible with a regular expression. Regular languages cannot describe arbitrary levels of nesting, because they have no way to "count" what level they're at.

If you absolutely must use regex, you could write an expression that works for up to, say, three levels of nesting by encoding every nesting possibility manually. But this would be extremely cumbersome, would effectively violate DRY, and is nowhere near the right tool for the job.
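To make the cost of that approach concrete, here is a rough sketch of a pattern hard-coded for two levels of nesting. The class and method names (FixedDepthBraces, strip) are made up for illustration, and the pattern assumes that no single { or } ever appears inside a block. Each extra level would need another layer wrapped around the previous one.

```java
import java.util.regex.Pattern;

class FixedDepthBraces {
    // A run of text containing no braces (possessive, so it never backtracks).
    private static final String PLAIN = "[^{}]++";
    // Level 1: a {{...}} block with no braces inside it.
    private static final String LEVEL1 = "\\{\\{[^{}]*+\\}\\}";
    // Level 2: a {{...}} block whose body is plain text and/or level-1 blocks.
    private static final String LEVEL2 =
        "\\{\\{(?:" + PLAIN + "|" + LEVEL1 + ")*+\\}\\}";

    static final Pattern TWO_LEVELS = Pattern.compile(LEVEL2);

    // Removes every block that is nested at most two levels deep.
    static String strip(String text) {
        return TWO_LEVELS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Two levels: both blocks are removed.
        System.out.println(strip("a {{x {{y}} z}} b {{c}} d")); // prints "a  b  d"
        // Three levels: only the inner two are matched, so debris is left behind.
        System.out.println(strip("{{a{{b{{c}}}}}}"));           // prints "{{a}}"
    }
}
```

The second call shows the limitation: as soon as the input nests deeper than the pattern was written for, the result is silently wrong rather than an error.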

It would likely be easier to do this "by hand", if need be. Scan across the string yourself: every time you hit a {{, increase the "brace level"; every time you hit a }}, decrease it. Copy each character to the output if and only if the brace level is zero.

Something like (untested):

StringBuilder output = new StringBuilder();
char[] input = text.toCharArray();
int braceLevel = 0;
for (int i = 0; i < input.length; i++) {
   final char c = input[i];
   if (c == '{') {
      // Check for {{
      if (i < input.length - 1 && input[i+1] == '{') {
         // Yep, it's a double brace - increase the level, consume
         // the second character and continue with the next char
         braceLevel++;
         i++;
         continue;
      }
   }
   else if (c == '}' && braceLevel > 0) {
      // Check for a closing brace similar to above
      if (i < input.length - 1 && input[i+1] == '}') {
         braceLevel--;
         i++;
         continue;
      }
   }

   if (braceLevel == 0) {
      output.append(c);
   }
}

// Now output contains every character that was not inside a {{...}} block
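
If it helps, the same scan could be wrapped in a reusable method. This is only a sketch: the names BraceStripper and stripDoubleBraces are invented here, and it uses String.startsWith(prefix, offset) instead of indexing into a char array. The sample input is a shortened version of the string from the question.

```java
class BraceStripper {
    // Returns the text with every {{...}} block removed, at any nesting depth.
    // An unclosed {{ swallows everything up to the end of the input.
    static String stripDoubleBraces(String text) {
        StringBuilder output = new StringBuilder();
        int braceLevel = 0;
        int i = 0;
        while (i < text.length()) {
            if (text.startsWith("{{", i)) {
                // Opening double brace: go one level deeper, skip both chars.
                braceLevel++;
                i += 2;
            } else if (braceLevel > 0 && text.startsWith("}}", i)) {
                // Closing double brace inside a block: come back up one level.
                braceLevel--;
                i += 2;
            } else {
                // Ordinary character: keep it only when outside all blocks.
                if (braceLevel == 0) {
                    output.append(text.charAt(i));
                }
                i++;
            }
        }
        return output.toString();
    }

    public static void main(String[] args) {
        String text = "{{Use dmy dates|date=November 2012}} Russ Conway "
            + "{{Infobox | birth_date = {{birth date|1925|09|2|df=y}} }} was a pianist.";
        System.out.println(stripDoubleBraces(text)); // prints " Russ Conway  was a pianist."
    }
}
```

A stray }} that appears outside any block is copied through unchanged, which matches the behaviour of the loop above.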
Andrzej Doyle
  • This works very well. I had hoped I would not have to resort to this. Thanks for your input. – Angad Oct 03 '13 at 12:28