I am trying to parse wikitext received through Wikipedia's API. The problem is that some of its templates (i.e. snippets enclosed in {{ and }}) are not automatically expanded into wikitext, so I have to find them in the article source myself and replace them afterwards. Can I use a regex in .NET to get these matches from the text?

To try to make myself more clear, here is an example to illustrate what I mean:

For the string

{{ abc {{...}} def {{.....}} gh }}

there should be a single match, namely the entire string, so the longest possible match.

On the other hand, for "orphaned" braces such as in this example:

{{ abc {{...}}

the result should be a single match: {{...}}

Could anyone offer me a suggestion? Thanks in advance.

polygenelubricants
Gabriel S.

4 Answers

You can do this with .NET regex using balancing group definitions.

The example given in the documentation shows how it works with nestable < and >. You can easily adapt the delimiters to {{ and }}. You can adapt it further to allow for single { and } within the "text" if you want.

Remember that { and } are regex metacharacters; to match them literally, escape them as \{ and \}.

polygenelubricants

Don't do it with regex. Go through the string from left to right: when you encounter a {{, push its position onto a stack; when you encounter a }}, pop the position of the most recent {{ from the stack and calculate the length of that span. Then you can easily take the maximum of these lengths.
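The stack-based scan can be sketched as follows. This is a minimal Python illustration rather than .NET code, and the function name `outermost_templates` is made up for the example. It assumes that only double braces matter (triple braces such as {{{param}}} get no special treatment) and that a }} without an open {{ is simply skipped. Instead of taking one maximum length, it keeps every matched span that is not nested inside another matched span, which covers both cases from the question.

```python
def outermost_templates(text):
    """Return (start, end) spans of the outermost balanced {{...}} blocks.

    `end` is exclusive, so text[start:end] is the matched block.
    """
    stack = []   # positions of '{{' that have not been closed yet
    pairs = []   # every matched (start, end) pair, nested ones included
    i = 0
    while i < len(text) - 1:
        two = text[i:i + 2]
        if two == "{{":
            stack.append(i)
            i += 2
        elif two == "}}":
            if stack:  # assumption: a '}}' with no opener is ignored
                pairs.append((stack.pop(), i + 2))
            i += 2
        else:
            i += 1

    # Matched pairs are either disjoint or properly nested, so after
    # sorting by start, a pair that begins before the last kept pair
    # ends must lie inside it and can be dropped.
    result = []
    for start, end in sorted(pairs):
        if not result or start >= result[-1][1]:
            result.append((start, end))
    return result


if __name__ == "__main__":
    for sample in ("{{ abc {{...}} def {{.....}} gh }}", "{{ abc {{...}}"):
        print([sample[s:e] for s, e in outermost_templates(sample)])
```

Any {{ left on the stack at the end is an unpaired opener; it is discarded here, which is why the second sample yields only the inner block.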

CodesInChaos
  • You're right, I tried using a stack and it is indeed a more suitable approach in this case. I am not very comfortable with regular expressions yet, but I suspect that the regex solutions would not always work as expected if there were unpaired braces in the string. – Gabriel S. Oct 14 '10 at 13:48

This regex matches any number of repetitions of the pattern you mentioned, as long as the templates are nested only one level deep:

\{\{(?:[^{]+\{\{[^}]+\}\})+[^}]+\}\}

For the second request, you'll need a different regex:

\{\{.*?\}\}
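
A quick way to see where these two patterns stop working is to run them against the examples from the question. The sketch below uses Python's `re` module, on the assumption that it treats these particular constructs the same way .NET does; the variable names and the `doubly_nested` sample are illustrative only.

```python
import re

# The two patterns from this answer, unchanged.
nested = re.compile(r"\{\{(?:[^{]+\{\{[^}]+\}\})+[^}]+\}\}")
lazy = re.compile(r"\{\{.*?\}\}")

balanced = "{{ abc {{...}} def {{.....}} gh }}"
orphaned = "{{ abc {{...}}"
doubly_nested = "{{ a {{ b {{c}} d }} e }}"

# One level of nesting: the first pattern covers the whole string.
whole = nested.search(balanced).group()

# Orphaned opener: the lazy pattern starts at the leftmost '{{',
# so it returns the whole string rather than just '{{...}}'.
from_orphan = lazy.search(orphaned).group()

# Two levels of nesting: the first pattern stops at an inner '}}'.
partial = nested.search(doubly_nested).group()

if __name__ == "__main__":
    print(repr(whole))
    print(repr(from_orphan))
    print(repr(partial))
```

Because a regex match always begins at the leftmost possible position, making the quantifier lazy does not help with an unpaired {{; handling that case needs either balancing groups or an explicit scan.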
Vantomex

I think you're looking at this on the wrong level. Instead of hacky regex workarounds, why not just ask the MediaWiki API to expand templates for you? You can either pass in content to be expanded:

http://www.mediawiki.org/wiki/API:Parsing_wikitext#expandtemplates

Or, better yet, ask for the templates in the content to be pre-expanded as you download it by specifying rvexpandtemplates:

http://www.mediawiki.org/wiki/API:Query_-_Properties#revisions
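
As a rough illustration of the first option, the sketch below only builds a request URL for the expandtemplates module and does not send it. The endpoint (English Wikipedia), the helper name `expandtemplates_url`, and the choice of `prop=wikitext` with JSON output are assumptions made for the example; check the linked API pages for the parameters your wiki's version actually supports.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute the api.php URL of the wiki you are querying.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def expandtemplates_url(wikitext):
    """Build (but do not send) a URL asking the API to expand `wikitext`."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,      # the snippet containing {{...}} templates
        "prop": "wikitext",    # assumed: request the expanded wikitext
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(expandtemplates_url("{{lc:ABC}}"))
```

For long snippets a POST request is likely the safer choice, since the whole template text would otherwise end up in the query string.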

lambshaanxy
  • Indeed, jpatokal, that would be the ideal case: let the MediaWiki engine perform all the expansions. However, there are some "exotic" situations in which some of the wikitext templates in the articles are not expanded despite setting the proper parameters to do so. That's why I have to "manually" gather all the remaining unexpanded templates afterward and either process them myself or query the MediaWiki engine again, this time to expand only those specific templates (which might prove quite expensive). Thanks anyway for your suggestions! – Gabriel S. Nov 08 '10 at 07:58