
I have a bunch of short strings in the pattern of:

<text @variable1 more text><, @variable2 text ><@variable3 text text>

The @variableN tokens are placeholders, and each bracketed group is a 'section': if the variable it encloses is an empty string, the whole section is omitted. I'm thinking of using a regular expression to extract each section and then reassembling the string based on whether the corresponding variable is empty. For example, if I pass @variable1='hello' and @variable2='world' (leaving @variable3 empty), the whole string should come back as:

text hello more text, world text

At first I thought enough regex tricks would get the job done. Then I realized that sections can nest, and that I also need to escape a few special characters: obviously '<', '>' and '@'. The more I think about it, the more it looks like a DSL to me, so maybe writing a scanner would be a better idea? I know only a little about writing parsers, so I'm kinda stuck and don't know which way to go.

If anybody has experience in this kind of scenario, please shed some light on it. Thanks.

Syntax Examples

 <text @variable1 more text><, @variable2 text ><@variable3 text text>
 <text @variable1 more text><, @variable2 <, @nestedVariable> text \<@userName\> >  # with nesting and escaping
 <text @variable1 more text><, @variable2 text ><@variable3 \@twitterAccount>  # escaping '@'

Shawn
  • Might be easier to direct you if you mention the language you're using. Someone probably already has something configurable that you can use. Definitely sounds like a parser problem to me though. – fncomp Nov 14 '11 at 03:18
  • Gotcha, can you post a few lines so I can get the gist of your syntax. Probably, I'm gonna suggest adapting [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/documentation.html). – fncomp Nov 14 '11 at 03:27
  • Josh, I added a few syntax examples, hope it helps. Checked Beautiful Soup, looks xml/html specific, my case isn't particularly html, don't be confused by the < and >, they don't have anything to do with markups ;-) – Shawn Nov 14 '11 at 03:31
  • What exactly does `\<@userName\>` in the second line represent? Are the escaped brackets simply literal brackets, or is the whole thing a nested section, which might itself contain nested sections, ad infinitum? – Alan Moore Nov 14 '11 at 03:42
  • \<@userName\> means escaping < and >, sorry I just updated that example, it should be clear now. – Shawn Nov 14 '11 at 03:45

2 Answers


If you're down for writing your own parser, which would be fun for this case, then I'd check out Douglas Crockford's JSLint. He posts all the code and has some really good comments.

For something a bit more general I'd definitely check out this handy SO question: Writing a simple parser.
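As a starting point for the hand-written route, here is a minimal scanner sketch in Python. It is only an illustration under some assumptions that the question does not spell out: variable names look like ordinary identifiers, a backslash makes the next character literal, and the token names (`OPEN`, `CLOSE`, `VAR`, `TEXT`) and the `tokenize` function are made up for this example.

```python
import re

# One alternative per token kind. Every character matches some branch,
# so finditer() walks the template without skipping anything.
TOKEN = re.compile(r"""
    \\(?P<ESC>.)              # backslash escape: the next character is literal
  | @(?P<VAR>[A-Za-z_]\w*)    # placeholder (identifier-style name is an assumption)
  | (?P<OPEN><)               # start of a section
  | (?P<CLOSE>>)              # end of a section
  | (?P<TEXT>[^\\<>@]+)       # run of ordinary text
  | (?P<STRAY>.)              # stray backslash or at-sign, kept as text
""", re.VERBOSE | re.DOTALL)


def tokenize(template):
    """Split a template into (kind, text) pairs."""
    tokens = []
    for m in TOKEN.finditer(template):
        kind = m.lastgroup
        text = m.group(kind)
        if kind in ("ESC", "STRAY"):
            kind = "TEXT"  # escaped and stray characters are plain text
        if kind == "TEXT" and tokens and tokens[-1][0] == "TEXT":
            # Merge neighbouring text pieces into a single token.
            tokens[-1] = ("TEXT", tokens[-1][1] + text)
        else:
            tokens.append((kind, text))
    return tokens


if __name__ == "__main__":
    for token in tokenize(r"<, @variable2 <, @nestedVariable> \<@userName\> >"):
        print(token)
```

Once the input is a flat list of tokens like this, handling nesting is a matter of recursing on `OPEN` and returning on `CLOSE`, which is much easier to get right than doing everything in one regex.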

fncomp

You could use PHP regexes for this, but if you're open to the idea of writing a parser, I think that would be a better way to invest your time. Here's the simplest regex I've come up with to match your text:

$rgx = '~((?:[^<>\\\\]++|(?:\\\\.)++)++)|(<(?:(?1)|(?-1))*+>)~';

...and all that does is divide the string into bracketed sections vs. everything else. And it only does that at one level; you have to apply it recursively to each bracketed section until you've ferreted out all the nested sections. Not to mention all the other processing you have to do, starting with finding the variable names. Regexes can be amazingly powerful, but even more amazing is the amount of work you have left to do after all the brain sweat you put into creating the regex.

Python's regexes aren't nearly as powerful, which is probably a good thing, frustrating though it is to regex junkies like me. :P What it has instead is pyparsing. I've never used it myself, but keep hearing good things about it. It might be just what you need.
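For comparison, a hand-written recursive-descent version in plain Python (no pyparsing) stays fairly small. This is a rough sketch, not a definitive implementation, and it bakes in several assumptions the question leaves open: the `render` function and its `values` dict are hypothetical names, variable names are identifier-like, a backslash makes the next character literal, a section is dropped when any variable directly inside it is empty or missing, and text outside all brackets is always kept.

```python
import re

NAME = re.compile(r"[A-Za-z_]\w*")  # assumed shape of a variable name


def render(template, values):
    """Substitute @variables and drop sections whose own variable is empty."""
    pos = 0

    def parse(depth):
        nonlocal pos
        out = []
        keep = True  # becomes False once a variable at this level is empty
        while pos < len(template):
            ch = template[pos]
            if ch == "\\" and pos + 1 < len(template):
                out.append(template[pos + 1])  # escaped character is literal
                pos += 2
            elif ch == "<":
                pos += 1
                out.append(parse(depth + 1))  # nested section
            elif ch == ">":
                if depth == 0:
                    raise ValueError(f"unmatched '>' at offset {pos}")
                pos += 1
                return "".join(out) if keep else ""
            elif ch == "@":
                m = NAME.match(template, pos + 1)
                if m is None:
                    out.append(ch)  # lone '@' is treated as text
                    pos += 1
                else:
                    value = values.get(m.group(), "")
                    if value == "":
                        keep = False
                    out.append(value)
                    pos = m.end()
            else:
                out.append(ch)
                pos += 1
        if depth > 0:
            raise ValueError("unclosed '<'")
        return "".join(out)  # top level is never dropped

    return parse(0)


if __name__ == "__main__":
    template = "<text @variable1 more text><, @variable2 text ><@variable3 text text>"
    print(render(template, {"variable1": "hello", "variable2": "world"}))
```

With these rules, an empty nested variable removes only the nested section, while an empty variable in the outer section removes the outer section together with everything nested in it. If you want different behaviour (for example, dropping a section only when all of its variables are empty), the `keep` flag is the one place to change.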

Alan Moore