I've been looking at writing a Textile parser using Scala's parser combinator library (basically a PEG parser), and I'm wondering what approach I should use for parsing the inline modifiers, for example:
This is *bold* text, _italic_ text, +underlined+ text, etc.
In this case it's pretty clear what's what and what should be parsed. However, there are a large number of edge cases where it's not so clear. Focusing only on bold text:
Which sections get bolded:
*onomato*poeia* ?
bold *word*, without a space after?
tyr*annos*aurus
a bold word in a (*bracket*)?
How about *This *case?
Obviously this is a mix of a subjective question (which things should count as bold) and an objective one (how to write parsing rules that produce that result).
I'm leaning towards a PEG along these lines:
wordChar = [a-zA-Z]
nonWordChar = [^a-zA-Z]
boldStart = nonWordChar ~ "*" ~ wordChar
boldEnd = wordChar ~ "*" ~ nonWordChar
boldSection = boldStart ~ rep(not(boldEnd) ~ anyChar) ~ boldEnd
This would parse the examples above as follows:
<b>onomato*poeia</b> ?
bold <b>word</b>, without a space after?
tyr*annos*aurus <- fails because neither * sits at a word boundary (both have letters on each side)
a bold word in a (<b>bracket</b>)?
How about *This *case? <- fails because there is no correct closing *
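To sanity-check these expected results before committing to combinators, the same boundary rules can be approximated with a plain regular expression. This is only a rough sketch, not the PEG itself: it swaps the combinators for regex lookarounds, so the characters around each * are checked but not consumed, and the start and end of a line count as non-word boundaries. The names BoldSketch, Bold and render are made up for illustration.

```scala
import scala.util.matching.Regex

object BoldSketch {
  // Opening '*': not preceded by a letter, and followed by a letter.
  // Closing '*': preceded by a letter, and not followed by a letter.
  // The lazy (.*?) stops at the first '*' that qualifies as a closing one.
  private val Bold: Regex =
    """(?<![a-zA-Z])\*(?=[a-zA-Z])(.*?)(?<=[a-zA-Z])\*(?![a-zA-Z])""".r

  // Wraps every bold section of a single line in <b>...</b>.
  def render(line: String): String =
    Bold.replaceAllIn(line, m => Regex.quoteReplacement(s"<b>${m.group(1)}</b>"))
}
```

For instance, BoldSketch.render("*onomato*poeia* ?") gives "<b>onomato*poeia</b> ?", while "tyr*annos*aurus" and "How about *This *case?" come back unchanged. One difference from the grammar above: because the lookarounds do not consume the boundary letters, a single-letter section such as *a* is bolded here, whereas boldStart followed by boldEnd needs at least two letters.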
However, I'm not sure whether this method holds for all use cases and is well defined for all edge cases. Is there a standard way of doing this that I can copy and rely on? I'd rather not rely on my own ad-hoc, not-well-thought-through language spec if I can avoid it.