How to parse marked up text in C#

Question

I am trying to make a simple text formatter using MigraDoc for actually typesetting the text. I'd like to specify formatting by marking up the text. For example, the input might look something like this:

"The \i{quick} brown fox jumps over the lazy dog^{note}"

which would denote "quick" being italicized and "note" being superscript. To make the splits I have made a dictionary in my TextFormatter:

// A static constructor cannot carry an access modifier in C#.
static TextFormatter()
    {
        FormatDictionary = new Dictionary<string, TextFormats>()
        {
            {@"^", TextFormats.superscript},
            {@"_", TextFormats.subscript},
            {@"\i", TextFormats.italic}
        };
    }

I'm then hoping to split using some regexes that look for the modifier strings and match what is enclosed in braces.

But as multiple formats can exist in a string, I also need to keep track of which regex was matched, e.g. by getting a List<KeyValuePair<string, TextFormats>> (where the string is the enclosed text, the TextFormats value corresponds to the matching special sequence, and the items are sorted in order of appearance), which I could then iterate over, applying formatting based on the TextFormats.

Thank you for any suggestions.

does nested formatting need to be supported? so superscript and italic for example? also what have you tried, looks like you've started but you haven't actually tried to implement it. — Eluvatar, Nov 18 '13 at 17:01
What about pre-processing the text and pulling out each top level format sequence as a token? ie: "The \b{\i{quick}} brown fox jumps over the lazy dog^{note}". In this example, you would have two top level text format tokens. You can then use a stack to break apart each token into a series of composite tokens; ie: the first text format element token in the example would be broken into two tokens on the stack (bold and then italic); the second would only have one. You can then pop items off the stack tokens to apply the format in the logical nested order in which they appeared in the text. — codechurn, Nov 18 '13 at 17:05
@Eluvatar Nested formatting is a nice-, but not need-to-have. You are correct that I have not tried to implement it. I did spend some time thinking it over, but as it was getting late *violins start playing* I had the choice of writing a question, or starting something half-heartedly that I will not have time to work on again until next week. Since mark-up is fairly widespread, I thought I might have just missed an easy basic implementation. — AdamAL, Nov 18 '13 at 19:39
@codechurn I don't think I can quite follow you, but it sounds clever. Care to elaborate in an Answer? — AdamAL, Nov 18 '13 at 19:47
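
One possible reading of the stack idea from the comments, as a minimal sketch rather than necessarily what the commenter had in mind: scan the input once, push a marker on a stack when it opens a brace group, pop it at the closing brace, and emit each run of text together with the markers active around it. The class name `NestedMarkupParser`, the marker table (including `\b` from the comment's example) and the return shape are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class NestedMarkupParser
{
    // Hypothetical marker table; "\b" (bold) is taken from the comment's example.
    private static readonly string[] Markers = { @"\i", @"\b", "^", "_" };

    // Returns runs of text, each paired with the markers active around it (outermost first).
    public static List<KeyValuePair<string, string[]>> Parse(string input)
    {
        var runs = new List<KeyValuePair<string, string[]>>();
        var active = new Stack<string>();
        var text = new StringBuilder();

        // Emits the text collected so far with a snapshot of the active markers.
        Action flush = () =>
        {
            if (text.Length == 0) return;
            runs.Add(new KeyValuePair<string, string[]>(
                text.ToString(), active.Reverse().ToArray()));
            text.Clear();
        };

        int i = 0;
        while (i < input.Length)
        {
            // Is there a marker immediately followed by "{" at position i?
            string marker = Markers.FirstOrDefault(m =>
                string.CompareOrdinal(input, i, m + "{", 0, m.Length + 1) == 0);

            if (marker != null)
            {
                flush();
                active.Push(marker);          // entering a formatted group
                i += marker.Length + 1;
            }
            else if (input[i] == '}' && active.Count > 0)
            {
                flush();
                active.Pop();                 // leaving the innermost group
                i++;
            }
            else
            {
                text.Append(input[i]);        // ordinary character
                i++;
            }
        }
        flush();
        return runs;
    }
}
```

For the input `The \b{\i{quick}} brown^{note}` this sketch yields four runs: "The " with no markers, "quick" with `\b` and `\i`, " brown" with no markers, and "note" with `^`. It does not handle escaped braces or report unbalanced input.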

score 1 · Accepted Answer · edited Nov 18 '13 at 17:18

Consider the following code:

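using System.Text.RegularExpressions;

string inputMessage = @"The \i{quick} brown fox jumps over the lazy dog^{note}";
// The lookbehind requires a marker (\i, _ or ^) followed by "{" and captures the marker in group 1;
// the lookahead requires the closing "}". Only the enclosed word is part of the match itself.
MatchCollection matches = Regex.Matches(inputMessage, @"(?<=(\\i|_|\^)\{)\w*(?=\})");

foreach (Match match in matches)
{
    string textformat = match.Groups[1].Value;   // the marker, e.g. "\i" or "^"
    string enclosedstring = match.Value;         // the text inside the braces, e.g. "quick"
    // Add to Dictionary<string, TextFormats>
}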

Good Luck!

edited Nov 18 '13 at 17:18

Kuba hasn't forgotten Monica

answered Nov 18 '13 at 17:11

gpmurthy

That's awesome. I had no idea you could catch a group in lookbehind - cool. – AdamAL Nov 25 '13 at 11:54
However, it doesn't quite cut it, since the non-formatted parts are not captured. I therefore tried making the look(behind|ahead) optional: @"(?<=(\\i|_|\^)\{)?\w*(?=\})?", but then it matched way too much. I ended up using something similar to yours and then split the original string using all the match.Value's. I'll update the question with the solution I ended up with when I get around to it. – AdamAL Nov 25 '13 at 12:04
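
The asker's final solution is not shown in the thread. As a hedged sketch of one way to also keep the non-formatted parts, the match positions can be used to slice the text between matches, so no separate split is needed. The enum `SegmentFormat` (a stand-in for the question's `TextFormats` with an added `Plain` member), the class `MarkupSplitter` and its `Split` method are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the question's TextFormats, with a member for unformatted text.
public enum SegmentFormat { Plain, Superscript, Subscript, Italic }

public static class MarkupSplitter
{
    // Group 1 is the marker (\i, _ or ^); group 2 is the text inside the braces.
    private static readonly Regex Marker = new Regex(@"(\\i|_|\^)\{([^}]*)\}");

    private static readonly Dictionary<string, SegmentFormat> Formats =
        new Dictionary<string, SegmentFormat>
        {
            { "^", SegmentFormat.Superscript },
            { "_", SegmentFormat.Subscript },
            { @"\i", SegmentFormat.Italic }
        };

    // Returns all segments, formatted and plain, in order of appearance.
    public static List<KeyValuePair<string, SegmentFormat>> Split(string input)
    {
        var segments = new List<KeyValuePair<string, SegmentFormat>>();
        int position = 0;

        foreach (Match match in Marker.Matches(input))
        {
            // Unformatted text between the previous match and this one.
            if (match.Index > position)
                segments.Add(new KeyValuePair<string, SegmentFormat>(
                    input.Substring(position, match.Index - position), SegmentFormat.Plain));

            // The formatted text, tagged with the format its marker maps to.
            segments.Add(new KeyValuePair<string, SegmentFormat>(
                match.Groups[2].Value, Formats[match.Groups[1].Value]));

            position = match.Index + match.Length;
        }

        // Trailing unformatted text after the last match.
        if (position < input.Length)
            segments.Add(new KeyValuePair<string, SegmentFormat>(
                input.Substring(position), SegmentFormat.Plain));

        return segments;
    }
}
```

Calling `MarkupSplitter.Split` on the question's example string gives four segments in order: "The " (Plain), "quick" (Italic), " brown fox jumps over the lazy dog" (Plain) and "note" (Superscript). Unlike the `\w*` pattern in the answer, `[^}]*` also accepts spaces inside the braces; nesting is not supported.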

score 0 · Answer 2 · answered Nov 18 '13 at 17:51

I'm not sure if callbacks are available in .NET, but if you have strings like "The \i{quick} brown fox jumps over the lazy dog^{note}" and you just want to do the substitution as you find them, you could use a regex replace with a callback:

 #  @"(\\i|_|\^){([^}]*)}"

 ( \\i | _ | \^ )         # (1)
 {
 ( [^}]* )                # (2)
 }

Then, in the callback, examine capture group 1 for the format and replace the match with {fmtCodeStart}\2{fmtCodeEnd}.
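
.NET does offer such a callback: `Regex.Replace` has an overload that takes a `MatchEvaluator` delegate. The following is a minimal sketch of the single-group variant. The class `MarkupReplacer` and the HTML-like tags standing in for the `{fmtCodeStart}`/`{fmtCodeEnd}` placeholders are assumptions chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MarkupReplacer
{
    // Hypothetical output tags; a real formatter would use whatever its target format needs.
    private static readonly Dictionary<string, string> Tags =
        new Dictionary<string, string>
        {
            { @"\i", "i" },
            { "_", "sub" },
            { "^", "sup" }
        };

    public static string Replace(string input)
    {
        // The lambda is converted to a MatchEvaluator and called once per match.
        return Regex.Replace(input, @"(\\i|_|\^)\{([^}]*)\}", match =>
        {
            string tag = Tags[match.Groups[1].Value];   // group 1: the marker
            string body = match.Groups[2].Value;        // group 2: the enclosed text
            return "<" + tag + ">" + body + "</" + tag + ">";
        });
    }
}
```

With the example string from above, `MarkupReplacer.Replace` returns `The <i>quick</i> brown fox jumps over the lazy dog<sup>note</sup>`.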

or you could use

 #  @"(?:(\\i)|(_)|(\^)){([^}]*)}"

 (?:
      ( \\i )             # (1)
   |  ( _ )               # (2)
   |  ( \^ )              # (3)
 )
 {
 ( [^}]* )                # (4)
 }

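then in the callback:

 if (match.Groups[1].Success)          // \i
   return "{fmtCode1Start}" + match.Groups[4].Value + "{fmtCode1End}";
 else if (match.Groups[2].Success)     // _
   return "{fmtCode2Start}" + match.Groups[4].Value + "{fmtCode2End}";
 else if (match.Groups[3].Success)     // ^
   return "{fmtCode3Start}" + match.Groups[4].Value + "{fmtCode3End}";
 return match.Value;                   // not reached for this pattern; leaves the text unchanged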
