68

How do I go about writing a Parser (Recursive Descent?) in C#? For now I just want a simple parser that parses arithmetic expressions (and reads variables?). Though later I intend to write an xml and html parser (for learning purposes). I am doing this because of the wide range of stuff in which parsers are useful: Web development, Programming Language Interpreters, Inhouse Tools, Gaming Engines, Map and Tile Editors, etc. So what is the basic theory of writing parsers and how do I implement one in C#? Is C# the right language for parsers (I once wrote a simple arithmetic parser in C++ and it was efficient. Will JIT compilation prove equally good?). Any helpful resources and articles. And best of all, code examples (or links to code examples).

Note: Out of curiosity, has anyone answering this question ever implemented a parser in C#?

Kirill Kobelev
ApprenticeHacker
  • 3
    Do you want to implement the parser yourself, or do you want to use a library that takes a grammar and then creates the parser by itself? – CodesInChaos Sep 11 '11 at 09:31
  • 6
    [ANTLR](http://antlr.org/) is a staple in the .NET community for parsing custom languages, though it voids your learning experience. – Richard Szalay Sep 11 '11 at 09:31
  • 1
    For an arithmetic expression parser I'd personally lean towards shunting-yard. Re "has anyone" - I've done a few highly specialized parsers, but I don't know about the availability of the more general-purpose parser generators for C# – Marc Gravell Sep 11 '11 at 09:32
  • 1
    @CodeInChaos I want to implement it myself. – ApprenticeHacker Sep 11 '11 at 11:41
  • 3
    I wrote my first "real" parser after reading this article (http://www.cs.nott.ac.uk/~gmh/monparsing.pdf). It's meant for functional languages but it should give some insight how to design composable parsers. – Just another metaprogrammer Sep 11 '11 at 13:27
  • 3
    @TomTom, you're wrong. There are so very different idiomatic approaches for different languages. You can't write a parser the same way in Fortran and, say, Haskell. In C# you can use, say, combinators, just like in the real programming languages, and it can be a sensible approach for some grammars. – SK-logic Sep 11 '11 at 19:27
  • @TomTom SK-logic is right. You can't write an OOP parser the same way you write a functional parser. And you have to adapt and co-operate with the features of the language. E.g if I was using C, I would almost surely have used pointers, whereas in C# I don't need to. – ApprenticeHacker Sep 12 '11 at 04:04

7 Answers

92

I have implemented several parsers in C# - hand-written and tool generated.

A very good introductory tutorial on parsing in general is Let's Build a Compiler - it demonstrates how to build a recursive descent parser; and the concepts are easily translated from his language (I think it was Pascal) to C# for any competent developer. This will teach you how a recursive descent parser works, but it is completely impractical to write a full programming language parser by hand.
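
To give a feel for what that looks like in C#, here is a minimal sketch of a recursive descent evaluator for arithmetic expressions. The class name `ArithmeticParser`, the grammar in the comment and the error messages are my own assumptions for illustration; they are not taken from the tutorial. Each grammar rule becomes one method, and operator precedence falls out of which rule calls which.

```csharp
using System;
using System.Globalization;

// Illustrative grammar (an assumption, not from the tutorial):
//   expr   := term   (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')' | '-' factor
public sealed class ArithmeticParser
{
    private readonly string _text;
    private int _pos;

    private ArithmeticParser(string text) { _text = text; }

    // Parses and evaluates the whole input; throws FormatException on bad input.
    public static double Evaluate(string text)
    {
        var parser = new ArithmeticParser(text);
        double value = parser.ParseExpr();
        parser.SkipWhitespace();
        if (parser._pos != text.Length)
            throw new FormatException($"Unexpected '{text[parser._pos]}' at position {parser._pos}");
        return value;
    }

    // expr := term (('+' | '-') term)*
    private double ParseExpr()
    {
        double left = ParseTerm();
        while (true)
        {
            if (Accept('+')) left += ParseTerm();
            else if (Accept('-')) left -= ParseTerm();
            else return left;
        }
    }

    // term := factor (('*' | '/') factor)*
    private double ParseTerm()
    {
        double left = ParseFactor();
        while (true)
        {
            if (Accept('*')) left *= ParseFactor();
            else if (Accept('/')) left /= ParseFactor();
            else return left;
        }
    }

    // factor := NUMBER | '(' expr ')' | '-' factor
    private double ParseFactor()
    {
        if (Accept('-')) return -ParseFactor();
        if (Accept('('))
        {
            double inner = ParseExpr();
            if (!Accept(')')) throw new FormatException($"Expected ')' at position {_pos}");
            return inner;
        }
        return ParseNumber();
    }

    // Reads a run of digits and dots and lets double.Parse validate it.
    private double ParseNumber()
    {
        SkipWhitespace();
        int start = _pos;
        while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (start == _pos) throw new FormatException($"Expected a number at position {_pos}");
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    // Consumes the character c if it is next (after whitespace) and reports whether it did.
    private bool Accept(char c)
    {
        SkipWhitespace();
        if (_pos < _text.Length && _text[_pos] == c) { _pos++; return true; }
        return false;
    }

    private void SkipWhitespace()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
    }
}
```

With this sketch, `ArithmeticParser.Evaluate("2+5*3")` evaluates the multiplication first because `ParseExpr` delegates to `ParseTerm`. A real parser would normally build a syntax tree instead of computing the value on the fly; evaluating directly just keeps the example short.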

You should look into some tools to generate the code for you - if you are determined to write a classical recursive descent parser (TinyPG, Coco/R, Irony). Keep in mind that there are other ways to write parsers now, that usually perform better - and have easier definitions (e.g. TDOP parsing or Monadic Parsing).
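
For comparison, here is a rough sketch of the TDOP (Pratt) idea applied to the same kind of arithmetic. Everything in it (the `PrattCalculator` name, the binding power numbers, the error messages) is an assumption made up for this example. The point is that precedence lives in a table of binding powers rather than in one method per precedence level.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed class PrattCalculator
{
    // Binding power of each infix operator; a higher number binds tighter.
    // The numbers themselves are arbitrary illustrative choices.
    private static readonly Dictionary<char, int> BindingPower = new Dictionary<char, int>
    {
        ['+'] = 10, ['-'] = 10, ['*'] = 20, ['/'] = 20
    };
    private const int PrefixPower = 30; // unary minus binds tighter than any infix operator

    private readonly string _text;
    private int _pos;

    private PrattCalculator(string text) { _text = text; }

    public static double Evaluate(string text)
    {
        var p = new PrattCalculator(text);
        double value = p.ParseExpression(0);
        if (p.Peek() != '\0')
            throw new FormatException($"Unexpected '{p.Peek()}' at position {p._pos}");
        return value;
    }

    // Parses an expression whose infix operators all bind tighter than minPower.
    private double ParseExpression(int minPower)
    {
        double left = ParsePrefix();
        while (true)
        {
            char op = Peek();
            if (!BindingPower.TryGetValue(op, out int power) || power <= minPower)
                return left;
            _pos++; // consume the operator
            // Passing the operator's own power makes it left-associative.
            double right = ParseExpression(power);
            left = Apply(op, left, right);
        }
    }

    // Handles whatever can start an expression: unary minus, parentheses, or a number.
    private double ParsePrefix()
    {
        char c = Peek();
        if (c == '-') { _pos++; return -ParseExpression(PrefixPower); }
        if (c == '(')
        {
            _pos++;
            double inner = ParseExpression(0);
            if (Peek() != ')') throw new FormatException($"Expected ')' at position {_pos}");
            _pos++;
            return inner;
        }
        int start = _pos;
        while (_pos < _text.Length && (char.IsDigit(_text[_pos]) || _text[_pos] == '.')) _pos++;
        if (start == _pos) throw new FormatException($"Expected a number at position {_pos}");
        return double.Parse(_text.Substring(start, _pos - start), CultureInfo.InvariantCulture);
    }

    private static double Apply(char op, double left, double right)
    {
        switch (op)
        {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            default: return left / right; // only '/' remains in the table
        }
    }

    // Skips whitespace and returns the next character, or '\0' at end of input.
    private char Peek()
    {
        while (_pos < _text.Length && char.IsWhiteSpace(_text[_pos])) _pos++;
        return _pos < _text.Length ? _text[_pos] : '\0';
    }
}
```

Adding a new operator here means adding one entry to `BindingPower` and one case to `Apply`, which is the "easier definitions" part in practice.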

On the topic of whether C# is up for the task - C# has some of the best text libraries out there. A lot of the parsers today (in other languages) have an obscene amount of code to deal with Unicode etc. I won't comment too much on JITted code because it can get quite religious - however you should be just fine. IronJS is a good example of a parser/runtime on the CLR (even though it's written in F#) and its performance is just shy of Google V8.

Side Note: Markup parsers are completely different beasts when compared to language parsers - they are, in the majority of the cases, written by hand - and at the scanner/parser level very simple; they are not usually recursive descent - and especially in the case of XML it is better if you don't write a recursive descent parser (to avoid stack overflows, and because a 'flat' parser can be used in SAX/push mode).

GolezTrol
Jonathan Dickinson
  • Thanks. (+1 though you deserve +10). Another thing I would like to confirm is, if I ever do succeed in making an actual programming language interpreter in C# (some simple language), will that programming language be considered a .NET compatible language (like iron-python or boo) or just a language made with C#? – ApprenticeHacker Sep 11 '11 at 11:40
  • 2
    @IntermediateHacker it depends on whether you emit MSIL. A lot of the '.Net languages' started their lifetime in C#, and eventually got rewritten in themselves (this is called bootstrapping). If you make an interpreter it will be a 'C# language' as such (but will/can still be used from other .Net languages). – Jonathan Dickinson Sep 11 '11 at 11:44
  • Lol you are right not to comment about JITed code. I don't want this question to start the old "which is faster" war. – ApprenticeHacker Sep 11 '11 at 11:46
  • 2
    Jonathan mentioned a lot of good frameworks, but I'd like to add: http://www.quanttec.com/fparsec/ . It's meant for F#, but it has good configuration and good performance, and it produces human-readable parser error messages OOB. – Just another metaprogrammer Sep 11 '11 at 13:30
  • 1
    Working link using WaybackMachine [Let's Build a Compiler](http://web.archive.org/web/20131226083407/http://compilers.iecc.com/crenshaw/) – Measurity Jan 14 '14 at 21:54
  • Have a look at this, parsec c# lib: https://github.com/linerlock/parseq – FrankyHollywood Aug 24 '17 at 11:57
20

Sprache is a powerful yet lightweight framework for writing parsers in .NET. There is also a Sprache NuGet package. To give you an idea of the framework, here is one of the samples, which parses a simple arithmetic expression into a .NET expression tree. Pretty amazing, I would say.

    using System;
    using System.Linq.Expressions;
    using Sprache;

    namespace LinqyCalculator
    {
        static class ExpressionParser
        {
            public static Expression<Func<decimal>> ParseExpression(string text)
            {
                return Lambda.Parse(text);
            }

            static Parser<ExpressionType> Operator(string op, ExpressionType opType)
            {
                return Parse.String(op).Token().Return(opType);
            }

            static readonly Parser<ExpressionType> Add = Operator("+", ExpressionType.AddChecked);
            static readonly Parser<ExpressionType> Subtract = Operator("-", ExpressionType.SubtractChecked);
            static readonly Parser<ExpressionType> Multiply = Operator("*", ExpressionType.MultiplyChecked);
            static readonly Parser<ExpressionType> Divide = Operator("/", ExpressionType.Divide);

            static readonly Parser<Expression> Constant =
                (from d in Parse.Decimal.Token()
                 select (Expression)Expression.Constant(decimal.Parse(d))).Named("number");

            static readonly Parser<Expression> Factor =
                ((from lparen in Parse.Char('(')
                  from expr in Parse.Ref(() => Expr)
                  from rparen in Parse.Char(')')
                  select expr).Named("expression")
                 .XOr(Constant)).Token();

            static readonly Parser<Expression> Term = Parse.ChainOperator(Multiply.Or(Divide), Factor, Expression.MakeBinary);

            static readonly Parser<Expression> Expr = Parse.ChainOperator(Add.Or(Subtract), Term, Expression.MakeBinary);

            static readonly Parser<Expression<Func<decimal>>> Lambda =
                Expr.End().Select(body => Expression.Lambda<Func<decimal>>(body));
        }
    }

Juliano Sales
Martin Liversage
4

C# is almost a decent functional language, so it is not such a big deal to implement something like Parsec in it. Here is one of the examples of how to do it: http://jparsec.codehaus.org/NParsec+Tutorial

It is also possible to implement a combinator-based Packrat parser in a very similar way, but this time keeping global parsing state somewhere instead of doing purely functional stuff. In my (very basic and ad hoc) implementation it was reasonably fast, but of course a code generator like this must perform better.
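
To make the combinator idea a bit more concrete, here is a small hand-rolled sketch. It is not NParsec's API and it is not a Packrat parser (there is no memoization); `Result<T>`, `Parser<T>`, `Combinators` and `SumGrammar` are all hypothetical names invented for this example. It only shows how parsers can be ordinary functions that are glued together, and how `Select`/`SelectMany` let C# query syntax do the sequencing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Outcome of running a parser: a value and the next position, or a failure position.
public sealed class Result<T>
{
    public bool Success { get; }
    public T Value { get; }
    public int Next { get; }
    private Result(bool success, T value, int next) { Success = success; Value = value; Next = next; }
    public static Result<T> Ok(T value, int next) => new Result<T>(true, value, next);
    public static Result<T> Fail(int at) => new Result<T>(false, default(T), at);
}

// A parser is just a function from (input, position) to a result.
public delegate Result<T> Parser<T>(string input, int pos);

public static class Combinators
{
    // Matches a single character that satisfies the predicate.
    public static Parser<char> Char(Func<char, bool> predicate) =>
        (input, pos) => pos < input.Length && predicate(input[pos])
            ? Result<char>.Ok(input[pos], pos + 1)
            : Result<char>.Fail(pos);

    // Applies the parser zero or more times and collects the values.
    public static Parser<List<T>> Many<T>(this Parser<T> parser) => (input, pos) =>
    {
        var items = new List<T>();
        for (var r = parser(input, pos); r.Success; r = parser(input, pos))
        {
            if (r.Next == pos) break; // no progress: stop instead of looping forever
            items.Add(r.Value);
            pos = r.Next;
        }
        return Result<List<T>>.Ok(items, pos);
    };

    // Tries the first parser and falls back to the second at the same position.
    public static Parser<T> Or<T>(this Parser<T> first, Parser<T> second) => (input, pos) =>
    {
        var r = first(input, pos);
        return r.Success ? r : second(input, pos);
    };

    // Select and SelectMany are what the C# query syntax compiles down to.
    public static Parser<U> Select<T, U>(this Parser<T> parser, Func<T, U> map) => (input, pos) =>
    {
        var r = parser(input, pos);
        return r.Success ? Result<U>.Ok(map(r.Value), r.Next) : Result<U>.Fail(r.Next);
    };

    public static Parser<V> SelectMany<T, U, V>(
        this Parser<T> parser, Func<T, Parser<U>> bind, Func<T, U, V> project) => (input, pos) =>
    {
        var first = parser(input, pos);
        if (!first.Success) return Result<V>.Fail(first.Next);
        var second = bind(first.Value)(input, first.Next);
        return second.Success
            ? Result<V>.Ok(project(first.Value, second.Value), second.Next)
            : Result<V>.Fail(second.Next);
    };
}

// Tiny example grammar: sum := number ('+' number)*
public static class SumGrammar
{
    static readonly Parser<int> Number =
        from first in Combinators.Char(char.IsDigit)
        from rest in Combinators.Char(char.IsDigit).Many()
        select int.Parse(first + new string(rest.ToArray()));

    static readonly Parser<int> PlusNumber =
        from plus in Combinators.Char(c => c == '+')
        from n in Number
        select n;

    public static readonly Parser<int> Sum =
        from head in Number
        from tail in PlusNumber.Many()
        select head + tail.Sum();
}
```

Calling `SumGrammar.Sum("12+30+4", 0)` runs the composed parser from position 0. A Packrat variant would additionally cache results per (rule, position), which is where the shared parsing state mentioned above comes in.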

SK-logic
3

I know that I am a little late, but I just published a parser/grammar/AST generator library named Ve Parser. You can find it at http://veparser.codeplex.com or add it to your project by typing 'Install-Package veparser' in the Package Manager Console. This library is a kind of recursive descent parser that is intended to be easy to use and flexible. As its source is available to you, you can learn from its source code. I hope it helps.

000
1

In my opinion, there is a better way to implement parsers than the traditional methods that results in simpler and easier to understand code, and especially makes it easier to extend whatever language you are parsing by just plugging in a new class in a very object-oriented way. One article of a larger series that I wrote focuses on this parsing method, and full source code is included for a C# 2.0 parser: http://www.codeproject.com/Articles/492466/Object-Oriented-Parsing-Breaking-With-Tradition-Pa

Ken Beckett
0

For the record, I implemented a parser generator in C# just because I couldn't find any that worked properly or was similar to YACC (see: http://sourceforge.net/projects/naivelangtools/).

However, after some experience with ANTLR I decided to go with LALR instead of LL. I know that theoretically LL is easier to implement (generator or parser), but I simply cannot live with a stack of expressions just to express operator priorities (like * going before + in "2+5*3"). In LL you say that mult_expr is embedded inside add_expr, which does not seem natural to me.

greenoldman
0

Well... where to start with this one....

First off, "writing a parser" is a very broad statement, especially given the question you're asking.

Your opening statement was that you wanted a simple arithmetic "parser". Technically that's not a parser, it's a lexical analyzer, similar to what you might use for creating a new language ( http://en.wikipedia.org/wiki/Lexical_analysis ). I understand, however, exactly where the confusion of them being the same thing may come from. It's important to note that lexical analysis is ALSO what you'll want to understand if you're going to write language/script parsers; this is strictly not parsing because you are interpreting the instructions as opposed to making use of them.

Back to the parsing question....

This is what you'll be doing if you're taking a rigidly defined file structure and extracting information from it.

In general you really don't have to write a parser for XML / HTML, because there are already a ton of them around. More so, if you're parsing XML produced by the .NET runtime, then you don't even need to parse; you just need to "serialise" and "de-serialise".

In the interests of learning, however, parsing XML (or anything similar, like HTML) is very straightforward in most cases.

If we start with the following XML:

    <movies>
      <movie id="1">
        <name>Tron</name>
      </movie>
      <movie id="2">
        <name>Tron Legacy</name>
      </movie>
    </movies>

we can load the data into an XDocument as follows:

    XDocument myXML = XDocument.Load("mymovies.xml");

You can then get at the 'movies' root element using 'myXML.Root'.

More interestingly, you can easily use LINQ to get the nested tags:

    var myElements = from p in myXML.Root.Elements("movie")
                     select p;

This will give you a sequence of XElements, each containing one 'movie' element, which you can get at using something like:

    foreach(var v in myElements)
    {
      Console.WriteLine(string.Format("ID {0} = {1}", (int)v.Attribute("id"), (string)v.Element("name")));
    }

For anything other than XML-like data structures, I'm afraid you're going to have to start learning the art of regular expressions. A tool like "Regular Expression Coach" will help you immensely ( http://weitz.de/regex-coach/ ), or one of the more up-to-date similar tools.

You'll also need to become familiar with the .NET regular expression objects; ( http://www.codeproject.com/KB/dotnet/regextutorial.aspx ) should give you a good head start.

Once you know how your reg-ex stuff works, then in most cases it's a simple case of reading in the files one line at a time and making sense of them using whichever method you feel comfortable with.
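
As a rough illustration of that line-at-a-time approach, here is a small sketch that reads `name = value` lines with a .NET `Regex`. The file format, the `KeyValueReader` class and the pattern are all made up for this example; real formats will need their own patterns.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class KeyValueReader
{
    // Hypothetical line format: a word, an equals sign, then the value.
    // Whitespace around the key and the value is ignored.
    private static readonly Regex Line =
        new Regex(@"^\s*(?<key>\w+)\s*=\s*(?<value>.*?)\s*$");

    // Reads the input one line at a time and keeps only the lines that match.
    public static Dictionary<string, string> Read(TextReader reader)
    {
        var result = new Dictionary<string, string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Match m = Line.Match(line);
            if (m.Success)
                result[m.Groups["key"].Value] = m.Groups["value"].Value;
        }
        return result;
    }
}
```

You could feed it a file with `KeyValueReader.Read(File.OpenText("settings.txt"))` (the file name is just a placeholder) or a `StringReader` for testing. Lines that don't fit the pattern are silently skipped here; whether that is acceptable depends on the format you're reading.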

A good free source of file formats for almost anything you can imagine can be found at ( http://www.wotsit.org/ )

shawty
  • 1
    While I agree that he should use the built-in XML parsing in .Net - he does want to write the parser as an academic exercise. – Jonathan Dickinson Sep 11 '11 at 10:04
  • Exactly why, I showed how to use the raw XElement objects also. – shawty Sep 11 '11 at 10:08
  • On a side note, there are 6 or 7 different APIs in C# and .NET that could be used. I don't have books to hand to list them all, but after XDoc/XElement I believe are the XPath ones. – shawty Sep 11 '11 at 10:09
  • @shawty XElement is part of an XML parser; to create a parser, he needs to take an XML string and turn it into something usable – Will03uk Sep 11 '11 at 10:41
  • I see your point, but I disagree, it's part of the language which will allow you to write your own parser. Actual parsing in .NET is where a document fragment is completely consumed, such as when using the RSS/Atom feed objects, or some other de-serialisation. XElement is a tool to help you write your parser; you still have to instruct it how and in what order to parse the contents. Likewise regex is a tool to split apart text, and in the same manner 'string.split' is too: all tools, but not a complete parser that deals with a complete file. – shawty Sep 11 '11 at 10:49
  • 4
    Maybe it is a terminology thing, but in the case of XML I would consider the XmlReader to be the "parser" here... Not my specialist subject, though. – Marc Gravell Sep 11 '11 at 11:15