How to parse actual code like stackoverflow/intellisense/etc?

Question

I was wondering how Stack Overflow parses all sorts of different code and identifies keywords, special characters, whitespace formatting, etc. It does this for most code, I believe, and I've noticed it's even sophisticated enough to understand the relationships between the things it parses, like so:

String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";

Many IDEs do this also. How is this done?

Edit: Further explanation - I am not asking about the parsing of the text. My question is, once I am past that part, is there something like a universal XML Schema or cross-language format hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, etc.? Or must I become a syntax guru for every language I wish to parse accurately?

The SO highlighter is often wrong (especially in languages which are not called "C#"). Since you don't have the opportunity to tell it what language you're using, it probably just has some heuristics for what's commonly a comment/string/etc. in many common languages. It's probably a lot less sophisticated than, say, Intellisense (which itself isn't the end-all-be-all of parsers). — Ken, Aug 19 '10 at 00:00
If I had input from a user on what language to parse, how then would I attempt this? Is there an easier method then? — stupidkid, Aug 19 '10 at 00:23
@stupidkid: See my answer. What you're asking for is the first step in the compilation process. It's not trivial to do right. — Borealid, Aug 19 '10 at 00:26
Knowing what language is used is important, but guessing the language is much easier than parsing it. All but the simplest languages are a challenge to parse. Once you've parsed the code you're over the hump from a highlighting point of view, but it's a steep hurdle. I'm also confused by your comment: on one hand you say you're past the parsing part, but later you say you want to parse accurately. Good luck. — Paul Rubel, Aug 19 '10 at 01:11
@paulruben - Thanks for your answer. My comment is probably confusing because I am using the term 'parsing' in its most generic sense: simply reading the code into memory, not going as far as attaching meaning to particular words and sentences. — stupidkid, Aug 19 '10 at 03:26

Paul Rubel · Accepted Answer · 2010-08-19T01:04:48.523

To really have your IDE/compiler/interpreter "understand" and colorize code, you'll need to parse it and pull out the different syntactic parts. The classic reference for this is the Dragon Book, "Compilers: Principles, Techniques, and Tools." You can see some of the difficulty in constructs like this:

i+++++i;

or

list<list<hash<list<int>,hash<int,<list>>>>>;
//or just matching parens
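
To make the first construct concrete: C-family lexers generally use a longest-match ("maximal munch") rule, so the tokenizer never considers the split a human might expect. Here is a minimal sketch in Python, assuming a made-up three-entry operator table (`OPERATORS`) rather than a real language's full operator set:

```python
# Hypothetical operator table for illustration; a real lexer knows many more.
OPERATORS = ["++", "+", ";"]

def tokenize(source):
    """Split source into tokens, always taking the longest operator that matches."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():
            pos += 1
        elif ch.isalnum() or ch == "_":
            # Identifier or number: consume the whole run of word characters.
            start = pos
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            tokens.append(source[start:pos])
        else:
            # Try longer operators before shorter ones ("maximal munch").
            for op in sorted(OPERATORS, key=len, reverse=True):
                if source.startswith(op, pos):
                    tokens.append(op)
                    pos += len(op)
                    break
            else:
                raise ValueError(f"unexpected character {ch!r} at offset {pos}")
    return tokens

print(tokenize("i+++++i;"))  # ['i', '++', '++', '+', 'i', ';']
```

The tokens group as `(i++)++ + i`, which a C compiler rejects, even though writing it with spaces as `i++ + ++i` would get past the parser.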

Properly doing this is a hard problem. Some languages, like Java, make this easier than others, such as C and C++ (which both have standards) or Ruby (which doesn't even have a spec and relies on the implementation as the spec). However, if you only want to do a few bits of highlighting, you can skip large parts of the grammar and get an 80% solution more easily. I suspect that the SO engine knows about strings and a few different types of comments, and that this does well enough for their purpose.
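
As a rough idea of what such an 80% solution might look like, here is a sketch in Python that only knows about double-quoted strings and `//` line comments. The function name `classify` and the span labels are made up for illustration, and this is not how the SO highlighter actually works:

```python
def classify(source):
    """Return (kind, text) spans; kind is 'code', 'string', or 'comment'."""
    spans = []
    n = len(source)
    pos = 0
    plain_start = 0  # start of the current run of ordinary code
    while pos < n:
        if source.startswith("//", pos):
            # Line comment: runs to the end of the line.
            end = source.find("\n", pos)
            if end == -1:
                end = n
            kind = "comment"
        elif source[pos] == '"':
            # String: runs to the closing quote, skipping backslash escapes.
            end = pos + 1
            while end < n and source[end] not in '"\n':
                end += 2 if source[end] == "\\" else 1
            end = min(end, n)
            if end < n and source[end] == '"':
                end += 1  # include the closing quote
            kind = "string"
        else:
            pos += 1
            continue
        if pos > plain_start:
            spans.append(("code", source[plain_start:pos]))
        spans.append((kind, source[pos:end]))
        pos = plain_start = end
    if plain_start < n:
        spans.append(("code", source[plain_start:]))
    return spans

for line in ['String mystring2 = "inquotes//incomment";',
             'String mystring3 = //incomment"inquotes";']:
    print(classify(line))
```

Whichever delimiter opens first wins: a `//` inside a string stays part of the string, and a quote inside a comment stays part of the comment. Block comments, character literals, and raw strings are all ignored here, which is the sort of gap that makes up the remaining 20%.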

The gap in difficulty between 80% and 100% is one reason that most IDEs have syntax highlighting for C++ but Visual C++ still doesn't have C++ refactoring support. For highlighting, a few mistakes are probably OK. When you're refactoring, you need to really understand variable scope in different namespaces and all sorts of pointer stuff too.

+1 from me, too, for providing a direct link to the Dragon Book. — Android Eve, Aug 19 '10 at 01:08

score 2 · Answer 2 · answered Aug 18 '10 at 23:56

In order to correctly highlight a language, you have to build a parse tree. This requires first tokenizing the string, and then either performing a top-down or a bottom-up parse. Afterwards, something walks the tree and highlights the portions of the original string corresponding to nodes of a certain sort.
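
The three steps can be sketched in Python for a toy expression language (names, numbers, `+`, `-`, and parentheses). The grammar and the function names `tokenize`, `parse`, and `walk` are assumptions made for this illustration. A real highlighter would also keep source offsets on every node so it can color the original text:

```python
import re

def tokenize(text):
    """Step 1: split text into (kind, value) tokens, skipping whitespace."""
    tokens = []
    for match in re.finditer(r"\d+|[A-Za-z_]\w*|\S", text):
        value = match.group()
        if value[0].isdigit():
            kind = "number"
        elif value[0].isalpha() or value[0] == "_":
            kind = "name"
        else:
            kind = "op"
        tokens.append((kind, value))
    return tokens

def parse(tokens):
    """Step 2: top-down (recursive descent) parse of
    expr := term (('+' | '-') term)*   term := number | name | '(' expr ')'"""
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        kind, value = tokens[pos]
        pos += 1
        if kind in ("number", "name"):
            return (kind, value)
        if value == "(":
            node = expr()
            if pos >= len(tokens) or tokens[pos][1] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return node
        raise SyntaxError(f"unexpected token {value!r}")

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos][1] in ("+", "-"):
            op = tokens[pos][1]
            pos += 1
            node = ("binop", op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos][1]!r}")
    return tree

def walk(node, out):
    """Step 3: walk the tree in order, recording a highlight class per node."""
    if node[0] == "binop":
        _, op, left, right = node
        walk(left, out)
        out.append(("operator", op))
        walk(right, out)
    else:
        out.append(node)
    return out

tree = parse(tokenize("a + (2 - b)"))
print(tree)
print(walk(tree, []))
```

Note that the parentheses shape the tree but do not appear in it, which is why a tree walk alone cannot color them without stored offsets.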

To really understand this, you're going to have to read a book on compiler design/programming language fundamentals. The relevant topics are tokenizers, parsing, and grammars.

Borealid

"Any college course on X" is a good start for a *lot* of questions here, but it's not typically a very helpful answer since the askers aren't often in a position to take such a course. If they were, they could just go ask the professor instead of hoping us random geeks on the internet will feel like answering them. – Ken Aug 19 '10 at 00:02
Haha, well said Ken. I would love to have the opportunity to take a college course on compilers, but that is not possible in my situation. @Borealid - I am familiar with parse trees. My question is, once I am past that part, is there something like a universal XML Schema or code-structuring hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, etc.? Or am I to become a syntax master for every language I wish to parse accurately? – stupidkid Aug 19 '10 at 00:30
@stupidkid: Tokenization is dealing with a language's *syntax*. Parsing is dealing with its *semantics*. XML represents a universal syntax. There will never be and cannot be universal semantics - they are meaning. What a "logic operator" is in one language is dependent on the language. So yes, you have to build a different parser for each language you want to understand. Take a look, however, at "parser generators" like Bison. You feed them an abstract description of the language's grammar, and they spit out C source for a parser. – Borealid Aug 19 '10 at 00:36
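
For highlighting-level accuracy only, one way to keep the per-language work small is a table of lexical rules per language driving a shared scanner. The sketch below is in Python. The `LANGUAGES` table, its keys, and the function `classify_words` are hypothetical and deliberately incomplete. This is not a universal format, only an illustration of the idea:

```python
import re

# Hypothetical per-language descriptions; real keyword lists are much longer.
LANGUAGES = {
    "java": {"keywords": {"class", "int", "return", "if", "else"}, "line_comment": "//"},
    "python": {"keywords": {"def", "return", "if", "else"}, "line_comment": "#"},
}

def classify_words(source, language):
    """Label comments, strings, keywords, and identifiers using the language's table."""
    spec = LANGUAGES[language]
    pattern = re.compile(
        r"(?P<comment>" + re.escape(spec["line_comment"]) + r"[^\n]*)"
        r'|(?P<string>"(?:\\.|[^"\\\n])*")'
        r"|(?P<word>[A-Za-z_]\w*)"
    )
    result = []
    for match in pattern.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "word":
            # Only the table decides whether a word is a keyword.
            kind = "keyword" if text in spec["keywords"] else "identifier"
        result.append((kind, text))
    return result

print(classify_words("int x = 1; // if", "java"))
print(classify_words("if x: # int", "python"))
```

The same text is labeled differently depending on which table is selected (`int` is a keyword in the first call and would be an ordinary identifier in the second), so the language-specific knowledge still has to come from somewhere.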
