
I need to extract all string literals from a given C# file. All conditional compilation constants (e.g. #if DEBUG) are assumed to be false, and the file can be assumed to be syntactically correct. Both single-line ("a\u1000b") and verbatim (@"x""\y") literals should be supported.

First I tried to use regular expressions, but then realized that I need to correctly handle single- and multi-line comments and logical expressions in #if directives.

So, before I start writing my own C# lexer, I would like to ask about existing solutions.

Nik Z.
  • 3
    Did you look at the [Roslyn CTP](http://msdn.microsoft.com/en-us/vstudio/roslyn.aspx)? – Oded Jul 19 '13 at 16:35
  • 1
    Yes, I looked at Roslyn. But it is not in production yet, and it seems too heavyweight for my purposes. I do not need a full-fledged C# parser. I am looking for something like a page of code or so. – Nik Z. Jul 19 '13 at 16:58
  • 1
    You need more of a parser than you think. If you want to assume that all conditional compilation symbols are false, then you have to parse the code. At least parse enough of it to eliminate any code that's surrounded by `#if DEBUG`, etc. You also have to handle comments (single-line, multi-line, XML comments), and handle embedded quotes and the many weird ways that strings can be constructed. You can spend days writing that, or you can spend a few hours learning how to leverage Roslyn so that it does the work for you. – Jim Mischel Jul 19 '13 at 17:05
  • @JimMischel Agreed. #if directives can also be nested, so your code has to keep track of that hierarchy to know exactly which #if an #endif or #else closes. Fortunately it is still not as complicated as in C++ (where you have #if and #ifdef, not to mention macros; C# has no macros). – Csaba Toth Jul 19 '13 at 17:15
  • Maybe you can first run a preprocessor over the code? It would chop off the sections eliminated by directives, and you could then work on the "pure" file that this step outputs. – Csaba Toth Jul 19 '13 at 17:16
  • 3
    To find all string literals with no false positives you do not need a parser but you do need a lexer. I would use Roslyn. The Roslyn lexer was the first thing we got correct. – Eric Lippert Jul 19 '13 at 17:44
  • @CsabaToth: What preprocessor are you talking about? Is there some C# preprocessor that I don't know about? – Jim Mischel Jul 19 '13 at 17:51
  • @JimMischel Maybe C/C++'s preprocessor is usable for the purpose. http://msdn.microsoft.com/en-us/library/ms924239.aspx – Csaba Toth Jul 19 '13 at 18:09
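
For illustration, the Roslyn lexer approach suggested in the comments can be sketched in roughly a page of code. This is a minimal sketch, not a definitive implementation: it assumes the released `Microsoft.CodeAnalysis.CSharp` NuGet package (rather than the 2013 CTP discussed above), and the class name `StringLiteralExtractor` and the command-line argument handling are hypothetical. Because no preprocessor symbols are supplied, regions such as `#if DEBUG` are treated as inactive and their contents are not tokenized; comments are trivia, so strings inside them are skipped as well.

```csharp
// Assumption: the project references the Microsoft.CodeAnalysis.CSharp package.
using System;
using System.IO;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

// Hypothetical helper: prints every string literal found in one C# file.
static class StringLiteralExtractor
{
    static void Main(string[] args)
    {
        // args[0] is assumed to be the path of the C# file to scan.
        string source = File.ReadAllText(args[0]);

        // The default parse options define no preprocessor symbols, so
        // DEBUG and every other conditional symbol evaluates to false.
        SyntaxTree tree = CSharpSyntaxTree.ParseText(
            source, options: CSharpParseOptions.Default);

        // DescendantTokens does not descend into trivia by default, so
        // comments and inactive #if regions contribute no tokens.
        var literals = tree.GetRoot()
            .DescendantTokens()
            .Where(t => t.IsKind(SyntaxKind.StringLiteralToken));

        foreach (SyntaxToken token in literals)
        {
            // Text is the literal as written in the file (including quotes
            // and any @ prefix); ValueText is the unescaped string value.
            Console.WriteLine($"{token.Text} => {token.ValueText}");
        }
    }
}
```

Both regular literals such as `"a\u1000b"` and verbatim literals such as `@"x""\y"` are lexed as `StringLiteralToken`. Interpolated strings, which were added to the language later, are represented by different token kinds and would need separate handling.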

0 Answers