Please give me an idea of how to extract all string literals from a Delphi file. Literals enclosed in quotes are no problem, but a string literal can also be written as hash codes (character codes such as #1072), and a single literal can combine quoted parts and hash codes. For example:

#1072#1073#1074#1075#1076', qwerty'#1072#1073#1074#1075#1076
'qwerty, '#1074#1075#1076
#1072#1073#1074#1075#1076', qwerty'
#1072#1073#1074#1075#1076', qwerty#1076'

I need to extract all of these literals. I just need an idea; I'll implement it in Python.

GolovDanil
  • Use a Pascal parser. If you can't find existing code for Python write one yourself. – David Heffernan Nov 24 '16 at 08:10
  • I can write code, but I need an idea how to extract string literals. I can extract hashes or parts in quotes, but I need STRING LITERALS which may consist of different parts – GolovDanil Nov 24 '16 at 08:15
  • You need a parser. A non-trivial task. Certainly far too broad for a Stack Overflow question. And not something that can be achieved with regex. – David Heffernan Nov 24 '16 at 08:18
  • You're not listening. The *idea how to extract string literals* is to write a parser that can go through the source code and **parse it** to identify the various parts of the code, so it can **identify** those string literals. Once you have that, you can then have other code call the parser and ask it for the string literals, and you can do whatever you want with them at that point. But you need the **parser** first. – Ken White Nov 24 '16 at 14:15
  • Use DelphiAST, see: https://github.com/RomanYankovsky/DelphiAST – Johan Nov 24 '16 at 16:18

1 Answer

For your limited use case, you don't need anything as formal as a parser. Regular expressions are sufficient.

It's not hard to write a regular expression that matches conventional quoted strings: '[^'\r\n]*'. Likewise, it's not complicated to write an expression to match a character code, as long as you're not concerned about limiting the range of numbers matched*: #(\d+|\$[0-9A-Fa-f]+). Once you have those building blocks, you only need to put them together:

('[^\n\r']*'|#(\d+|\$[0-9A-Fa-f]+))+
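Since the question mentions Python, here is a minimal sketch of how that combined expression could be applied with the standard re module. The names STRING_LITERAL and find_literals are made up for illustration, and the groups are written as non-capturing so that each match is simply the whole literal. It does not yet deal with comments.

```python
import re

# The combined expression from above: one or more quoted parts or
# character codes (#decimal or #$hex) with nothing in between.
# STRING_LITERAL and find_literals are hypothetical names for this sketch.
STRING_LITERAL = re.compile(r"(?:'[^\n\r']*'|#(?:\d+|\$[0-9A-Fa-f]+))+")


def find_literals(source):
    """Return every run of quoted parts and character codes in source."""
    return [match.group(0) for match in STRING_LITERAL.finditer(source)]


if __name__ == "__main__":
    sample = "s := #1072#1073', qwerty'#1076;"
    for literal in find_literals(sample):
        print(literal)
```

Because the expression repeats its alternatives, a doubled apostrophe inside a literal, as in 'it''s', is matched as two adjacent quoted parts and so still comes back as one literal.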

That will work for most code, but it's not enough for arbitrary Delphi files. That regular expression can match inside comments. Even worse is that it may match text that appears to straddle a comment. For example:

{ 'foo{}'

That's a comment followed by a single quote, not the string literal foo{}. You can work around this by augmenting your regular expression to match comments, too. Then, as you work through your results, skip the comments.
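A rough sketch of that workaround in Python follows. It assumes the three usual Delphi comment forms ({ ... }, (* ... *) and // to end of line); the names TOKEN and extract_literals are hypothetical. Comments and literals are alternatives in one expression, so whichever starts first in the source consumes its text, and the comment matches are then thrown away.

```python
import re

# Comments are matched as well as literals so that quotes inside a comment
# are consumed by the comment alternative. TOKEN and extract_literals are
# hypothetical names; the comment forms covered are an assumption of this sketch.
TOKEN = re.compile(
    r"""
    (?P<comment>
        \{[^}]*\}          # brace comment (also matches compiler directives)
      | \(\*.*?\*\)        # parenthesis-star comment
      | //[^\r\n]*         # line comment
    )
  | (?P<literal>
        (?:'[^\n\r']*'|\#(?:\d+|\$[0-9A-Fa-f]+))+
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def extract_literals(source):
    """Return string literals in source, skipping anything inside comments."""
    return [
        match.group("literal")
        for match in TOKEN.finditer(source)
        if match.group("literal") is not None
    ]


if __name__ == "__main__":
    code = "{ 'foo{}'\nx := 'bar'#13#10; // 'not this'\n"
    print(extract_literals(code))
```

Note that # has to be escaped in the pattern because re.VERBOSE otherwise treats it as the start of a regex comment. Compiler directives such as {$IFDEF ...} are skipped along with ordinary brace comments, which should be harmless if only the literals are of interest.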

* You shouldn't need to worry about the number range because you can expect to run your program against valid Delphi code.

Rob Kennedy