2

I'm working with a C++ code base. Right now I'm using C++ code that calls a Lua script to scan the entire code base and, hopefully, return a list of all the strings used in the program.

The strings in question are always wrapped in a JUCE macro called TRANS. Here are some examples from which a string should be extracted:

TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")

I'm sure you can imagine other string variants that could occur in a large code base. I'm writing a tool that generates JUCE translation files, to automate the process as much as possible.

Here is how far I've gotten with pattern matching to find these strings. I read the source code into a Lua string:

path = ...

-- Open the file and read its entire contents into a string
file = io.open(path, "r")
str = file:read("*all")
file:close()

and called

for word in string.gmatch(str, 'TRANS%s*%b()') do print(word) end

which finds text that starts with TRANS and is followed by balanced parentheses. This gets me the full macro call, including the parentheses, and from there I figured it would be easy to trim off what I don't need and keep just the string value.

However, this doesn't work for strings that unbalance the parentheses. For example, TRANS(")") returns TRANS(") instead of TRANS(")").
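A minimal check of this behaviour, using a hypothetical one-line `src` string in place of the real file contents:

```lua
-- Hypothetical sample input standing in for the source file contents.
local src = 'TRANS(")")'

-- %b() counts every "(" and ")" it sees, including the ")" inside the
-- string literal, so the match ends one character too early.
local macro = src:match('TRANS%s*%b()')
print(macro)  -- prints TRANS(")
```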

I revised my pattern to

for word in string.gmatch(str, 'TRANS%s*(%s*%b""%s*') do print(word) end

where the pattern should start with TRANS, followed by zero or more spaces. Then it should have a ( character followed by zero or more spaces. Now that we are inside the parentheses, we should have a balanced pair of " marks, followed by zero or more spaces, and finally a closing ). Unfortunately, this does not return a single match when used. And even if it worked as I expected, a string can contain \", which would unbalance the quotes.

Any advice on extracting these strings? Should I keep trying to find a pattern, or should I write a direct algorithm? Do you know why my second pattern returned no strings? Any other advice is welcome. I'm not looking to cover 100% of all possibilities, but getting close to 100% would be awesome. Thanks! :D

Colton Phillips

2 Answers

1

I love Lua patterns as much as anyone, but you're bringing a knife to a gun fight. This is one of those problems where you really don't want to code the solution as regular expressions. To deal correctly with double-quote marks and backslash escapes, you want a real parser, and LPEG will manage your needs nicely.

Norman Ramsey
  • How will LPEG give me an advantage over typical Lua patterns? There is a lot to read; can you point me to some specific functions that might be of interest? Also, are you sure patterns won't suffice? I don't believe I need to cover 100% of all boundary conditions; that's not important for the tool I'm working on. Right now I'm wondering whether it might be a better idea to do this in steps rather than with a single pattern... – Colton Phillips Jul 05 '11 at 17:00
  • LPEG is a complete parser system. You would write a grammar that matches your TRANS macro instances, the balanced parentheses, and the full syntax of a C string literal, including the balanced `"` marks, backslash escapes, and implicit concatenations. LPEG would then let you apply the grammar to your source text and pull out what you need to know. It's a big package, but well worth it. Think of it as giving you everything that LEX and YACC provide, and more. – RBerteig Jul 07 '11 at 00:44
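The rules listed in the comment above (the TRANS call, quoted literals, backslash escapes, implicit concatenation) can also be sketched as a small hand-written scanner in plain Lua. This is not LPEG and not anyone's posted solution, only a rough illustration of the same rules. The names `read_literal` and `extract_trans` are hypothetical. It assumes escape sequences should be kept as written rather than decoded, that line endings are plain `\n`, and it does not skip C++ comments.

```lua
-- Read one C string literal starting at the opening quote at index i.
-- Returns the raw contents (escape sequences kept as written) and the index
-- just past the closing quote, or nil if the literal is unterminated.
local function read_literal(src, i)
  local buf = {}
  i = i + 1                           -- step over the opening quote
  while i <= #src do
    local c = src:sub(i, i)
    if c == '\\' then
      local nxt = src:sub(i + 1, i + 1)
      if nxt ~= '\n' then             -- backslash-newline is a line splice: drop it
        buf[#buf + 1] = c .. nxt      -- any other escape is kept verbatim
      end
      i = i + 2
    elseif c == '"' then
      return table.concat(buf), i + 1
    else
      buf[#buf + 1] = c
      i = i + 1
    end
  end
  return nil
end

-- Collect the argument of every TRANS(...) call whose argument consists only
-- of string literals; adjacent literals are joined, as a C compiler would.
local function extract_trans(src)
  local results, pos = {}, 1
  while true do
    -- %f[%w_] keeps longer identifiers such as MYTRANS from matching.
    local _, open = src:find('%f[%w_]TRANS%s*%(', pos)
    if not open then break end
    local i, parts = open + 1, {}
    while true do
      i = src:match('^%s*()', i)      -- skip whitespace between literals
      if src:sub(i, i) ~= '"' then break end
      local text, after = read_literal(src, i)
      if not text then break end
      parts[#parts + 1] = text
      i = after
    end
    if #parts > 0 and src:sub(i, i) == ')' then
      results[#results + 1] = table.concat(parts)
    end
    pos = open + 1
  end
  return results
end

-- Usage with a hypothetical two-call sample:
local sample = 'x = TRANS("a" "b"); y = TRANS("\\"");'
for _, s in ipairs(extract_trans(sample)) do print(s) end
```

Scanning character by character is what lets the closing quote be told apart from an escaped `\"`, which a single `%b""` item cannot do. Calls whose argument is not a plain literal, such as `TRANS(someVariable)`, are skipped.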
0

In the second case, you forgot to escape parentheses. Try

for word in string.gmatch(str, 'TRANS%s*%(%s*(%b"")%s*%)') do print(word) end

lhf
  • Ahh, I didn't know you needed an escape character for ( marks. And I like the way your answer gives the string in an even better format than I had before... Unfortunately, this still fails for strings such as TRANS("\"") or any string with an escaped quotation mark in it... – Colton Phillips Jul 05 '11 at 01:31
  • @Colton, I don't think you'll be able to get to 100%, or even near it, with Lua patterns. You need a real lexer that understands C string rules. I suggest using the C preprocessor (`cpp` or `gcc -E`) with a definition of `TRANS` that makes it easier to postprocess the result. I did a quick test with `#define TRANS(x) BEGIN x END` and it works fine on your examples. – lhf Jul 05 '11 at 01:44
  • I'm not exactly sure what you mean by your response. I'm aware of what a C preprocessor macro is, as that is what TRANS is, but I don't know what you are getting at with this quick test #define. Could you go into greater detail? :) – Colton Phillips Jul 05 '11 at 17:02
  • @Colton, I was proposing a totally different, batch approach to the problem, but I guess that it does not fit your requirement of doing this embedded in a C++ app. – lhf Jul 05 '11 at 17:56
  • Ahh... Yes, this C++ app is based on this open source project I made called FileDigger. http://code.google.com/p/file-digger/source/browse/#svn%2Ftrunk%2FFileDigger Since I already had this nifty interface for recursive operations on files I figured it would be a pretty easy task to do the parsing in Lua, and a good learning exercise. – Colton Phillips Jul 05 '11 at 18:33
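Both the escaping fix in this answer and the escaped-quote failure mentioned in the first comment can be checked with a short script. This is a sketch only; `src` is a hypothetical sample standing in for the real file contents.

```lua
-- Hypothetical sample input standing in for the source file contents.
local src = [[
TRANS("Normal")
TRANS ( "With spaces" )
TRANS(")")
TRANS("\"")
]]

-- Original pattern from the question: the bare "(" opens a capture instead
-- of matching a literal parenthesis, so the pattern looks for a quote right
-- after TRANS and its spaces, and nothing in this sample fits.
local unescaped = 0
for _ in string.gmatch(src, 'TRANS%s*(%s*%b""%s*') do
  unescaped = unescaped + 1
end
print(unescaped)  -- prints 0

-- Corrected pattern: %( and %) match literal parentheses, and the capture
-- keeps only the quoted text (including its surrounding quotes).
local found = {}
for word in string.gmatch(src, 'TRANS%s*%(%s*(%b"")%s*%)') do
  found[#found + 1] = word
  print(word)
end
print(#found)     -- prints 3: TRANS("\"") is not matched
```

The last call is missed because `%b""` stops at the first `"` after the opening one, which here is the escaped quote, so the following `%)` has nothing to match.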