
I am trying to parse Java source to get method names, their invocations, variable names, etc. I was looking for a pre-built or extensible module in Python and stumbled upon plyj (https://github.com/musiKk/plyj). I want to find a method, get its code, and do some string processing on it based on some conditions. But I cannot figure out how to use it; the example is too vague. Can anyone point me to a good usage example?

Also, could you let me know whether antlr3 (https://github.com/antlr/antlr3) is more usable (with an example)? I am new to these modules and do not know which one to go with. Performance is not a concern; I just want to compare them on functionality and ease of use.

Thanks!

krish7919
  • If you want accurate information about types, you'll need a full Java name and type resolver, which antlr3 will not give you. If plyj is really just a parser (as I suspect), it won't give you that information either. This type information is hard to derive; consider how much of the Java reference manual is devoted to explaining what the symbols mean. You might be able to get unqualified class and method names from a raw parse. Is that enough? (To find a method, you may already need to do full name and type resolution; otherwise what does A:B:C mean?) – Ira Baxter Jan 23 '14 at 04:58
  • @Ira: I do not understand you. Please elaborate. – krish7919 Jan 23 '14 at 05:06
  • OK. You want to look up the method names in class A:B:C. How exactly are you going to find out where C is, without knowing where B is defined, and processing the contents of package B to find C's declaration? It gets a lot worse with generics. – Ira Baxter Jan 23 '14 at 05:43
  • No, I am not going to be that complex! What I want is a script that takes a .java file as input and tells me the methods in it, gets me a method's code, and gets me the class variable names. In other words, I could look for methods using regexes, but that would be too complex, so I want to use one of these parsers to do it. – krish7919 Jan 23 '14 at 06:09
  • Either you are that complex, or you are willing to have some kind of heuristic (e.g., a regex or just a syntax parser) solution. – Ira Baxter Jan 23 '14 at 07:42
  • @Krish: Hi, plyj developer here. I just found this question by chance. If you still have problems, you could have a look at the tests. They don't cover everything but should be enough to give you an idea. `model.py` is very important too. Issue #18 is dedicated to your problem. – musiKk Apr 22 '14 at 12:15
  • @Krish: I added two sample programs that print some symbols of a provided source file. – musiKk Apr 23 '14 at 21:46

1 Answer


If you'll settle for a heuristic solution, then get whichever one has a reliable Java parser that builds an AST (my understanding is that ANTLR is pretty good for Java), parse the source, and build custom code to crawl the tree data structure down to find the class declaration, and crawl one layer deeper to get to the methods/members. [I don't know if plyj has a tested Java grammar, or builds ASTs].
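The crawl described above can be sketched without committing to either library. This is a minimal illustration only: the `Node` class, its `kind`/`name`/`children` fields, and the hand-built tree are hypothetical stand-ins, not the node types that ANTLR or plyj actually produce. With a real parser you would substitute its own tree classes and attribute names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical AST node; real parsers define their own node types."""
    kind: str                      # e.g. "file", "class", "method", "field"
    name: str = ""
    children: List["Node"] = field(default_factory=list)


def find_classes(node):
    """Yield every node of kind 'class', at any depth in the tree."""
    if node.kind == "class":
        yield node
    for child in node.children:
        yield from find_classes(child)


def members(class_node):
    """Return (method names, field names) declared directly in class_node."""
    methods = [c.name for c in class_node.children if c.kind == "method"]
    variables = [c.name for c in class_node.children if c.kind == "field"]
    return methods, variables


def summarize(root):
    """Map each class name to its (methods, fields) pair."""
    return {cls.name: members(cls) for cls in find_classes(root)}


# Hand-built tree standing in for the result of parsing one .java file.
tree = Node("file", children=[
    Node("class", "Account", [
        Node("field", "balance"),
        Node("method", "deposit"),
        Node("method", "getBalance"),
    ]),
])

for class_name, (methods, variables) in summarize(tree).items():
    print(class_name, "methods:", methods, "fields:", variables)
```

Only the direct children of each class node are inspected, which matches the "one layer deeper" step: that is where the methods and members live.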

For the ANTLR solution, at least, it should be pretty easy to print out the names of those. It will not be so easy to print the bodies; to my knowledge, ANTLR has no easy way of printing out the subtree at a point as text. And if you could, you might find the comments have vanished, having been eliminated during lexing. You might be able to extract line numbers from the tree nodes, and then go back to the original file and print out line number ranges to get method bodies. (Most parser generators, even if they build ASTs, do not support printing an arbitrary subtree, so I assume that plyj isn't different.)

This won't handle multiple classes per file, or nested classes very well.
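The line-number idea can be sketched in plain Python. The assumptions here are loud ones: `start_line` is taken to be the 1-based line of the method declaration as reported by whatever parser you use (how you obtain it depends on the library), the method is assumed to have a body, and the end of the body is found by naive brace counting, which braces inside string literals or comments will confuse. `extract_method` is a hypothetical helper, not part of ANTLR or plyj.

```python
def extract_method(source, start_line):
    """Return the source text of the method declared on start_line (1-based).

    Heuristic: count braces from the declaration line until they balance.
    Returns None if the braces never balance.
    """
    lines = source.splitlines()
    depth = 0
    seen_open = False
    for i in range(start_line - 1, len(lines)):
        for ch in lines[i]:
            if ch == "{":
                depth += 1
                seen_open = True
            elif ch == "}":
                depth -= 1
        if seen_open and depth == 0:
            # Slice the original text, so comments inside the body survive.
            return "\n".join(lines[start_line - 1:i + 1])
    return None


java = """class A {
    int add(int a, int b) {
        // comment kept because we slice the raw text
        if (a > 0) { return a + b; }
        return b;
    }
}"""

print(extract_method(java, 2))
```

Because the text is sliced from the original file rather than regenerated from the tree, comments inside the method are preserved.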

There are tools that can do this reliably and accurately but are more effort to put in place.

Ira Baxter