I'm interested in gathering some statistics over a large corpus of Java code that I have access to. Some statistics I'd be interested in might include how often certain methods/classes are used, how often certain packages are imported, and so on.

My first thought was to use javaparser, but that library only supports up through Java 1.5, and most of the code I have is in 1.6 or greater.

Is there a library that will give me an accurate AST from some Java code (i.e., can I ask javac for it somehow?), or is there a better way of approaching this problem (examining the bytecode, perhaps)?

Alex Reinking
  • I'd dig into pmd's [how it works](http://pmd.sourceforge.net/pmd-5.1.0/howitworks.html) and see if you can adapt that to what you are looking for. That said, it's not a small thing you're thinking about. –  Oct 23 '14 at 02:41
  • I have NO idea how to solve this, but you get a star from me cuz I would love to find out what you come up with. If you do find an answer, please post it as an answer! It would help out a LOT of people on the internet like you looking for a solution! – DreadHeadedDeveloper Oct 23 '14 at 02:58
  • @DreadHeadedDeveloper I'll be sure to post back when I figure something out. If only this were as easy as it is in Haskell... (thanks haskell-src-exts!) – Alex Reinking Oct 23 '14 at 03:03
  • Your question "How often is a certain method used" is tantamount to asking "how many places call this method?". For this, you need a Java call graph. See my answer: http://stackoverflow.com/a/26519597/120163 – Ira Baxter Oct 23 '14 at 05:10
  • How about this link http://www.programcreek.com/2012/04/represent-a-java-file-as-an-astabstract-syntax-tree/ – Sai Ye Yan Naing Aye Oct 23 '14 at 09:42

1 Answer

I don't know about an accurate AST, but you can certainly read the bytecode using libraries like ASM or BCEL, and scanning those data structures for method calls would be reasonably straightforward. Of course, that may be after some early optimization has been performed, so it may not directly reflect the source, and it's before the JIT, so it may not directly reflect what's actually running.

Another solution would be to run the code under the control of a profiler, which could give you either relative or absolute frequency of invocation from various places.

Neither of these would give you the number of imports; imports are purely a source-level, syntactic-sugar detail and are not recorded in the bytecode. But for that same reason, I don't think that's actually a meaningful number.
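
If source-level counts are still wanted, they have to come from the syntax tree rather than the bytecode. As a minimal sketch, assuming the code runs on a full JDK (not just a JRE), javac's own parser can be reached through `javax.tools` and the Compiler Tree API in `com.sun.source`. The class `SourceStats` and its method names below are hypothetical, chosen only for illustration. It only parses, so it counts call *names* as written in the source and does not resolve which method a call site actually invokes.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.IdentifierTree;
import com.sun.source.tree.ImportTree;
import com.sun.source.tree.MemberSelectTree;
import com.sun.source.tree.MethodInvocationTree;
import com.sun.source.tree.Tree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;

import java.io.IOException;
import java.net.URI;
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

// Hypothetical helper; names are illustrative, not part of any library.
class SourceStats {

    /** Wraps a source string so javac can read it without touching the disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String fileName, String code) {
            super(URI.create("string:///" + fileName), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Parses one source file (syntax only, no type resolution). */
    static Iterable<? extends CompilationUnitTree> parse(String fileName, String code)
            throws IOException {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            // Assumption: a JDK is required; a bare JRE has no system compiler.
            throw new IllegalStateException("No system Java compiler available");
        }
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null,
                Collections.singletonList(new StringSource(fileName, code)));
        return task.parse();
    }

    /** Counts each import declaration by its qualified name. */
    static Map<String, Integer> countImports(String fileName, String code)
            throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (CompilationUnitTree unit : parse(fileName, code)) {
            for (ImportTree imp : unit.getImports()) {
                counts.merge(imp.getQualifiedIdentifier().toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Counts method-call names as written (e.g. "add" for xs.add(...)). */
    static Map<String, Integer> countCallNames(String fileName, String code)
            throws IOException {
        final Map<String, Integer> counts = new TreeMap<>();
        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitMethodInvocation(MethodInvocationTree call, Void unused) {
                Tree target = call.getMethodSelect();
                if (target instanceof MemberSelectTree) {
                    // Qualified call such as receiver.method(...)
                    String name = ((MemberSelectTree) target).getIdentifier().toString();
                    counts.merge(name, 1, Integer::sum);
                } else if (target instanceof IdentifierTree) {
                    // Unqualified call such as method(...)
                    String name = ((IdentifierTree) target).getName().toString();
                    counts.merge(name, 1, Integer::sum);
                }
                // Keep descending so nested calls in arguments are counted too.
                return super.visitMethodInvocation(call, unused);
            }
        };
        for (CompilationUnitTree unit : parse(fileName, code)) {
            scanner.scan(unit, null);
        }
        return counts;
    }
}
```

To cover a whole corpus, the same two methods could be called once per file and the maps merged. Because nothing is resolved, two unrelated methods that share a name land in the same bucket; telling them apart would need attribution (or a call graph, as the comments point out).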

keshlam
  • This approach might make it relatively easy to find call *sites*. The hard part of the problem is determining which specific target methods a call site invokes. For that, you need a call graph. Yes, the raw information to construct a call graph is in ASM or BCEL (just as it is in the source code, too). Extracting it isn't easy because you need to do a points-to analysis either first or simultaneously. – Ira Baxter Oct 23 '14 at 08:43
  • For invocation patterns, a good profiler may be your best bet -- since your assumptions during hand-analysis of the code may not match what actually happens when the conditionals run over real-world code, and subclassing may introduce layers/alternatives you didn't expect. – keshlam Oct 23 '14 at 13:40