
I'm looking for steps/libraries/approaches to solve the following problem.

  1. Given a source file in a programming language, I need to parse it and subdivide it into components.

Example: given a Java file, I need to find the following in it:

  1. List of imports
  2. Classes present in it
  3. Attributes in each class
  4. Methods in it, along with their parameters, if any, etc.
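To make the extraction step concrete, here is a deliberately naive sketch (class and pattern names are my own invention) that pulls imports, class names, and method names out of Java source text with regular expressions. A real implementation should use a proper parser, as the answer below discusses; regexes will miss nested classes, comments, generics, and much more, but they illustrate what "subdividing into components" means:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive, illustration-only extraction of components from Java source text.
// A real tool should use a full Java parser; these patterns are far too
// simple for production use (they ignore comments, strings, nesting, ...).
public class ComponentSketch {
    public static final Pattern IMPORT =
        Pattern.compile("^import\\s+([\\w.]+);", Pattern.MULTILINE);
    public static final Pattern CLAZZ =
        Pattern.compile("\\bclass\\s+(\\w+)");
    public static final Pattern METHOD =
        Pattern.compile("\\b(?:public|private|protected)\\s+\\w+\\s+(\\w+)\\s*\\(");

    // Collect every capture of group 1 for the given pattern.
    public static List<String> findAll(Pattern p, String src) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(src);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    public static void main(String[] args) {
        String src = "import java.util.List;\n"
                   + "public class Sample {\n"
                   + "  private int count;\n"
                   + "  public void run(String arg) {}\n"
                   + "}";
        System.out.println(findAll(IMPORT, src)); // imports
        System.out.println(findAll(CLAZZ, src));  // class names
        System.out.println(findAll(METHOD, src)); // method names
    }
}
```

Each extracted (keyword, component-kind, file) triple would then be fed into the index described below.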

I need to extract these and store them separately. Why do I want to do this?

  1. I want to build an inverted index on top of these components.

Example queries to the inverted index:

  1. Find the list of files with class name: Sample
  2. Find the positions where variable XXX is used within the class AAA.

I need to support queries like the above.

So my plan is: given a file, if I extract these components from it, it would be easy to build an inverted index on top of them.

Example: Sample -- Class -- Sample.java (keyword -- component -- file name). I want to build an inverted index like the above.
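The index itself is the simple part. Here is a minimal sketch (the class name `CodeIndex` and the kind strings are my own choices) of an inverted index mapping a keyword to postings of (component kind, file name), which is enough to answer "find files where Sample occurs as a class":

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal inverted-index sketch: keyword -> list of (kind, file) postings.
// Kinds like "class" or "method" come from the extraction step.
public class CodeIndex {
    public record Posting(String kind, String file) {}

    private final Map<String, List<Posting>> index = new HashMap<>();

    public void add(String keyword, String kind, String file) {
        index.computeIfAbsent(keyword, k -> new ArrayList<>())
             .add(new Posting(kind, file));
    }

    // e.g. filesWith("Sample", "class") -> files defining a class Sample
    public List<String> filesWith(String keyword, String kind) {
        List<String> out = new ArrayList<>();
        for (Posting p : index.getOrDefault(keyword, List.of()))
            if (p.kind().equals(kind)) out.add(p.file());
        return out;
    }

    public static void main(String[] args) {
        CodeIndex ix = new CodeIndex();
        ix.add("Sample", "class", "Sample.java");
        ix.add("Sample", "method", "Other.java");
        System.out.println(ix.filesWith("Sample", "class")); // [Sample.java]
    }
}
```

Positional queries ("where is variable XXX used in class AAA") would extend each posting with a line/column position, but the structure is the same.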

I see this implemented in many IDEs, like IntelliJ. What I'm interested in is how much effort it would take to build something like this. I want to try implementing it for at least one language.

Thanks in advance.

Salaikumar

1 Answer


You can try to do this with "just" a parser; for your specific example, that might be enough.

But you'll need a parser for each language. If you stick to just Java, you can find Java parsers pretty easily; just reuse one, there is little point in you reinventing one more set of grammar rules to describe Java.

For more than one language, this starts to get tricky. You can:

  • try to find a separate parser for each language. This may be sort of successful for mainstream languages. As you get to less well known languages, these get a lot harder to find. If you succeed, you'll have the problem that the parsers are likely incompatible technology; now gluing them together to collectively collect your index information is going to be a mess.
  • pick one parsing technology and get grammars for all the languages you care about. You have only two realistic choices: YACC/Bison, and ANTLR. As a practical matter, YACC and Bison have been used to implement LOTS of languages... but the grammar files are not collected in one place, so they are hard to find. ANTLR at least has a single grammar repository you can find at its web site. So that might kind of work.

It's going to be quite an effort to assemble all of these into an integrated whole.

A complication is that you may want more than just raw syntax; you might want to know the meaning of the symbols, and for each symbol, precisely where it is defined in which file. After all, you want your index to be accurate at scale, and this will require differentiating foo the variable name from foo the function name. Arguably you need symbol tables. As a general rule, this is where pure-parsing of languages breaks down; there is serious Life After Parsing.
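To illustrate why raw syntax is not enough: resolving what a name means requires a symbol table with scoping, so that `foo` defined in an inner scope is distinguished from `foo` defined in an outer one. A minimal sketch (class and method names are my own, and real symbol tables track far more: types, visibility, overloads):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of a scoped symbol table. The same name can denote different
// symbols in different scopes, which a purely syntactic index cannot
// distinguish; an accurate index needs resolution like this.
public class SymbolTable {
    public record Symbol(String name, String kind, String definedIn) {}

    // Innermost scope at the head of the deque.
    private final Deque<Map<String, Symbol>> scopes = new ArrayDeque<>();

    public void enterScope() { scopes.push(new HashMap<>()); }
    public void exitScope()  { scopes.pop(); }

    public void define(String name, String kind, String definedIn) {
        scopes.peek().put(name, new Symbol(name, kind, definedIn));
    }

    // Resolve by searching from the innermost scope outward.
    public Symbol resolve(String name) {
        for (Map<String, Symbol> scope : scopes)
            if (scope.containsKey(name)) return scope.get(name);
        return null;
    }

    public static void main(String[] args) {
        SymbolTable t = new SymbolTable();
        t.enterScope();
        t.define("foo", "function", "Util.java");
        t.enterScope();
        t.define("foo", "variable", "Util.java");
        System.out.println(t.resolve("foo").kind()); // inner foo wins
    }
}
```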

In that case, you want an integrated set of tools for extracting information from the different languages.

Our DMS Software Reengineering Toolkit is such a framework, and has some 40 languages predefined for it. We use something like the OP's suggested process to build indexes of a code base for search tools based on DMS. Building something like DMS is an enormous effort.

Ira Baxter