I thought of removing all whitespaces and linebreaks and then comparing the files. - Would that lead to false positives?
Of course it would.
public int foo() {}
public intf oo() {}
are semantically speaking entirely different beasts, but equal if whitespace is removed. However:
public int foo() {;}
public int foo() {}
are semantically speaking entirely identical. So are:
public int[] foo() {}
public int foo() [] {} // yeah this is legal java syntax.
These are not just semantically identical; most ASTs (the tree-like representation of the source code as emitted by the parser phase of ecj or javac) cannot actually differentiate between these two lines; even syntax-preserving pretty printers will always emit the first of the above 2 even if you write it in the second (admittedly, not stylistically preferred) way.
Basic text analysis is never going to get you there. Java syntax is not the kind of syntax that a few regexes and replace operations will turn into something you can reason about. You need a full parse job.
I see 2 options:
1. Compile the source files to class files, and compare those. Not just byte for byte: you'd want to ensure the class files contain the info you do want (such as param names) but omit the info you don't (such as line numbers; presumably you don't care if someone tosses a blank line in a file, but that would modify the line number table). But class files are A LOT simpler to analyse than source files are.
2. Use ecj or the java grammar of one of the various parser libraries out there and compare the ASTs. This is rather involved, but it is the only truly correct answer, in that it is by far the most flexible: you can define precisely what is and isn't relevant, for any imaginable syntax variation.
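To make the first option (compile, then compare class files) a bit more concrete, here is a minimal sketch. It assumes you run on a JDK rather than a bare JRE (so that `ToolProvider.getSystemJavaCompiler()` is available), on Java 11+ (for `Files.writeString`), and that each source holds a single top-level class. The class name `ClassFileCompare` and its helper methods are made up for illustration. It passes `-g:none` to javac to leave out debug info such as the line number table, and `-parameters` to keep parameter names, then compares the bytes directly. A real tool would probably want to parse the class file and compare selected parts instead.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ClassFileCompare {

    /** Compiles one top-level class in a fresh temp directory and returns its class file bytes. */
    public static byte[] compile(String className, String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("No system compiler found; run this on a JDK");
        }
        try {
            Path dir = Files.createTempDirectory("classcmp");
            Path src = dir.resolve(className + ".java");
            Files.writeString(src, source);
            // -g:none    : no debug info, so blank lines and comments leave no trace
            // -parameters: keep parameter names in the class file
            int exit = javac.run(null, null, null,
                    "-g:none", "-parameters", "-d", dir.toString(), src.toString());
            if (exit != 0) {
                throw new IllegalArgumentException("Compilation failed for " + className);
            }
            return Files.readAllBytes(dir.resolve(className + ".class"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if both sources compile to byte-identical class files. */
    public static boolean sameClassFile(String className, String a, String b) {
        return Arrays.equals(compile(className, a), compile(className, b));
    }

    public static void main(String[] args) {
        String compact = "public class Foo { public int foo(int x) { return x; } }";
        String spaced = "// a comment\npublic class Foo {\n\n  public int foo(int x) {\n    return x;\n  }\n}\n";
        String renamed = "public class Foo { public int foo(int y) { return y; } }";

        // Layout and comments only differ here.
        System.out.println(sameClassFile("Foo", compact, spaced));
        // The parameter name differs here, which -parameters records.
        System.out.println(sameClassFile("Foo", compact, renamed));
    }
}
```

Note the temp directories are not cleaned up; this is only meant to show the shape of the approach.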
One major problem with #1 is that there are syntactic differences that do not end up being significant in class files, so you wouldn't be able to tell them apart. That might be more a 'feature' than a 'bug', but you haven't explained why you want to compare java code, so I can't tell. It certainly closes that door: if you go down this path, you won't ever be able to detect any syntactical differences that do not end up in class files, without a complete rewrite of the project. One obvious candidate for 'code that just does not affect the class file': comments. Also, any annotations with RetentionPolicy.SOURCE. They just... disappear, so any class file based comparison system will not be able to tell.
NB: Reducing any whitespace whose bordering characters are both java-identifier-legal to a single space, and reducing any whitespace where one or both bordering characters aren't (so, start/end of file, a parenthesis, or bracket, or dash, or dot, etcetera) to nothing would at least be a better approach than a straight up 'strip all whitespace', but it wouldn't be enough for e.g. postfix syntax of array brackets [] on method signatures, blank and otherwise effect-free semicolons in between method signatures, comments, \u escapes in strings, and a ton more things that result in different source code but which are, in pretty much all ways that I can imagine are relevant, 100% equivalent.
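For what it's worth, the whitespace rule from the NB could be sketched roughly like this. The class and method names are invented for the example, and it deliberately knows nothing about string literals, comments or \u escapes, so it will happily mangle whitespace inside a string. It is only meant to show what the rule does and does not buy you.

```java
public class WhitespaceNormalizer {

    /**
     * Collapses each run of whitespace to a single space when both neighbours
     * are Java identifier characters, and drops the run entirely otherwise.
     * Purely textual: string literals and comments get no special treatment.
     */
    public static String normalize(String source) {
        StringBuilder out = new StringBuilder();
        int n = source.length();
        int i = 0;
        while (i < n) {
            char c = source.charAt(i);
            if (!Character.isWhitespace(c)) {
                out.append(c);
                i++;
                continue;
            }
            // Find the end of this whitespace run.
            int j = i;
            while (j < n && Character.isWhitespace(source.charAt(j))) {
                j++;
            }
            boolean leftIsIdent = out.length() > 0
                    && Character.isJavaIdentifierPart(out.charAt(out.length() - 1));
            boolean rightIsIdent = j < n
                    && Character.isJavaIdentifierPart(source.charAt(j));
            if (leftIsIdent && rightIsIdent) {
                out.append(' ');
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("public int foo() {}"));   // public int foo(){}
        System.out.println(normalize("public intf oo() {}"));   // public intf oo(){}
        System.out.println(normalize("public int foo() {;}"));  // public int foo(){;}
    }
}
```

So the `int foo` versus `intf oo` false positive goes away, but `{;}` and `{}` still come out as different text even though they mean the same thing, which is exactly the limitation described above.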