We inherited some legacy code that has a lot of code copy/pasted across projects. Is there a way to find these duplicates? PMD can handle a single project.
6 Answers
Summary
There are also CloneDetective, Simian and SimScan. This paper from the International Conference on Software Engineering 2009 compares them along with PMD's CPD.
In detail
One tool that can handle several languages is CloneDetective (based on ConQAT, the Continuous Quality Assessment Toolkit): ABAP, Ada, Java, C#, C/C++, Visual Basic, COBOL, PL/I.
Another tool is Simian, the Similarity Analyser, which identifies duplication in Java, C#, C, C++, COBOL, Ruby, JSP, ASP, HTML, XML, Visual Basic, Groovy source code and even plain text files. It runs on JVM and .NET.
Actually, if you look at .NET, there are a lot of copy paste detection tools...
SimScan, the Similarity Scanner, is an Eclipse/IDEA/JBuilder plugin that finds duplicated or similar fragments of code in large Java source code bases. I don't know it, and have no idea what "similar fragments" means. It sounds like it might also only look within single projects, but the IntelliJ screenshots look nifty.
This paper from the International Conference on Software Engineering 2009 compares CloneDetective, PMD's CPD, Simian and SimScan.
Since PMD's copy & paste finder is actually called CPD, for "copy paste detector", using that term as the terminus technicus when googling helps. Another term often used is "clone detection".

You could try using the command line version of PMD CPD:
http://pmd.sourceforge.net/cpd.html
You should be able to specify multiple source trees to check.
Simian, the other prominent copy/paste detector, has similar command-line capabilities.
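To make the "multiple source trees" idea concrete, here is a minimal Python sketch that assembles a CPD command line covering several projects. It assumes the PMD 7-style launcher (`pmd cpd` with one `--dir` per tree); older releases used a different launcher script and `--files` instead, so check the CPD page for your version. The helper name `build_cpd_command` and the example paths are made up for illustration.

```python
def build_cpd_command(source_roots, minimum_tokens=100, language="java", launcher="pmd"):
    """Build an argument list that points CPD at several source trees at once.

    Assumption: PMD 7-style CLI ("pmd cpd ... --dir <path>").
    Older PMD versions used "--files" and a different launcher script.
    """
    if not source_roots:
        raise ValueError("at least one source root is required")

    cmd = [
        launcher, "cpd",
        "--minimum-tokens", str(minimum_tokens),  # smallest duplicate worth reporting
        "--language", language,
    ]
    # One --dir per project, so duplicates spanning projects end up in one report.
    for root in source_roots:
        cmd += ["--dir", str(root)]
    return cmd


if __name__ == "__main__":
    # Hypothetical project layout; print the command rather than running it.
    print(" ".join(build_cpd_command(["projectA/src", "projectB/src"])))
```

The resulting list can be handed to `subprocess.run` once PMD is installed and on the `PATH`.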

See our Java CloneDR, a tool for finding duplicated code across large sets of code.
CloneDR finds exact and near-miss clones using the structure of the code (abstract syntax trees) as a guide, so it isn't confused by whitespace or comment changes. For detected clones, it shows you the clone instances, and a parameterized generalization that you can use as the basis of replacement abstraction (in Java, that's pretty much done by making a method; other languages have other techniques).
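To illustrate why ignoring whitespace and comments matters, here is a deliberately naive Python sketch. It is not CloneDR's algorithm (no abstract syntax trees, no near-miss detection, no parameterized generalization): it just strips Java-style comments and whitespace, hashes sliding windows of lines, and reports windows that occur in more than one file. The function names and the window size are assumptions made for this example.

```python
import hashlib
import re
from collections import defaultdict

_BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
_LINE_COMMENT = re.compile(r"//[^\n]*")


def normalize(source):
    """Return (original_line_number, text) pairs with comments and whitespace removed.

    Naive on purpose: comment markers inside string literals are mishandled,
    and dropping all whitespace merges adjacent tokens.
    """
    # Keep the newlines of block comments so line numbers stay correct.
    source = _BLOCK_COMMENT.sub(lambda m: "\n" * m.group(0).count("\n"), source)
    source = _LINE_COMMENT.sub("", source)
    result = []
    for number, line in enumerate(source.splitlines(), start=1):
        text = re.sub(r"\s+", "", line)
        if text:
            result.append((number, text))
    return result


def find_clones(files, window=3):
    """Group (file, start_line) locations whose normalized windows are identical.

    `files` maps a file name to its source text; only groups that span
    more than one file are returned.
    """
    seen = defaultdict(list)
    for name, source in files.items():
        lines = normalize(source)
        for i in range(len(lines) - window + 1):
            chunk = "\n".join(text for _, text in lines[i:i + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((name, lines[i][0]))
    return [locs for locs in seen.values() if len({n for n, _ in locs}) > 1]
```

Feeding it the files of all projects at once (for example, read from several directories into one dictionary) is what makes the detection cross-project. A structure-based tool goes further by also matching code whose identifiers or literals differ.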
Another poster references a technical paper comparing clone detectors. If you examine the paper, reference number [1] is to CloneDR. The authors of that paper do not compare their detector against CloneDR, as their detector only uses tokens, not the more sophisticated method CloneDR has that uses language structure.
CloneDR works for a variety of languages: Java, C#, C++, COBOL, JavaScript, PHP, many others.
To handle multiple projects, you just tell CloneDR the set of files in all the projects.

If you can put those projects into one Eclipse workspace, CodePro AnalytiX will happily consume all of them together: https://code.google.com/javadevtools/codepro/doc/index.html

If you are looking for an Eclipse plugin, check out UCDetector (Unnecessary Code Detector).
