15

I have a single PHP file in a legacy project that is at least a few thousand lines long. It is mostly divided into conditional blocks by a switch statement with about 10 cases. Within each case there is what appears to be a very similar, if not exactly duplicated, block of code. What methods are available for identifying these blocks as being the same, or close to the same, so I can abstract that code out and begin to refactor the entire file? I know this can be done manually (separate each case of the switch into its own file and diff them), but I'm interested in what tools I could use to speed this process up.

Thanks.

skaffman
robjmills

4 Answers

14

You can use phpcpd.

phpcpd is a Copy/Paste Detector (CPD) for PHP code. It scans a PHP project for duplicated code.
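
To give a feel for what a copy/paste detector does, here is a minimal Python sketch of the general idea: slide a fixed-size window over the whitespace-normalized lines and report any window that occurs more than once. This is only an illustration and not phpcpd's actual implementation; the function name `find_duplicate_windows` and the `window` parameter are made up for this example.

```python
from collections import defaultdict


def find_duplicate_windows(source, window=3):
    """Map each run of `window` consecutive non-blank lines that occurs
    more than once to the 1-based line numbers where it starts."""
    # Strip indentation and skip blank lines, but remember original line numbers.
    lines = [
        (number, text.strip())
        for number, text in enumerate(source.splitlines(), start=1)
        if text.strip()
    ]
    seen = defaultdict(list)
    for k in range(len(lines) - window + 1):
        chunk = tuple(text for _, text in lines[k:k + window])
        seen[chunk].append(lines[k][0])
    # Keep only the windows that were seen at least twice.
    return {chunk: starts for chunk, starts in seen.items() if len(starts) > 1}
```

Calling `find_duplicate_windows(open("legacy.php").read(), window=5)` (with a hypothetical file name) would list every 5-line run that appears in more than one place, which is roughly the kind of report an exact-match detector produces.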


Gordon
  • 1
    that looks like a great starting point and a really handy tool. thanks – robjmills Sep 23 '10 at 13:47
  • 1
    Sadly, this only detects duplicate PHP statements - I have a project with thousands of lines of duplicate HTML in PHP templates, and this tool only actually detects a pretty small number of those lines. – mindplay.dk May 15 '19 at 10:51
4

You can use phpunit PMD (Project Mess Detector) to detect duplicated blocks of code.

It also can compute the Cyclomatic complexity of your code.

Here is a screenshot of the pmd tab in phpuc: [image: pmd tab]

greg0ire
  • Cyclomatic Complexity has nothing to do with Copy and Pasted code. And looking at the docs for [PMD](http://phpmd.org/rules/index.html), I'd say it cannot detect such duplicate code. It is without a doubt a good tool though. – Gordon Sep 23 '10 at 13:35
  • I updated my post, I think it is clearer now. I also think phpunit-pmd uses phpcpd, doesn't it? Or is it another implementation? – greg0ire Sep 23 '10 at 13:38
  • I might have been confused by the tab label in this (great) UI, which might call several tools. – greg0ire Sep 23 '10 at 13:50
  • 1
    It definitely does. But check out [hudson](http://www.whitewashing.de/blog/126) and [arbit](http://www.arbitracker.org/news.html) for alternatives. – Gordon Sep 23 '10 at 13:55
  • Thanks for these clarifications. Adding this post to my favorites :-) – greg0ire Sep 23 '10 at 14:12
2

See our PHP Clone Detector tool.

This finds both exact copies and near misses, in spite of reformatting, insertion/deletion of comments, replacement of variable names, addition/replacement of subblocks, etc.

PHPCPD as far as I can tell finds only (token) sequences which are exactly the same. That misses a lot of clones, since the most common operation after copy-paste is edit-to-customize. So it would miss the very clones the OP is trying to find.
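
To make the exact-match versus parameterized-clone distinction concrete, here is a rough Python sketch that replaces every PHP variable name and string literal with a placeholder before comparing two blocks. It is a crude regex-based illustration only: it is not how the PHP Clone Detector or PHPCPD works, and the names `normalize` and `is_parameterized_clone` are invented for this example.

```python
import re

# PHP variable names such as $user or $this.
_VAR = re.compile(r"\$[A-Za-z_][A-Za-z0-9_]*")
# Single- and double-quoted string literals, allowing backslash escapes.
_SQ = re.compile(r"'(?:\\.|[^'\\])*'")
_DQ = re.compile(r'"(?:\\.|[^"\\])*"')


def normalize(block):
    """Return the block's non-blank lines with variables and strings masked."""
    result = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        # Mask strings first so a '$' inside a string is not seen as a variable.
        line = _SQ.sub("'S'", line)
        line = _DQ.sub("'S'", line)
        line = _VAR.sub("$V", line)
        result.append(line)
    return result


def is_parameterized_clone(block_a, block_b):
    """True if the blocks are identical once names and strings are masked."""
    return normalize(block_a) == normalize(block_b)
```

Two blocks that differ only in variable names or string values compare equal here, while an exact-match comparison of the raw text would treat them as different. Inserted or deleted statements are not handled by this sketch.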

Ira Baxter
  • 1
    Stop spreading FUD. phpcpd compares without taking whitespace into account. – cweiske Apr 30 '11 at 12:00
  • @cweiske: That means it only finds token sequences that are exactly the same, which is what I said. It won't find parameterized clones, which are those where the code has been copy-paste-edited. It may find *pieces* of such clones, but that's a lot less helpful. – Ira Baxter Apr 30 '11 at 13:19
  • @cweiske: Have you examined the Joomla report shown at the website? It shows the parameterized clones I'm talking about. Run PHPCPD on it, and compare the results. I think you'll be surprised. – Ira Baxter Apr 30 '11 at 13:38
  • @cweiske: FWIW, the GitHub site for PHPCPD https://github.com/sebastianbergmann/phpcpd shows an example run on 60,000 lines of code, where it finds only 0.2% clones ("exact matches"). That's frankly a pathetically small number of clones based on my decade of building/running clone detectors for many languages; most code of any scale has 5-20% or more. The difference has to do with detecting parameterized clones. You can download CloneDR and try it yourself. – Ira Baxter Apr 30 '11 at 14:11
0

You could put the blocks in separate files and just run diff on them?

However, I think in the end you will need to go through everything manually anyway, since it sounds like this code requires a lot of refactoring, and even if there are differences you will probably need to evaluate whether this is intentional or a bug.
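
The split-and-diff approach can be partly automated. Below is a minimal Python sketch that splits a switch body into its case blocks and scores every pair with the standard library's `difflib.SequenceMatcher`. It rests on several assumptions: each `case ...:` or `default:` label sits on its own line, there are no nested switch statements, and the closing brace simply ends up in the last block. The helper names `split_cases` and `pairwise_similarity` are made up for this example.

```python
import difflib
import itertools
import re

# A case label on a line of its own, e.g. "case 'edit':" or "default:".
_CASE = re.compile(r"^\s*(case\b.*|default\s*):\s*$")


def split_cases(source):
    """Return {case label: [stripped non-blank lines of that case]}."""
    blocks = {}
    current = None
    for line in source.splitlines():
        match = _CASE.match(line)
        if match:
            current = match.group(1).strip()
            blocks[current] = []
        elif current is not None and line.strip():
            blocks[current].append(line.strip())
    return blocks


def pairwise_similarity(blocks):
    """Return (label_a, label_b, ratio) tuples, most similar pair first."""
    scored = []
    for (a, lines_a), (b, lines_b) in itertools.combinations(blocks.items(), 2):
        ratio = difflib.SequenceMatcher(None, lines_a, lines_b).ratio()
        scored.append((a, b, ratio))
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

A ratio of 1.0 means two cases are line-for-line identical after stripping indentation; pairs with a high but not perfect ratio are the ones worth reading side by side to decide whether the difference is intentional or a bug.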

mikera