
I have scraped 5000 pages and stored each one in its own file (0.txt to 4999.txt). Now I need to find duplicate content among them, so I am comparing every file against every other file in a nested loop, which works out to roughly 5000 × 4999 / 2 ≈ 12.5 million comparisons (estimated time: 82 hours). My main concern is the sheer number of iterations. Can anyone suggest a better approach to cut down the iterations and reduce the time taken?

Current code (the NCD algorithm):

function ncd_new($sx, $sy, $prec=0, $MAXLEN=9000) {
# NCD with gzip artifact correction and percentage return.
# $sx, $sy = strings to compare.
# Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage.
# For the NCD definition see http://arxiv.org/abs/0809.2553
  $x = $min = strlen(gzcompress($sx));
  $y = $max = strlen(gzcompress($sy));
  $xy = strlen(gzcompress($sx.$sy));
  $a = $sx;
  if ($x > $y) { # swap so $min/$max are consistent and $a is the shorter-compressing string
    $min = $y;
    $max = $x;
    $a = $sy;
  }
  $res = ($xy - $min) / $max; # NCD definition
  if ($MAXLEN < 0 || $xy < $MAXLEN) {
    $aa = strlen(gzcompress($a.$a));
    $ref = ($aa - $min) / $min;
    $res = $res - $ref; # gzip artifact correction
  }
  return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}

Looping over each file:

$totalScraped = 5000;
for ($fileC = 0; $fileC < $totalScraped; $fileC++)
{
    $f1 = file_get_contents($fileC.".txt");
    $stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
    $file1 = preg_replace($stripstr, '', $f1);

    // start the inner loop at $fileC + 1 to skip pairs that were already compared
    // e.g. if $fileC = 10, compare against files 11 to 4999
    for ($fileD = $fileC + 1; $fileD < $totalScraped; $fileD++)
    {
        $f2 = file_get_contents($fileD.".txt", FILE_USE_INCLUDE_PATH);
        $stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
        $file2 = preg_replace($stripstr, '', $f2);

        $total = ncd_new($file1, $file2);

        echo "$fileC.txt vs $fileD.txt is: $total%\n";
    }
}
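
One way to cut the per-pair cost (the comments below make the same point about reusing the `preg_replace` output) is to preprocess every file once, caching both the stripped text and its compressed length, so each pair only needs one extra `gzcompress` call. A minimal sketch, assuming all 5000 stripped files fit in memory; it uses the plain NCD formula and skips ncd_new()'s artifact-correction term:

$totalScraped = 5000;
$stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');

$text = array();   // stripped contents, computed once per file
$clen = array();   // cached compressed lengths
for ($i = 0; $i < $totalScraped; $i++) {
    $text[$i] = preg_replace($stripstr, '', file_get_contents($i.".txt"));
    $clen[$i] = strlen(gzcompress($text[$i]));
}

for ($i = 0; $i < $totalScraped; $i++) {
    for ($j = $i + 1; $j < $totalScraped; $j++) {
        $xy  = strlen(gzcompress($text[$i].$text[$j])); // the only compression left per pair
        $min = min($clen[$i], $clen[$j]);
        $max = max($clen[$i], $clen[$j]);
        $ncd = ($xy - $min) / $max; // NCD definition, without the correction term
        echo "$i.txt vs $j.txt is: " . (100 * round($ncd, 2)) . "%\n";
    }
}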
  • Maybe PECL's [`xdiff` extension](http://php.net/manual/en/ref.xdiff.php) is worth a look. Seems to me like a valid use case. – Elias Van Ootegem Jan 20 '15 at 12:39
  • @Elias: the problem is not the matching algorithm but the looping itself; still, I'll also check whether xdiff provides better speed than ncd_new(). Thank you for the suggestion. – Sachin Upmanyu Jan 20 '15 at 12:55
  • I just figured that diffing 2 files, and then processing the diff (ignoring _is_, _was_ and all that) would be somewhat more efficient than what you're doing now. Or at the very least: store the contents after `preg_replace` somewhere (in memory, or to disk), to avoid calling `preg_replace` thousands of times pointlessly. – Elias Van Ootegem Jan 20 '15 at 13:04
  • Yes, I already preg_replaced while scraping, so it won't be necessary here; thanks for pointing that out. – Sachin Upmanyu Jan 20 '15 at 13:18

2 Answers


You may want to find a way to distinguish likely candidates from unlikely ones. Maybe there is a value you can compute for each file (say a word count, a count of sentences or paragraphs, or even a count of individual letters) to rule out the unlikely candidates beforehand. If you can do that, you can reduce the number of comparisons by ordering your files by this computed number.
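
A rough sketch of that idea, using a word count as the cheap per-file value; the choice of metric and the 10% window are illustrative assumptions, not something specified in the answer:

$totalScraped = 5000;

// Compute a cheap value per file (word count here), then sort by it.
$counts = array();
for ($i = 0; $i < $totalScraped; $i++) {
    $counts[$i] = str_word_count(file_get_contents($i.".txt"));
}
asort($counts);               // ascending by word count, file index kept as key
$order = array_keys($counts);

for ($a = 0; $a < $totalScraped; $a++) {
    for ($b = $a + 1; $b < $totalScraped; $b++) {
        $i = $order[$a];
        $j = $order[$b];
        // Files are sorted by count: once the gap exceeds 10%, every later file is even further away.
        if ($counts[$j] > $counts[$i] * 1.1) {
            break;
        }
        // Likely candidate pair: run the expensive comparison (e.g. ncd_new()) only here.
        echo "candidate: $i.txt vs $j.txt\n";
    }
}

How wide the window needs to be to avoid missing real duplicates depends on the data.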

Burki

Another process that I tried was:

  1. Strip the HTML tags from the page.
  2. Replace \s{2,} with \s and \n{2,} with \n, so that the text between each pair of tags ends up (almost) on a single line.
  3. Compare two such generated files by taking a line from one and preg_matching it against the other; if it is found, it is a duplicate. Otherwise break the line into an array of words, take the array_intersect against the other file, and if the count is 70% or more of the line length, treat it as a duplicate (a rough sketch of this step follows the list).
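
A minimal sketch of step 3; the helper name, the interpretation of "line length" as the line's word count, and the exact matching details are my guesses, since the original code for this step was not posted:

// Returns true if $lineA counts as a duplicate of something in $contentB.
function line_is_duplicate($lineA, $contentB) {
    // exact match: the whole line occurs somewhere in the other file
    if (preg_match('/' . preg_quote($lineA, '/') . '/', $contentB)) {
        return true;
    }
    // fuzzy match: 70% or more of the line's words appear in some line of the other file
    $wordsA = preg_split('/\s+/', trim($lineA), -1, PREG_SPLIT_NO_EMPTY);
    if (count($wordsA) == 0) {
        return false;
    }
    foreach (explode("\n", $contentB) as $lineB) {
        $wordsB = preg_split('/\s+/', trim($lineB), -1, PREG_SPLIT_NO_EMPTY);
        if (count(array_intersect($wordsA, $wordsB)) >= 0.7 * count($wordsA)) {
            return true;
        }
    }
    return false;
}

// usage: check every line of 0.txt against the contents of 1.txt
$contentB = file_get_contents("1.txt");
foreach (file("0.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    if (line_is_duplicate($line, $contentB)) {
        echo "duplicate line: $line\n";
    }
}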

This process was very efficient: I could compare 5000 files in about 10 minutes.

But it was still too slow for my requirements.

So I implemented the first approach (the NCD algorithm) in C, and it completes the task in 5-10 seconds (depending on the average page size).