1

I have two files: one is full of keyword sequences (~20k lines), the other is full of regular expressions (~2.5k).

I want to test each keyword against each regexp and print the ones that match. I tested my files and that makes around 22,750,000 tests. I am using the following code:

$count = 0;
$countM = 0;
foreach ($arrayRegexp as $r) {
    foreach ($arrayKeywords as $k) {
        $count++;
        if (preg_match($r, $k, $match)) {
            $countM++;
            echo $k.' matched with keywords '.$match[1].'<br/>';
        }
    }
}
echo "$count tests with $countM matches.";

Unfortunately, after computing for a while, only some of the actual matches are displayed and the final line with the counts never appears. Even stranger, if I comment out the preg_match section and keep only the two foreach loops and the count display, everything works fine.

I believe this is due to an excessive amount of data to process, but I would like to know whether there are recommendations I didn't follow for this kind of operation. The regular expressions I use are very complicated and I cannot switch to something else.

Ideas anyone?

Alan Moore
Gabriel S.
  • You should show a sample of the keywords (which actually form the subject here?) and the regular expressions. – mario Dec 20 '10 at 11:01
  • Also: Are you merely interested in the match counts or also in the matches themselves? – Tomalak Dec 20 '10 at 11:02
  • Showing keyword samples would be irrelevant, as they are only queries made via a search engine. The regexp checks whether a query contains a specific product name, and the corresponding ads are displayed. – Gabriel S. Dec 20 '10 at 11:14
  • You are missing a closing bracket after the if statement btw – Jason Dec 20 '10 at 11:52
  • @Gaël: It is not entirely irrelevant what format your data is in. People might be able to show you a more efficient approach when they know exactly what you are working with. Also, why is it a flat file and not a database? – Tomalak Dec 20 '10 at 12:04
  • @Tomalak These are actually values separated by line breaks. They are in this format because that is how they are provided to me and I have no control over it (sensitive information and all...). Moreover, I know the regular expressions work, as I tested them manually. For all these reasons I didn't show you the data, but I am aware that in most cases questions cannot be answered without more information on the data. – Gabriel S. Dec 20 '10 at 12:16

2 Answers

2

There are two optimization options:

  • Regular expressions can usually be combined into alternatives: /(regex1|regex2|...)/. PCRE can often evaluate alternatives faster than PHP can execute a loop.
  • I'm not sure if this is faster at all (modifies the subjects), but you could use the keywords array as parameter to preg_replace_callback() directly, thus eliminating the second loop.

As an example:

 $rx = implode("|", $arrayRegexp);  // assumes the patterns have no /regexp/ delimiters

 preg_replace_callback("#($rx)#", "print", $arrayKeywords);

Define a custom print function, though, to output and count the results, and have it simply return e.g. an empty string.

Come to think of it, preg_replace_callback would also take an array of regular expressions. Not sure if it cross-checks each regex on each string though.
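
A minimal sketch of the combined-alternation idea, under a few assumptions: the sample patterns and keywords are hypothetical (the real data isn't shown), the patterns carry no delimiters and contain no "#", and the chunk size is arbitrary. The patterns are combined in chunks rather than all at once, in case one giant pattern turns out to be too large to compile. The callback here returns the match unchanged, so the subjects stay intact between chunks.

```php
<?php
// Hypothetical sample data; the real files are not shown in the thread.
$arrayRegexp   = ['foo\d+', 'bar(?:baz)?', 'qux'];   // patterns WITHOUT delimiters
$arrayKeywords = ['buy foo42 now', 'nothing here', 'cheap barbaz', 'qux and foo7'];

$countM = 0;

// Custom callback: print and count each match, then return it unchanged
// so that the subject strings are not modified.
$report = function (array $m) use (&$countM): string {
    $countM++;
    echo $m[0], " matched\n";
    return $m[0];
};

// Combine the patterns chunk by chunk (chunk size 2 is arbitrary here).
foreach (array_chunk($arrayRegexp, 2) as $chunk) {
    // Wrap each pattern in a non-capturing group so "|" stays contained.
    $rx = '#(?:' . implode(')|(?:', $chunk) . ')#';
    preg_replace_callback($rx, $report, $arrayKeywords);
}

echo "$countM matches.\n";
```

Note that this counts every match occurrence in every keyword, whereas preg_match() in a loop counts at most one match per regexp/keyword pair. Capture group numbers also shift once patterns are combined, so this sketch reports the whole match ($m[0]) instead of $match[1].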

mario
  • Your first solution produces a 4,000,000-character string of regular expressions, and preg_match seems unable to cope with that! As for the second solution, it doesn't seem to run any faster, but speed isn't the issue here anyway. I will keep this in mind, however! Thank you :-) – Gabriel S. Dec 20 '10 at 11:59
  • @Gaël: That's 4MB of string, which is not very much to begin with. Regex can easily cope with that, and much more efficiently than a one-by-one for loop. While increasing the execution time of your script may solve your immediate problem, it is very probable that a solution exists that outperforms your approach by orders of magnitude. This answer points in the right direction. – Tomalak Dec 20 '10 at 12:20
  • @Tomalak That is why I accepted this answer, specifying it was a temporary solution. As I will only execute this script once or twice when data collecting is finished, it doesn't need real optimizing but simply to work until the end of the files :-) – Gabriel S. Dec 20 '10 at 12:29
-1

Increase execution time

Use this line in .htaccess:

php_value max_execution_time 80000
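
The limit can also be lifted from inside the script itself. The following is only a sketch, assuming the host allows changing the limit at runtime; the two small arrays are hypothetical stand-ins for the real data. It also checks preg_match() for a false return value, which signals an internal PCRE error rather than a plain non-match and may be worth ruling out with very complicated patterns.

```php
<?php
// Lift the execution time limit for this script only (0 = no limit).
// Assumption: the host permits changing it at runtime.
set_time_limit(0);

// Hypothetical stand-ins for the two large arrays in the question.
$arrayRegexp   = ['/(foo)/', '/(bar)/'];
$arrayKeywords = ['foo one', 'bar two', 'none'];

$count = 0;
$countM = 0;
foreach ($arrayRegexp as $r) {
    foreach ($arrayKeywords as $k) {
        $count++;
        $result = preg_match($r, $k, $match);
        if ($result === false) {
            // false means a PCRE error occurred (not simply "no match")
            echo "Error code " . preg_last_error() . " for $r\n";
        } elseif ($result === 1) {
            $countM++;
            echo "$k matched with keyword {$match[1]}\n";
        }
    }
}
echo "$count tests with $countM matches.\n";
```

Removing the limit only lets the script run to the end; it does not make the loop itself any faster.
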
Pradeep Singh
  • Or just `ini_set('max_execution_time', 80000);` in the script – seriousdev Dec 20 '10 at 11:31
  • I guess `ini_set('max_execution_time', 0);` is more correct since it allows infinite execution time. Anyway, this fixes my problem for now. Thanks! – Gabriel S. Dec 20 '10 at 11:53