4

I'm using preg_match_all to search through a file that I'm reading in. The file contains many lines of the following format, and I'm extracting the numbers between the tags:

<float_array id="asdfasd_positions-array" count="6">1 2 3 4 5 6</float_array>

I'm using preg_match_all and it is working well - except it gets so far through the file then seems to stop.

preg_match_all("/\<float_array id\=\".+?positions.+?\" count\=\".+?\"\>(.+?)\<\/float_array\>/",$file, $results);

The file is 90,000 rows and about 8MB in size. I'm editing every third number in the extracted string and using str_replace to put the edited string back into the file contents. The file is then written out again. See the full script here:

http://pastie.org/4300537

The script is successfully replacing about half the entries and not doing anything with the second half of the file. I even copied a successfully edited line from higher in the file and pasted it further down... and it wasn't edited there. It's as if the array is full, but memory_limit is set to 500M.

Any ideas?

EDIT: Solution Found

I found the problem: in some instances the string between the tags was too large and was skipped. I found the limit in PHP: pcre.backtrack_limit is set at 100000, and some strings were larger than this. So I increased it in the .htaccess file using the following line, and it now works.

php_value pcre.backtrack_limit 5000000
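
A minimal sketch of how this kind of silent failure could be surfaced in the script itself, assuming the host allows `ini_set` for this setting. The helper name `matchAllOrFail` is hypothetical; `preg_last_error()` and the `pcre.backtrack_limit` setting are standard PHP. The pattern is the one from the question with the unneeded backslashes removed.

```php
<?php
// Hypothetical helper: run preg_match_all and throw instead of failing silently
// when PCRE gives up (for example because a limit was reached).
function matchAllOrFail(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches);
    if ($count === false || preg_last_error() !== PREG_NO_ERROR) {
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $matches;
}

// Runtime alternative to the .htaccess line (assumes the host permits it).
ini_set('pcre.backtrack_limit', '5000000');

// Sample line from the question; the real script would pass the whole file.
$file = '<float_array id="asdfasd_positions-array" count="6">1 2 3 4 5 6</float_array>';
$results = matchAllOrFail(
    '/<float_array id=".+?positions.+?" count=".+?">(.+?)<\/float_array>/',
    $file
);
echo $results[1][0], "\n";
```

Checking `preg_last_error()` after each call is what would have pointed at the limit directly, rather than the array or memory_limit.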

user1107685
  • Are you setting the PHP execution time limit to 0? After 30 seconds or so, the script will just shut off unless you specify it to run for as long as needed. – Tim Withers Jul 22 '12 at 14:17
  • The file is reading in fully, as the `$file` string is being written to a file at the end, and the full file is there. The script fully executes; I'm resetting the timeout within the loop. If I echo on the final line, that echoes fine. – user1107685 Jul 22 '12 at 14:20
  • Too many backslashes (`<` and `=` don't need escaping). Also use single quotes. And constrain the format further: `[\w-]+` or `\d+` and `[\d\s]*` in place of all the `.+?`. If it is valid XML, also try SimpleXML instead; much simpler, and not measurably slower. – mario Jul 22 '12 at 14:23
  • Thanks Mario - a more typical ID is something like "_10iHdUVMXDPhBIJhh1IGZa-positions-array". Will your suggestion cover the "_" and "-" characters alright? The number and positions of those characters also vary. – user1107685 Jul 22 '12 at 14:27
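
A rough sketch of the constrained pattern suggested in the comments, tried against the longer ID quoted above. `\w` matches letters, digits and `_`, so `[\w-]` covers both `_` and `-` wherever they appear. The sample numbers and the extra characters allowed in the number class (`.`, `e`, `E`, `+`, `-`, for decimals, signs and exponents) are assumptions, since the real data is not shown.

```php
<?php
// Single-quoted pattern: only the "/" delimiter needs escaping.
$pattern = '/<float_array id="[\w-]*positions[\w-]*" count="\d+">([\d\s.eE+-]*)<\/float_array>/';

// ID format taken from the comment above; the numbers are made up for illustration.
$line = '<float_array id="_10iHdUVMXDPhBIJhh1IGZa-positions-array" count="3">1.5 -2 3</float_array>';

$matched = preg_match($pattern, $line, $m);
if ($matched === 1) {
    echo $m[1], "\n"; // the text between the tags
}
```

Because none of the character classes can run past a quote or a tag boundary, this form needs far less backtracking than the `.+?` version.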

2 Answers

2

If memory is the issue and not the execution time limit, then go with the slow solution (line by line) >>

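$fi = fopen('data.txt',  'r');
$fo = fopen('data2.txt', 'w');
if ($fi === false || $fo === false) {
  die('Could not open the input or output file');
}
// fgets() returns false at end of file, so test its result instead of feof()
while (($line = fgets($fi)) !== false) {

  # regex stuff here

  fwrite($fo, $line);
}
fclose($fi);
fclose($fo);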
Ωmega
  • Why would memory cause it though - I thought memory_limit was the only limiting factor for the array or string length? – user1107685 Jul 22 '12 at 14:32
  • @user1107685 - What else can be the issue? If the script works with the 1st half, then it should work with the 2nd one as well. In most cases, the execution time limit and/or memory is behind that, so... try that - ***if it is not going to work*** now, ***then it is not memory that causes it***... Simple! – Ωmega Jul 22 '12 at 14:35
  • A tag won't necessarily stop at a line boundary though; you are better off using `preg_match` and specifying `PREG_OFFSET_CAPTURE`, so you can handle each result separately. This does require 8 MiB of file cache, but you get better handling. Besides that, not only can tags cross line boundaries, the whole XML could be contained in a single 8 MiB line. Alternatively, parse the XML, then parse the returned float array. – Maarten Bodewes Jul 22 '12 at 14:49
  • I'm having a go at the line by line method now. Each line has 1 tag so that's not a problem, they don't cross line boundaries. Will be interested to see how much slower this is too. – user1107685 Jul 22 '12 at 14:53
  • This is mostly working - and actually much faster than previous code - but it's still skipping over a few. It's skipping those with the largest number of numbers between the tags, around 14,000 numbers. – user1107685 Jul 22 '12 at 16:09
  • @user1107685 - You might want to try some regex modification - such as http://ideone.com/ctYdx (**test it!**) - maybe your regex does not really match what you want. We don't see your input text. Show us some examples of those that are skipped (not matched), so we can analyze them... – Ωmega Jul 22 '12 at 16:30
  • Is there a limit in regex for a value? Will it keep recording `(.+?)` for that long before reaching the ? – user1107685 Jul 22 '12 at 16:34
  • @user1107685 - You are dealing with a non-regex problem: I believe PHP has a problem reading such a long line. Try printing the `strlen` of the string you read from the file to see what you get... – Ωmega Jul 22 '12 at 16:47
  • @user1107685 - See this: http://ideone.com/MAauo - it reads maximum 65535 bytes, but input has 138298 bytes. That is your problem! – Ωmega Jul 22 '12 at 16:55
  • Thanks for your help with this Ωmega. What's causing that limit, do you know? Any way of raising it? – user1107685 Jul 22 '12 at 17:06
  • Ωmega I tried your code and it seems 65535 bytes is a limit on IDE One. I tried this (http://pastebin.com/raw.php?i=Y8kmRFBb) on my own hosting account and put that line into snippet.dae and it displays "138298". (??) – user1107685 Jul 22 '12 at 17:30
  • @user1107685 - I answered your question because regex is a field that I work with, not PHP. I suggest you open a new post with a question related just to *"How to read text file line by line with possible length of line exceeding 65535?"* to see what PHP coders will come up with... Good luck! – Ωmega Jul 22 '12 at 17:34
  • I'm not getting that limit on my hosting though. So still trying to find the problem here. – user1107685 Jul 22 '12 at 17:38
  • @user1107685 - Well, then try to execute it with the regex that I used on http://ideone.com/ctYdx. If it is not going to work, then there might be a limit on regex... – Ωmega Jul 22 '12 at 17:41
  • @Ωmega - I ran this (http://pastie.org/pastes/4301606/text). If snippet.dae contains just the long line, it prints an empty array. If I add other, smaller lines to the file, these do get added to the array. The full file is being read (checked using strlen and also by echoing it back). I can't find anything about length limits with regex but that appears to be what's happening... – user1107685 Jul 22 '12 at 18:14
  • SOLVED! I found the limit in PHP. pcre.backtrack_limit is set at 100000, so I added this to the .htaccess file: `php_value pcre.backtrack_limit 2000000` and it works now! Thanks a lot for your help @Ωmega! – user1107685 Jul 22 '12 at 18:22
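
For reference, a sketch of the `preg_match` plus `PREG_OFFSET_CAPTURE` approach mentioned in the comments above, which handles one tag at a time and reports a PCRE failure instead of silently stopping. The two-tag sample string and the tightened pattern are assumptions made for illustration; they are not taken from the real file.

```php
<?php
// Made-up sample input with two tags on a single line.
$file = '<float_array id="x_positions-array" count="3">1 2 3</float_array>'
      . '<float_array id="y_positions-array" count="2">4 5</float_array>';
$pattern = '/<float_array id="[^"]*positions[^"]*" count="\d+">([^<]*)<\/float_array>/';

$offset = 0;
$found  = [];
// With PREG_OFFSET_CAPTURE each entry of $m is [matched text, byte offset].
while (($rc = preg_match($pattern, $file, $m, PREG_OFFSET_CAPTURE, $offset)) === 1) {
    $found[] = $m[1][0];                      // numbers between the tags
    $offset  = $m[0][1] + strlen($m[0][0]);   // resume right after this tag
}
if ($rc === false) {
    // preg_match returns false on failure, e.g. when a PCRE limit is hit
    fwrite(STDERR, 'PCRE error code: ' . preg_last_error() . "\n");
}
print_r($found);
```

The byte offset in `$m[1][1]` could also be used with `substr_replace` to write an edited string back at the exact position, instead of searching for it again with str_replace.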
0

You might consider parsing your text file with a simple parser like this >>

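$fi = fopen("data.txt",  "r");
$fo = fopen('data2.txt', 'w');
$status = 0; // 0 = looking for an opening tag, 1 = next chunk holds the numbers
do {
  // read up to the next ">" (the delimiter itself is not returned)
  $data = stream_get_line($fi, PHP_INT_MAX, ">");
  if ($status == 1) {
    // chunk after a matching opening tag: the numbers followed by "</float_array"
    preg_match("/(.*)<\/float_array$/", $data, $m);
    $status--;
    if (count($m) != 0) {
      fwrite($fo, $m[1] . "\n");
      continue;
    }
  }
  if ($status == 0) {
    // opening float_array tag whose id contains "positions" and which has a count attribute
    preg_match("/<float_array[^>]*?\bid\s*=\s*[\"'][^\"']*?positions[^\"']*?[\"'][^>]*?\bcount\s*=[^>]*?$/", $data, $m);
    if (count($m) > 0) {
      $status++;
    }
  }
} while (!feof($fi));
fclose($fi);
fclose($fo);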
Ωmega