I am processing XML files read directly from a zip archive using a PHP zip stream. Occasionally these files contain large CDATA chunks that are not relevant to me but cause the SimpleXML processing to run out of memory.
I thought implementing a stream filter to remove these chunks before passing the data to simplexml_load_string()
would solve the problem, but PHP uses exactly the same amount of memory with and without the filter.
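For reference, the unfiltered version is essentially this (the zip path and file name are just placeholders):

$xml = file_get_contents('zip://pathtozip.zip#fileinzip.xml');
$doc = simplexml_load_string($xml); // runs out of memory when the file contains the big CDATA blocks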
My stream filter looks like this:
class JunkContentsNodesStreamFilter extends \php_user_filter
{
    const START_MARKER = '<Contents><![CDATA[';
    const END_MARKER = ']]></Contents>';

    private $skipping = false;

    public function filter($in, $out, &$consumed, $closing)
    {
        $pattern = '~' . preg_quote(self::START_MARKER, '~') . '.*?' . preg_quote(self::END_MARKER, '~') . '~s';

        while ($bucket = stream_bucket_make_writeable($in)) {
            // Always consume all input.
            $consumed += $bucket->datalen;

            // Entire match within the same bucket: just remove it.
            $bucket->data = preg_replace($pattern, '', $bucket->data);

            if ($this->skipping) {
                $pos = strpos($bucket->data, self::END_MARKER);
                if ($pos === false) {
                    // Still inside a junk block: drop the whole bucket.
                    $bucket->data = '';
                } else {
                    // Found an end marker: remove everything up to it and stop skipping.
                    $bucket->data = substr($bucket->data, $pos + strlen(self::END_MARKER));
                    $this->skipping = false;
                }
            } else {
                $pos = strpos($bucket->data, self::START_MARKER);
                if ($pos !== false) {
                    // Found a start marker with no end marker in this bucket: keep the prefix and start skipping.
                    $bucket->data = substr($bucket->data, 0, $pos);
                    $this->skipping = true;
                }
            }

            $bucket->datalen = strlen($bucket->data);
            stream_bucket_append($out, $bucket);
        }

        return PSFS_PASS_ON;
    }
}
And I use it like this:
stream_filter_register('junk_contents_nodes', 'JunkContentsNodesStreamFilter');
$data = file_get_contents('php://filter/read=junk_contents_nodes/resource=zip://pathtozip.zip#fileinzip.xml');
It does return the stripped contents, but memory usage does not go down at all. The original data can be around 50 MB and the stripped data about 150 kB, so I expected to see some difference.
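In case it matters, this is roughly how I compare the two cases, using memory_get_peak_usage() as the yardstick (simplified):

// With the filter; the unfiltered run is the same line without the php://filter wrapper.
$data = file_get_contents('php://filter/read=junk_contents_nodes/resource=zip://pathtozip.zip#fileinzip.xml');
echo strlen($data) . ' bytes returned, peak memory: ' . memory_get_peak_usage(true) . ' bytes' . PHP_EOL;

Both runs report roughly the same peak, even though strlen($data) drops from roughly 50 MB to roughly 150 kB.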