I'm using PHP's built-in XMLReader to read data from external XML feeds. When I try to read a feed that starts with a newline, I get the following error:

ErrorException: XMLReader::read(): http://example.com/feeds/feed1.xml:2: parser error : XML declaration allowed only at the start of the document

I think it's because the feed starts with a newline, but I don't know how to solve the problem. How can I make the reader skip the leading blank line?

I can't seem to find anyone who has solved this problem. The workarounds I have found use SimpleXMLElement, but I can't load the entire document into memory.

Here is my code:

$reader = new XMLReader;
$reader->open($linkToExternalFeed);

while ($reader->read() && $reader->name != 'item');

while ($reader->name == 'item')
{
    $node = new SimpleXMLElement($reader->readOuterXML());

    $this->doSomeParsing($node);

    unset($node);

    $reader->next($reader->name);
}

$reader->close();
Christian Gerdes
  • "I think it's because the feed starts with a new line" If the site was serving invalid XML, I expect no parser would work with it, making the document pretty useless. Have you confirmed that extra whitespace is present? – miken32 Nov 08 '18 at 19:47
  • does this maybe help: https://stackoverflow.com/a/5479544/2412335 – digijay Nov 08 '18 at 19:47
  • The file does contain a newline at the beginning. I can make the parser work using SimpleXMLElement. It only throws an error when using XMLReader. I can't use that thread since it requires me to load the entire feed into memory – Christian Gerdes Nov 08 '18 at 19:51
  • The parser is right to reject the file, because it isn't well-formed XML. However, unlike most of the posts we get asking how to process bad XML input, in this case the damage is very easy to repair by trimming off the invalid whitespace before sending the content to the XML parser. Nevertheless, the usual advice applies: if someone is sending you bad XML you should get them to mend their ways. – Michael Kay Nov 08 '18 at 21:37
  • I understand, but it would be awesome if we were able to parse the feed even when it contains a line break at the beginning. – Christian Gerdes Nov 08 '18 at 21:41

2 Answers

2

You could write a stream wrapper that filters the stream. Once the filter finds the first non-whitespace content, it removes itself and the remaining data is passed to XMLReader unchanged.

class ResourceWrapper {

    private $_stream;

    private $_filter;

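    // Set by PHP to the stream context in effect when the wrapper is opened;
    // the manual declares this property as public.
    public $context;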

    public static function createContext(
        $stream, callable $filter = NULL, string $protocol = 'myproject-resource'
    ): array {
        self::register($protocol);
        return [
            $protocol.'://context', 
            \stream_context_create(
                [
                    $protocol => [
                        'stream' => $stream,
                        'filter' => $filter
                    ]
                ]
            )
        ];
    }

    private static function register($protocol) {
        if (!\in_array($protocol, \stream_get_wrappers(), TRUE)) {
            \stream_wrapper_register($protocol, __CLASS__);
        }
    }

    public function removeFilter() {
        $this->_filter = NULL;
    }

    public function url_stat(string $path , int $flags): array {
        return [];
    }

    public function stream_open(
        string $path, string $mode, int $options, &$opened_path
    ): bool {
        list($protocol, $id) = \explode('://', $path);
        $context = \stream_context_get_options($this->context);
        if (
            isset($context[$protocol]['stream']) &&
            \is_resource($context[$protocol]['stream'])
        ) {
            $this->_stream = $context[$protocol]['stream'];
            $this->_filter = $context[$protocol]['filter'];
            return TRUE;
        }
        return FALSE;
    }

    public function stream_read(int $count) {
        if (NULL !== $this->_filter) {
            $filter = $this->_filter;
            return $filter(\fread($this->_stream, $count), $this);
        }
        return \fread($this->_stream, $count);
    }

    public function stream_eof(): bool {
        return \feof($this->_stream);
    }
}

Usage:

$xml = <<<'XML'


<?xml version="1.0"?>
<person><name>Alice</name></person>
XML;

// open the example XML string as a file stream
$resource = fopen('data://text/plain;base64,'.base64_encode($xml), 'rb');

$reader = new \XMLReader();
// create context for the stream and the filter
list($uri, $context) = \ResourceWrapper::createContext(
    $resource,
    function($data, \ResourceWrapper $wrapper) {
        // check for content after removing leading white space
        if (ltrim($data) !== '') {
            // found content, remove filter
            $wrapper->removeFilter();
            // return data without leading whitespace
            return ltrim($data);
        }
        return '';
    }
);
libxml_set_streams_context($context);
$reader->open($uri);

while ($foundNode = $reader->read()) {
    var_dump($reader->localName);
}

Output:

string(6) "person" 
string(4) "name" 
string(5) "#text" 
string(4) "name" 
string(6) "person"
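
A lighter-weight variation on the same idea, which is not part of the code above, might be a user-defined stream filter attached through `php://filter`. This is only a sketch: the class name `LeadingWhitespaceFilter` and the filter name `trim.leading` are made up for illustration, and a temporary file stands in for the external feed so the example runs offline.

```php
<?php
// Hypothetical filter: drops whitespace until the first real content arrives,
// then passes every later bucket through unchanged.
class LeadingWhitespaceFilter extends php_user_filter
{
    private $seenContent = false;

    public function filter($in, $out, &$consumed, $closing): int
    {
        $passedData = false;
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            if (!$this->seenContent) {
                $bucket->data = ltrim($bucket->data);
                if ($bucket->data === '') {
                    continue; // only whitespace so far, drop this bucket
                }
                $this->seenContent = true;
            }
            stream_bucket_append($out, $bucket);
            $passedData = true;
        }
        return $passedData ? PSFS_PASS_ON : PSFS_FEED_ME;
    }
}

// 'trim.leading' is an assumed name; any unused filter name would do.
stream_filter_register('trim.leading', LeadingWhitespaceFilter::class);

// Stand-in for the external feed: a local file that starts with blank lines.
$feedFile = tempnam(sys_get_temp_dir(), 'feed');
file_put_contents(
    $feedFile,
    "\n\n<?xml version=\"1.0\"?>\n<person><name>Alice</name></person>"
);

$names = [];
$reader = new XMLReader();
// The php://filter URL applies the read filter before libxml sees any data.
$reader->open('php://filter/read=trim.leading/resource=' . $feedFile);
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT) {
        $names[] = $reader->localName;
    }
}
$reader->close();
unlink($feedFile);
```

For a real feed, the `resource=` part would presumably be the feed URL instead of the temporary file. The trade-off compared with the wrapper class is less code, at the cost of not being able to hand XMLReader an already opened resource.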
ThW
  • Is there any documentation which has all this sort of information in it? – Nigel Ren Nov 09 '18 at 11:32
  • Well, there is the PHP manual at http://php.net/streamwrapper. A while ago a friend asked me to extend XMLReader with a method to attach a stream rather than opening a URI. I stripped that implementation down for the answer. – ThW Nov 09 '18 at 12:18
  • I think the missing link for me was the `libxml_set_streams_context()`, but thanks - off to RTFM :) – Nigel Ren Nov 09 '18 at 13:45
0

Not ideal, but you could read the source, ltrim() the first chunk of the content, and write the result to a temporary file. You should then be able to open the file named in $tmpFile with XMLReader...

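// Create a uniquely named temporary file in the current directory
$tmpFile = tempnam(".", "trx");
$fpIn = fopen($linkToExternalFeed, "r");
$fpOut = fopen($tmpFile, "w");
// Strip leading whitespace from the first chunk only
$buffer = fread($fpIn, 4096);
fwrite($fpOut, ltrim($buffer));
// Copy the rest unchanged; the explicit comparisons keep a chunk
// such as "0" from ending the loop early
while (($buffer = fread($fpIn, 4096)) !== false && $buffer !== '') {
    fwrite($fpOut, $buffer);
}
fclose($fpIn);
fclose($fpOut);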

I use tempnam() to generate a unique file name; you could set this to anything you are happy with. It is also worth deleting the file once you've processed it, to save space and remove potentially sensitive information.
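
Putting the pieces together, a minimal end-to-end sketch might look like the following. It makes a few assumptions: a `data:` URI stands in for `$linkToExternalFeed` so it runs offline, the temporary file goes in the system temp directory, `stream_copy_to_stream()` is swapped in for the manual copy loop, and the item text is simply collected into an array where the question calls its own parsing method.

```php
<?php
// Stand-in feed with leading blank lines; a real feed URL would go here.
$feed = "\n\n<?xml version=\"1.0\"?>\n"
      . "<rss><item>first</item><item>second</item></rss>";
$linkToExternalFeed = 'data://text/plain;base64,' . base64_encode($feed);

// Copy the feed to a temporary file, trimming only the first chunk.
$tmpFile = tempnam(sys_get_temp_dir(), 'trx');
$fpIn = fopen($linkToExternalFeed, 'r');
$fpOut = fopen($tmpFile, 'w');
fwrite($fpOut, ltrim((string) fread($fpIn, 4096)));
stream_copy_to_stream($fpIn, $fpOut); // remainder is copied unchanged
fclose($fpIn);
fclose($fpOut);

$items = [];
$reader = new XMLReader();
try {
    $reader->open($tmpFile);
    // Advance to the first <item>, as in the question's code.
    while ($reader->read() && $reader->name !== 'item');
    while ($reader->name === 'item') {
        $node = new SimpleXMLElement($reader->readOuterXml());
        $items[] = (string) $node; // placeholder for the real per-item parsing
        $reader->next('item');
    }
    $reader->close();
} finally {
    // Remove the temporary copy even if parsing fails.
    unlink($tmpFile);
}
```

Only one item is held in memory at a time, so the memory constraint from the question is still met; the cost is the extra disk write.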

Nigel Ren