
In PHP, one can pass optional arguments to various XML parsers, one of them being LIBXML_NOENT. The documentation has this to say about it:

LIBXML_NOENT (integer)
Substitute entities

"Substitute entities" isn't very informative (which entities? when are they substituted?). NOENT looks like an abbreviation of NO_ENTITIES or NO_EXTERNAL_ENTITIES, so I think it's fair to assume that this flag disables the parsing of (external) entities.

But that is not the case:

$xml = '<!DOCTYPE root [<!ENTITY c PUBLIC "bar" "/etc/passwd">]>
<test>&c;</test>';
$dom = new DOMDocument();
$dom->loadXML($xml, LIBXML_NOENT);
echo $dom->textContent;

The result is that the contents of /etc/passwd are echoed. Without the LIBXML_NOENT argument this is not the case.

For non-external entities, the flag doesn't seem to have any effect. Example:

$xml = '<!DOCTYPE root [<!ENTITY c "TEST">]>
<test>&c;</test>';
$dom = new DOMDocument();
$dom->loadXML($xml);
echo $dom->textContent;

The result of this code is "TEST", with and without LIBXML_NOENT.

The flag doesn't seem to have any effect on pre-defined entities such as &lt;.

So my questions are:

  • What exactly does the LIBXML_NOENT flag do?
  • Why is it called LIBXML_NOENT? What is it short for, and wouldn't LIBXML_ENT or LIBXML_PARSE_EXTERNAL_ENTITIES be a better fit?
  • Is there a flag that actually prevents the parsing of all entities?
tim
  • It's [mapped to](https://github.com/php/php-src/blob/ef0279b640b19f6294a1429f9e04019b1f72d69c/ext/libxml/libxml.c#L801) the libxml constant `XML_PARSE_NOENT` if that gives you anything to search on. It is very vaguely described... – miken32 Aug 07 '16 at 23:23

1 Answer


Q: What exactly does the LIBXML_NOENT flag do?

The flag enables the substitution of XML entity references, whether the entity is internal or external.

Q: Why is it called LIBXML_NOENT? What is it short for, and wouldn't LIBXML_ENT or LIBXML_PARSE_EXTERNAL_ENTITIES be a better fit?

The name is indeed misleading. I think NOENT simply means that the node tree of the parsed document won't contain any entity reference nodes, because the parser substitutes each reference with the entity's replacement content. Without NOENT, the parser creates DOMEntityReference nodes for entity references.
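
The difference in node types can be checked directly. The following is a minimal sketch, assuming the same internal entity used elsewhere in this thread; the variable names are made up for illustration, and the class names in the comments are what the explanation above predicts, not a guarantee for every PHP or libxml2 version:

```php
<?php
$xml = '<!DOCTYPE test [<!ENTITY c "TEST">]>
<test>&c;</test>';

// Parse without LIBXML_NOENT: the reference &c; should stay in the tree.
$kept = new DOMDocument();
$kept->loadXML($xml);
$classWithoutFlag = get_class($kept->documentElement->firstChild);

// Parse with LIBXML_NOENT: the reference should be replaced by its text.
$substituted = new DOMDocument();
$substituted->loadXML($xml, LIBXML_NOENT);
$classWithFlag = get_class($substituted->documentElement->firstChild);

echo $classWithoutFlag, "\n"; // expected: DOMEntityReference
echo $classWithFlag, "\n";    // expected: DOMText
```

Inspecting the first child of the root element, not the root element itself, is what makes the entity reference node visible.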

Q: Is there a flag that actually prevents the parsing of all entities?

LIBXML_NOENT enables the substitution of all entity references. If you don't want entities to be expanded, simply omit the flag. For example:

$xml = '<!DOCTYPE test [<!ENTITY c "TEST">]>
<test>&c;</test>';
$dom = new DOMDocument();
$dom->loadXML($xml);
echo $dom->saveXML();

prints

<?xml version="1.0"?>
<!DOCTYPE test [
<!ENTITY c "TEST">
]>
<test>&c;</test>

It seems that textContent replaces entities on its own, which might be a peculiarity of the PHP bindings. Without LIBXML_NOENT, this leads to different behavior for internal and external entities, because external entities won't be loaded.
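
To see both views of the same document side by side, here is a small sketch, assuming an internal entity and no parser flags. The variable names are illustrative only. Passing the root element to saveXML() serializes just that node, without the XML declaration and DOCTYPE:

```php
<?php
$xml = '<!DOCTYPE test [<!ENTITY c "TEST">]>
<test>&c;</test>';

$dom = new DOMDocument();
$dom->loadXML($xml); // note: no LIBXML_NOENT

$root = $dom->documentElement;

// Serialization keeps the entity reference as written in the source.
$serialized = $dom->saveXML($root);

// textContent returns the text with the internal entity expanded.
$text = $root->textContent;

echo $serialized, "\n"; // <test>&c;</test>
echo $text, "\n";       // TEST
```

So the reference is still present in the tree even though textContent hides it.');
assert($text === 'TEST');
</test>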

nwellnhof
  • Thanks for your answer! In your answer to the third question, you mean `enables` instead of `disables`, right? And is there a way to access the DOM without entities being parsed? Because it's not just `textContent` that does it, it's also `$dom->getElementsByTagName('test')->item(0)->nodeValue`. If I do `print_r($dom->childNodes->item(1));` it also seems to always be parsed, there is no `DOMEntityReference` for internal entities. But for external entities, `LIBXML_NOENT` makes a difference here. The `saveXML` output is indeed different though, even for internal entities. – tim Aug 08 '16 at 13:52
  • @tim I fixed the answer to the third question. `nodeValue` and `textContent` are typically the same. To access the `DOMEntityReference` node, try `$dom->documentElement->childNodes->item(0)` or `$dom->documentElement->firstChild`. – nwellnhof Aug 08 '16 at 14:07