DOMDocument->saveHTML() vs urlencode with commercial at symbol (@)

Question

Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.

Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'

Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
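As a minimal sketch of that workaround, the needle can be built in one place. The helper name mergeNeedle and the sample ID 12345 are made up for illustration; the sample markup reuses the href form shown in the output further down.

```php
<?php
// Hypothetical helper: build the search needle in the form saveHTML() produced
// in this question, i.e. percent-encoded but with "@" left literal.
function mergeNeedle(string $token): string
{
    return str_replace('%40', '@', urlencode($token));
}

// Sample markup using the href form shown in the question's output.
$html = '<a href="http://www.example.com/click/%5B@MERGEID%5D?url=x">Google</a>';

// Swap the encoded merge variable for an ID (12345 is a placeholder value).
$result = str_replace(mergeNeedle('[@MERGEID]'), '12345', $html);
echo $result, PHP_EOL;
```

This keeps the extra str_replace, but confines it to a single function.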

My question is, what RFC spec is DOMDocument using, and why is it different than urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?

Demo code:

$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));       
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {    
    $link = $element->getAttribute('href'); //http://www.google.com?ref=abc
    $tag = $element->getAttribute('data-tag'); //thebottomlink
    if ($link) {
        $newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
        if ($tag) {
            $newlink .= '&tag=' . $tag;
        } 
        $element->setAttribute('href', $newlink);
    }
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge); 
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D

This is of interest to me as well. Have you tried using utf8_encode/decode or iconv per the manual? — Kevin_Kinsey, Dec 04 '14 at 20:03
To paraphrase the question: Why is the character `@` not percent-encoded in the value of a `DOMAttribute` node when using `DOMDocument::saveHTML`? — Alf Eaton, Dec 10 '14 at 16:35
Would it not make sense to just urlencode the original [@mergeid] when saving it in the first place as well? Your search should then match without the need for the str_replace? $newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link; — Gavin Simpson, Dec 11 '14 at 21:09
@GavinSimpson The problem is that the code being passed in, `$message`, is a user-generated template. So they can write their own template, with their own code. — Luke Shaheen, Dec 13 '14 at 15:04
First off, thanks for the tip, and for making me feel stupid once again :) '$dom_document->loadHTML(utf8_decode(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8')));' will leave both outputs as '%5B%40MERGEID%5D'. Would that help? — Gavin Simpson, Dec 13 '14 at 16:36

score 5 · Answer 1 · edited Dec 14 '14 at 12:04

I believe that those two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while $element->setAttribute('href', $newlink); encodes a complete URL so that it can be used as a URL.

For example:

urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com

This is convenient for encoding the query part, but it cannot be used on <a href='...'>.

However:

$element->setAttribute('href', $newlink); // -> http://www.google.com

will properly encode the string so that it is still usable in href. The reason it does not encode @ is that it cannot tell whether @ is part of the query, part of the userinfo, or part of an email URL (for example: mailto:invisal@google.com or invisal@127.0.0.1).
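To illustrate the difference in purpose, here is a small sketch that encodes a target URL once as a whole and once as the value of a query parameter. It reuses the example URLs from the question; the variable names and the simplified /click path are only illustrative.

```php
<?php
$target = 'http://www.google.com?ref=abc';

// Encoding the whole URL also escapes the scheme and path separators,
// so the result is no longer usable as an href on its own.
$whole = urlencode($target);
echo $whole, PHP_EOL; // http%3A%2F%2Fwww.google.com%3Fref%3Dabc

// Encoding only the parameter value keeps the outer URL intact.
$href = 'http://www.example.com/click?url=' . urlencode($target);

// Round trip: the query parameter decodes back to the original target.
parse_str(parse_url($href, PHP_URL_QUERY), $query);
echo $query['url'], PHP_EOL; // http://www.google.com?ref=abc
```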

Solution

1. Instead of using [@MERGEID], you can use @@MERGEID@@. Then, you replace that with your ID later. This solution does not even require you to use urlencode.
2. If you insist on using urlencode, you can just use %40 instead of @. So, your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
3. You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
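The @@MERGEID@@ option can be sketched as follows. This is only an illustration under assumptions: the sample markup and the ID 42 are made up, and it relies on the placeholder containing no characters that saveHTML() percent-encodes, as described above.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress content errors, as in the question
$dom->loadHTML('<a href="http://www.google.com?ref=abc">Google</a>');

// Rewrite each link to a click-tracking URL carrying the placeholder.
foreach ($dom->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', 'http://www.example.com/click/@@MERGEID@@');
}

$html = $dom->saveHTML();

// Later: a plain search and replace, with no urlencode() on the needle.
$final = str_replace('@@MERGEID@@', '42', $html);
echo $final, PHP_EOL;
```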

answered Dec 08 '14 at 02:09 by invisal · edited Dec 14 '14 at 12:04 by RandomSeed

How does this answer the question? – Phil Dec 08 '14 at 02:15
@Phil, he asked why those two produce different result when encode `[@MERGEID]` – invisal Dec 08 '14 at 02:16
@invisal I understand what you're saying. So, given that explanation, there isn't really a way around having to run my extra `str_replace`? – Luke Shaheen Dec 08 '14 at 13:34
@John, YES and NO, check my solution section. – invisal Dec 09 '14 at 02:27
1, 2, and 3 are all possible solutions for the problem posed in my question, thanks! Unfortunately none of them will work for my own real world problem - the merge variables are inserted by a front-end user, and have been for some time now, so a change from what they know isn't totally possible right now. 2 & 3 won't work because many of the links are already created inside of `$message` so I don't have access to split them up like I do `$newlink`. – Luke Shaheen Dec 09 '14 at 13:27

score 3 · Answer 2 · edited Oct 07 '21 at 06:01

The urlencode and rawurlencode functions are mostly based on RFC 1738. However, since 2005 the current standard for URIs has been RFC 3986.

On the other hand, the DOM extension uses UTF-8 encoding, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with text in ISO-8859-1 encoding, or iconv for other encodings.

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.
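As a rough illustration of that rule, rawurlencode() can be used to see which characters pass through and which are percent-encoded byte by byte. This is a sketch assuming PHP 7 or later (for the "\u{...}" escape); the sample strings are arbitrary.

```php
<?php
// Unreserved characters (letters, digits, "-", "_", ".", "~") pass through.
echo rawurlencode('abc-_.~'), PHP_EOL;       // abc-_.~

// Other characters are percent-encoded, including "@", "[" and "]".
echo rawurlencode('[@MERGEID]'), PHP_EOL;    // %5B%40MERGEID%5D

// A non-ASCII character is encoded as its UTF-8 bytes, one %XX per byte.
echo rawurlencode("\u{00E9}"), PHP_EOL;      // %C3%A9

// urlencode() differs for spaces: it writes "+" instead of "%20".
echo rawurlencode('a b'), ' ', urlencode('a b'), PHP_EOL; // a%20b a+b
```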

Here is a function to decode URLs according to RFC 3986.

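<?php
    // Decodes a percent-encoded string.
    // urldecode() converts every %XX sequence (and turns "+" into a space);
    // the str_replace() pass then decodes any reserved-character escapes that
    // are still present afterwards, for example in double-encoded input.
    function myUrlDecode($string) {
       $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
       $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
       return str_replace($entities, $replacements, urldecode($string));
    }
?>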

PHP Fiddle.

Update:

Since UTF8 has been used to encode $message:

$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))

Use urldecode($message) when returning the URL without percents.

die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);

Why haven't `urlencode` and `rawurlencode` been updated to use the url standard RFC? — Luke Shaheen, Dec 08 '14 at 13:27
It seems there were other accepted ways of doing URL encoding in the past. — carlodurso, Dec 08 '14 at 15:08
But, is running `urldecode` any different in resources or time than running `str_replace`? — Luke Shaheen, Dec 09 '14 at 13:19
It is my understanding that under the hood they both use regular expressions. A quick test shows no relevant difference. — carlodurso, Dec 09 '14 at 14:58

score 2 · Answer 3 · edited May 23 '17 at 12:22

The root cause of your problem has been very well explained from a technical point of view.

In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.

By processing your input $message through a DomDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a single plain string something that has been "promoted" to an HTML stream.

Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:

$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';

$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document

// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);

// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);

echo $dom->saveHTML();

As illustrated above, today we are trying to fix the problem when your token is inside an href, but one day we may want to search and replace the token elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.

(An alternative option would be not to load a DomDocument until all low-level replacements are done, but I am guessing this is not practical.)
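For completeness, that alternative could look roughly like this. It is a sketch only: the sample markup and the ID 12345 are invented, and it assumes the merge ID is already known before the DOM work starts, which may not hold in the asker's situation.

```php
<?php
$message = '<a href="http://www.example.com/click/[@MERGEID]">Google</a>';

// Do the low-level replacement while $message is still a plain string.
$message = str_replace('[@MERGEID]', '12345', $message);

// Only then hand the markup over to DOMDocument.
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress content errors
$dom->loadHTML($message);

$html = $dom->saveHTML();
echo $html, PHP_EOL;
```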

Complete proof of concept:

function searchAndReplace(DOMNode $node, $search, $replace) {
    if($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $input = $attribute->nodeValue;
            $output = str_replace($search, $replace, $input);
            $attribute->nodeValue = $output;
        }
    }

    if(!$node instanceof DOMElement) { // this test needs double-checking
        $input = $node->nodeValue;
        $output = str_replace($search, $replace, $input);
        $node->nodeValue = $output;
    }

    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            searchAndReplace($child, $search, $replace);
        }
    }
}

$token = '<>&;[@MERGEID]';
$message = '<a/>';

$dom = new DOMDocument();
$dom->loadHTML($message);

$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo@$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);

echo $dom->saveHTML();

searchAndReplace($dom, $token, '*replaced*');

echo $dom->saveHTML();

score 0 · Answer 4 · edited Dec 09 '14 at 16:02

If you use saveXML() it won't mess with the encoding the way saveHTML() does:

//your code...
$message = $dom_document->saveXML();

EDIT: also remove the XML tag:

//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);

echo $message;

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="http://www.example.com/click/[@MERGEID]?url=http://www.google.com?ref=abc&amp;tag=thebottomlink" data-tag="thebottomlink">Google</a></body></html>

Notice that both still correctly convert & to &amp;
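A possible variation, offered as a sketch rather than a drop-in fix: saveXML() accepts a node argument, and serializing a single node that way emits no XML declaration, so the preg_replace step is not needed. The sample markup is made up, and the result is still XML-style serialization of that one node, not a full HTML document.

```php
<?php
$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress content errors
$dom->loadHTML('<a href="http://www.example.com/click/[@MERGEID]?tag=x">Google</a>');

// Passing a node serializes only that node, without an XML declaration.
$link = $dom->getElementsByTagName('a')->item(0);
$fragment = $dom->saveXML($link);
echo $fragment, PHP_EOL;
```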

answered Dec 08 '14 at 03:03 by chiliNUT · edited Dec 09 '14 at 16:02

Since it's actually an HTML document, what ramifications will running `saveXML` have? – Luke Shaheen Dec 08 '14 at 13:28
saveXML will output an xhtml document, and will also stick the ` – chiliNUT Dec 08 '14 at 14:35
Sounds like it's not really a solution then hot an HTML document – Luke Shaheen Dec 09 '14 at 13:20
i don't know what you mean – chiliNUT Dec 09 '14 at 15:34
then for a HTML document* - sorry, typo. Saving HTML as XML, and adding XML declaration tags isn't valid at all - put your output through the [w3c validator](http://validator.w3.org/check) and look at the output. So, this isn't a solution at all. – Luke Shaheen Dec 09 '14 at 15:45

score 0 · Answer 5 · answered Dec 11 '14 at 21:10

Would it not make sense to just urlencode the original [@mergeid] when saving it in the first place as well? Your search should then match without the need for the str_replace?

$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;

I know this does not answer the first post of the question, but you cannot post code in comments as far as I can tell.

answered Dec 11 '14 at 21:10 by Gavin Simpson

You can post code in the comments - when adding a comment, off the bottom right corner of the textarea is [a "help" link](http://stackoverflow.com/editing-help#comment-formatting) :) Simply use a backtick. – Luke Shaheen Dec 13 '14 at 15:05
