5

Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.

Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'

Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
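
A minimal sketch of that workaround, assuming the markup looks like the demo output further down; the ID `12345` is a made-up value used only for illustration:

```php
<?php
// Markup as saveHTML() returned it in the demo: brackets escaped, @ left alone.
$saved = '<a href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc">Google</a>';

// urlencode() alone yields %5B%40MERGEID%5D, which does not occur in $saved.
$plainNeedle = urlencode('[@MERGEID]');

// Undo the %40 so the needle matches what saveHTML() produced.
$needle = str_replace('%40', '@', $plainNeedle);

// 12345 is a hypothetical ID used only for this example.
$result = str_replace($needle, '12345', $saved);

echo $result, PHP_EOL;
```

The extra str_replace() is the only difference between a needle that matches and one that does not.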

My question is, what RFC spec is DOMDocument using, and why is it different than urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?

Demo code:

$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); //Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));       
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {    
    $link = $element->getAttribute('href'); //http://www.google.com?ref=abc
    $tag = $element->getAttribute('data-tag'); //thebottomlink
    if ($link) {
        $newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
        if ($tag) {
            $newlink .= '&tag=' . $tag;
        } 
        $element->setAttribute('href', $newlink);
    }
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge); 
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
Luke Shaheen
  • This is of interest to me as well. Have you tried using utf8_encode/decode or iconv per the manual? – Kevin_Kinsey Dec 04 '14 at 20:03
  • @Kkinsey Run utf8_encode/decode or iconv on what? – Luke Shaheen Dec 04 '14 at 20:05
  • Nevermind. Mistake on my part, I think. I'll look again. – Kevin_Kinsey Dec 04 '14 at 21:01
  • To paraphrase the question: Why is the character `@` not percent-encoded in the value of a `DOMAttribute` node when using `DOMDocument::saveHTML`? – Alf Eaton Dec 10 '14 at 16:35
  • Would it not make sense to just urlencode the original [@mergeid] when saving it in the first place as well? Your search should then match without the need for the str_replace? $newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link; – Gavin Simpson Dec 11 '14 at 21:09
  • @GavinSimpson The problem is that the code being passed in, `$message`, is a user-generated template. So they can write their own template, with their own code. – Luke Shaheen Dec 13 '14 at 15:04
  • First off thanks for the tip, and making me feel stupid once again :) '$dom_document->loadHTML(utf8_decode(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8')));' will leave both outputs as '%5B%40MERGEID%5D'. Would that help? – Gavin Simpson Dec 13 '14 at 16:36

5 Answers

5

I believe that those two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while a value set with $element->setAttribute('href', $newlink); is encoded as a complete URL, to be used as a URL, when the document is serialized with saveHTML().

For example:

urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com

This is convenient for encoding the query part, but it cannot be used on <a href='...'>.

However:

$element->setAttribute('href', $newlink); // -> http://www.google.com

will properly encode the string so that it is still usable in href. The @ is left unencoded because the encoder cannot tell whether it is part of the query or part of the userinfo or an email address (for example: mailto:invisal@google.com or invisal@127.0.0.1).
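
The behaviour can be approximated in plain PHP. The sketch below assumes that saveHTML() relies on libxml2's URI escaping for URL-type attributes, which leaves the characters `@/:=?;#%&,+` and the unreserved marks `!*'()` untouched; `escapeLikeSaveHtml` is a hypothetical helper name, not part of any library, and non-ASCII input and other edge cases are not covered.

```php
<?php
// Hypothetical helper: approximates how saveHTML() escapes an href value.
function escapeLikeSaveHtml(string $url): string
{
    // Characters assumed to be left as-is by the serializer (see lead-in).
    $keep = "@/:=?;#%&,+!*'()";

    // Map each kept character's percent-encoded form back to the character.
    $restore = [];
    foreach (str_split($keep) as $char) {
        $restore[rawurlencode($char)] = $char;
    }

    // Encode everything, then restore the kept characters in a single pass.
    return strtr(rawurlencode($url), $restore);
}

echo escapeLikeSaveHtml('[@MERGEID]'), PHP_EOL; // %5B@MERGEID%5D
echo urlencode('[@MERGEID]'), PHP_EOL;          // %5B%40MERGEID%5D
```

The first line matches what the question observed in the saved markup; the second is what urlencode() produces, which is why the search fails.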


Solution

  1. Instead of using [@MERGEID], you can use @@MERGEID@@. Then, you replace that with your ID later. This solution does not require you to even use urlencode.

  2. If you insist on using urlencode, you can just use %40 instead of @. So, your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;

  3. You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;

RandomSeed
invisal
  • How does this answer the question? – Phil Dec 08 '14 at 02:15
  • @Phil, he asked why those two produce different result when encode `[@MERGEID]` – invisal Dec 08 '14 at 02:16
  • @invisal I understand what you're saying. So, given that explanation, there isn't really a way around having to run my extra `str_replace`? – Luke Shaheen Dec 08 '14 at 13:34
  • @John, YES and NO, check my solution section. – invisal Dec 09 '14 at 02:27
  • 1, 2, and 3 are all possible solutions for the problem posed in my question, thanks! Unfortunately none of them will work for my own real world problem - the merge variables are inserted by a front-end user, and have been for some time now, so a change from what they know isn't totally possible right now. 2 & 3 won't work because many of the links are already created inside of `$message` so I don't have access to split them up like I do `$newlink`. – Luke Shaheen Dec 09 '14 at 13:27
3

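urlencode() follows the application/x-www-form-urlencoded convention, which goes back to RFC 1738, and rawurlencode() also followed RFC 1738 before PHP 5.3. However, since 2005 the current standard for URIs has been RFC 3986, which rawurlencode() follows as of PHP 5.3.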

On the other hand, the DOM extension uses UTF-8 encoding, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with text in ISO-8859-1 encoding, or iconv for other encodings.

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.
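
A short illustration of that rule using PHP's built-in functions; the input string is an arbitrary example, and the non-ASCII character is written as an escape so that its bytes are UTF-8 regardless of the file's encoding:

```php
<?php
// "café [@x]~": é is U+00E9, which is the two bytes C3 A9 in UTF-8.
$input = "caf\u{E9} [@x]~";

// RFC 3986 style: unreserved characters (letters, digits, - _ . ~) pass through.
$raw = rawurlencode($input);

// Form-encoding style: space becomes + and ~ is percent-encoded.
$form = urlencode($input);

echo $raw, PHP_EOL;  // caf%C3%A9%20%5B%40x%5D~
echo $form, PHP_EOL; // caf%C3%A9+%5B%40x%5D%7E
```

Both functions percent-encode @, so neither one reproduces the %5B@MERGEID%5D form seen in the saved markup.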

Here is a function to decode URLs according to RFC 3986.

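<?php
    // Note: despite its name, this function decodes. urldecode() runs first,
    // so the str_replace() below only affects sequences that were
    // double-encoded in the input (for example %2540 -> %40 -> @).
    function myUrlEncode($string) {
       $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
       $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
       return str_replace($entities, $replacements, urldecode($string));
    }
?>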

PHP Fiddle.


Update:

Since UTF8 has been used to encode $message:

$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))

Use urldecode($message) when returning the URL without percents.

die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge); 
Community
carlodurso
2

The root cause of your problem has been very well explained from a technical point of view.

In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.

By processing your input $message through a DOMDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a single plain string something that has been "promoted" to an HTML stream.

Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:

$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';

$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document

// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);

// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);

echo $dom->saveHTML();

As illustrated above, today we are trying to fix the problem when your token is inside an href, but one day we may want to search and replace the token elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.

(an alternative option would be not loading a DomDocument until all low-level replacements are done, but I am guessing this is not practical)


Complete proof of concept:

function searchAndReplace(DOMNode $node, $search, $replace) {
    if($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $input = $attribute->nodeValue;
            $output = str_replace($search, $replace, $input);
            $attribute->nodeValue = $output;
        }
    }

    if(!$node instanceof DOMElement) { // this test needs double-checking
        $input = $node->nodeValue;
        $output = str_replace($search, $replace, $input);
        $node->nodeValue = $output;
    }

    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            searchAndReplace($child, $search, $replace);
        }
    }
}

$token = '<>&;[@MERGEID]';
$message = '<a/>';

$dom = new DOMDocument();
$dom->loadHTML($message);

$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo@$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);

echo $dom->saveHTML();

searchAndReplace($dom, $token, '*replaced*');

echo $dom->saveHTML();
Community
RandomSeed
0

If you use saveXML() it won't mess with the encoding the way saveHTML() does:

PHP

//your code...
$message = $dom_document->saveXML();

EDIT: also remove the XML tag:

//this will add an xml tag, so just remove it
$message=preg_replace("/\<\?xml(.*?)\?\>/","",$message);

echo $message;

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a href="http://www.example.com/click/[@MERGEID]?url=http://www.google.com?ref=abc&amp;tag=thebottomlink" data-tag="thebottomlink">Google</a></body></html>

Notice that both still correctly convert & to &amp;
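
As a possible alternative to stripping the XML tag with a regex, saveXML() also accepts a node and then serializes only that node, without the XML declaration. The sketch below is a minimal illustration under that assumption; the sample link is made up, and XML serialization writes empty elements in self-closing form, which may not suit every HTML template.

```php
<?php
libxml_use_internal_errors(true); // suppress parser warnings on fragments

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://www.example.com/click/[@MERGEID]?url=x&amp;tag=y">Google</a>');

// loadHTML() wraps the fragment in <html><body>; serialize only the body's children.
$body = $dom->getElementsByTagName('body')->item(0);
$fragment = '';
foreach ($body->childNodes as $child) {
    // Passing a node outputs just that node, with no <?xml ...> declaration.
    $fragment .= $dom->saveXML($child);
}

echo $fragment, PHP_EOL;
```

This also drops the DOCTYPE and the html/body wrappers, so the result is closer to the original fragment.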

chiliNUT
0

Would it not make sense to just urlencode the original [@mergeid] when saving it in the first place as well? Your search should then match without the need for the str_replace?

$newlink = 'http://www.example.com/click/'.urlencode('[@MERGEID]').'?url=' . $link;

I know this does not answer the first post of the question, but you cannot post code in comments as far as I can tell.

Gavin Simpson
  • You can post code in the comments - when adding a comment, off the bottom right corner of the textarea is [a "help" link](http://stackoverflow.com/editing-help#comment-formatting) :) Simply use a backtick. – Luke Shaheen Dec 13 '14 at 15:05