I am writing PHP code that generates HTML containing links to documents via their DOI. The links should point to https://doi.org/ followed by the DOI of the document.

As the result is a URL, I thought I could simply use WordPress's esc_url() function, as in

echo '<a href="' . esc_url('https://doi.org/' . $doi) . '">' . esc_url('https://doi.org/' . $doi) . '</a>';

as this is what one is supposed to use in text nodes, attribute nodes or anywhere else. Unfortunately, things apparently aren't that easy...

The problem is that DOIs can contain all sorts of special characters that are apparently not handled correctly by esc_url(). A nice example of such a DOI is

10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P

which is supposed to link to

https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P

With $doi equal to this DOI, however, the above code produces a link that is displayed as, and points to, https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5493::AID-PROP4933.0.CO;2-P, with the < and > silently dropped.

This leads me to the question: if esc_url() is obviously not a one-size-fits-all, no-brainer solution for escaping URLs, then what should I use? For this case I can get the result I want with

esc_url(htmlspecialchars('https://doi.org/' . $doi))

but is this really the right way™ of doing it? Does it have any unwanted side effects? If not, then why does esc_url() not also escape < and >? Would esc_html() be better than htmlspecialchars()? If so, should I nest it inside an esc_url() call?

I am aware that there are many articles on escaping URLs in PHP on Stack Overflow, but I couldn't find one that addresses the issue of < and > signs.

cgogolin
  • When you [check the source code](https://core.trac.wordpress.org/browser/tags/4.8/src/wp-includes/formatting.php#L3775), you see that this function removes every character from the URL that matches the negated character class `[^a-z0-9-~+_.?#=!&;,/:%@$\|*\'()\[\]\\x80-\\xff]`, i.e. anything outside that allowed set. (Introducing characters such as `<` and `>` into identifiers that are eventually supposed to become part of an HTTP URL is what I would call a rather dipshit decision on the part of the DOI people to begin with, though ...) – CBroe Aug 17 '17 at 10:52
  • I fully agree on the "dipshit decision" part :). – cgogolin Aug 17 '17 at 11:13
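
The stripping step described in the first comment can be reproduced outside WordPress with a single preg_replace() call. The following is only a sketch of that one step, not of esc_url() as a whole: the function name strip_disallowed_url_chars() is made up for illustration, and the pattern is a transcription of the character class quoted in the comment (rewritten with `/` delimiters), so it is an assumption that it behaves exactly like the WordPress original.

```php
<?php
// Hypothetical helper: mimics only the character-stripping step of
// WordPress's esc_url(), using the character class quoted in the comment.
function strip_disallowed_url_chars(string $url): string
{
    // Remove every byte that is NOT in the allowed set (case-insensitive).
    return preg_replace(
        '/[^a-z0-9\-~+_.?#=!&;,\/:%@$|*\'()\[\]\x80-\xff]/i',
        '',
        $url
    );
}

$doi = '10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P';

// "<" and ">" are outside the allowed set, so they are dropped.
echo strip_disallowed_url_chars('https://doi.org/' . $doi), "\n";
// prints https://doi.org/10.1002/(SICI)1521-3978(199806)46:4/5493::AID-PROP4933.0.CO;2-P
```

This matches the broken link reported in the question: nothing is escaped, the two angle brackets are simply deleted.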

1 Answer

I'm no PHP expert, but I do know about DOIs, and SICIs can be really annoying.

URL encoding and HTML encoding are separate things, so it makes sense to think about them separately. You must escape the angle brackets to produce correct HTML. As for URL-escaping, you should do that too, because other characters can also break URLs (such as the # character, which pops up from time to time).

So I would recommend:

'https://doi.org/' . htmlspecialchars(urlencode($doi))

Which will give you:

<a href="https://doi.org/10.1002%2F%28SICI%291521-3978%28199806%2946%3A4%2F5%3C493%3A%3AAID-PROP493%3E3.0.CO%3B2-P">Click here</a>

Note the order of function application, and the fact that you don't want to encode the https://doi.org resolver!
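
The recommendation above can be sketched as a small helper. This is a minimal sketch in plain PHP, not a definitive implementation: doi_link() is a hypothetical name, rawurlencode() is swapped in for urlencode() (the two give the same result for this DOI and differ only for characters such as spaces), and HTML-escaping the visible link text without percent-encoding it is an assumption about how the DOI should be displayed.

```php
<?php
// Hypothetical helper: builds an HTML link to the doi.org resolver.
function doi_link(string $doi): string
{
    // 1. Percent-encode the DOI only; the resolver prefix stays untouched.
    $href = 'https://doi.org/' . rawurlencode($doi);

    // 2. HTML-escape separately for the attribute value and the visible text.
    $attr = htmlspecialchars($href, ENT_QUOTES, 'UTF-8');
    $text = htmlspecialchars('https://doi.org/' . $doi, ENT_QUOTES, 'UTF-8');

    return '<a href="' . $attr . '">' . $text . '</a>';
}

echo doi_link('10.1002/(SICI)1521-3978(199806)46:4/5<493::AID-PROP493>3.0.CO;2-P'), "\n";
```

The href comes out percent-encoded as in the example above, while the link text keeps the readable DOI with `<` and `>` rendered as `&lt;` and `&gt;`.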

As for the "dipshit decision" comment above: it's certainly inconvenient. But SICIs were around before DOIs, and it's one of those annoying things we've had to live with ever since!

Joe
  • What is the argument for `urlencode()` over `rawurlencode()`, and why is `htmlspecialchars()` necessary at all? Doesn't the output of `urlencode()` only contain alphanumeric characters and + and - signs? – cgogolin Aug 18 '17 at 12:03
  • Oh, it looks like `urlencode` produces HTML-safe characters, so `htmlspecialchars` isn't necessary. However, the documentation suggests that you do both: http://php.net/manual/en/function.urlencode.php – Joe Aug 18 '17 at 15:00
  • As for `urlencode` vs `rawurlencode`, from the PHP docs it looks like the differences are the treatment of spaces (`+` vs `%20`) and of the `~` character. However, the two pages don't substantively reference each other. – Joe Aug 18 '17 at 15:03
  • What about the encoding of the DOI in the link text (between `<a>` and `</a>`)? – Sybille Peters Apr 26 '23 at 15:12
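
On the `urlencode()` versus `rawurlencode()` question raised in the comments, a quick check makes the difference concrete. The input string is made up for illustration, and the behaviour shown assumes PHP 5.3 or later, where `rawurlencode()` leaves the tilde alone.

```php
<?php
// A made-up input containing a space and a tilde, the two characters
// on which the functions disagree (PHP 5.3 or later).
$sample = 'a b~c';

echo urlencode($sample), "\n";     // prints a+b%7Ec
echo rawurlencode($sample), "\n";  // prints a%20b~c
```

For a DOI placed in the path of a URL, `%20` is the safer encoding of a space, since `+` only means "space" in form-encoded query strings.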