This question comes from the fact that many languages, both those written in the Latin alphabet and those that are not, have letters that, from what I've seen, were not really usable in URLs until recently and nearly always ended up as a long run of URL-encoded characters.

Recently, though, I've seen several sites using native letters in URLs (everywhere except the domain).

For example, something like this, using Spanish accented letters:

https://www.example.com/esta-es-una-frase-en-español
https://www.example.com/cómo-usar-acentos-y-la-letra-ñ-en-urls

Also, I've seen URLs like

https://www.example.com/урл-на-български

From what I remember, not so long ago one had to either encode accented characters or convert them to non-accented ones.

But now you can type this kind of URL into the browser and it causes no issues, and the letters appear as they should (not URL-encoded).

Is it safe to assume that now my URLs can handle these characters?

Also, does it make any difference to how Google indexes the URLs?

Mihail Minkov
  • I assume you are using UTF-8 encoding, not latin1, UTF-16, etc. – Rick James Aug 05 '20 at 20:08
  • I am specifically talking about URL characters typed literally into the browser's address bar. As far as I know, the encoding of those would depend on either the browser or the OS. – Mihail Minkov Aug 05 '20 at 21:24

1 Answer

URIs/URLs, as defined by RFC 3986 "Uniform Resource Identifier (URI): Generic Syntax", do not allow unencoded non-ASCII characters. Such characters must be charset-encoded (usually to UTF-8), and the resulting octets are then percent-encoded. If a browser is given a URL with unencoded Unicode characters in it, the browser will typically percent-encode it properly behind the scenes when transmitting it to a web server. You can verify this with your browser's built-in debugger (if it has one) or an HTTP(S) sniffer.
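
To make the two-step encoding concrete, here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The path is borrowed from the question, and keeping only `/` unescaped is an assumption made for this example; a browser may make slightly different choices for other characters.

```python
from urllib.parse import quote

# Path taken from the question; any non-ASCII path would do.
path = "/cómo-usar-acentos-y-la-letra-ñ-en-urls"

# Step 1: the characters are encoded as UTF-8 bytes.
utf8_bytes = path.encode("utf-8")

# Step 2: every byte outside the allowed ASCII set is written as %XX.
# quote() does both steps for str input; "/" is kept as a path separator.
encoded = quote(path, safe="/")

print(utf8_bytes)
print(encoded)  # /c%C3%B3mo-usar-acentos-y-la-letra-%C3%B1-en-urls
```

Here `ó` becomes the two UTF-8 bytes `C3 B3` and `ñ` becomes `C3 B1`, which is the form you should see in the request line when inspecting the network traffic.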

IRIs, as defined by RFC 3987 "Internationalized Resource Identifiers (IRIs)", do allow unencoded Unicode characters. IRIs are not in widespread use yet; however, they can maintain backward compatibility by mapping to and from encoded URIs/URLs. It is possible that your browser is treating the content of the address bar as an IRI and converting it to and from a URI/URL internally as needed.
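
A rough Python sketch of that mapping, for illustration only. The helper names `iri_to_uri` and `uri_to_iri` are hypothetical, and the code assumes the host is already ASCII (an internationalized domain would need separate IDNA handling) and only touches the path component. It is not a full RFC 3987 implementation.

```python
from urllib.parse import urlsplit, urlunsplit, quote, unquote


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters in the path of an IRI.

    Hypothetical helper: assumes an ASCII host and leaves query and
    fragment untouched.
    """
    parts = urlsplit(iri)
    # "%" is kept safe so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def uri_to_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 in the path back to readable text.

    Hypothetical helper: it also decodes reserved characters such as %2F,
    which a careful implementation would leave encoded.
    """
    parts = urlsplit(uri)
    path = unquote(parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


uri = iri_to_uri("https://www.example.com/señor")
print(uri)              # https://www.example.com/se%C3%B1or
print(uri_to_iri(uri))  # https://www.example.com/señor
```

The first direction is roughly what happens before a request is sent; the second is what an address bar might do when it shows the readable form again.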

Remy Lebeau
  • OK, so is there a compatibility problem, or can I use Unicode characters freely? I did check the network request, and there the URL appears URL-encoded. – Mihail Minkov Aug 05 '20 at 22:51
  • You can use Unicode characters in a browser address bar, they will likely get encoded for you. But you can't use *unencoded* Unicode characters in your own URLs, no. – Remy Lebeau Aug 05 '20 at 23:59
  • I do not completely understand what you mean by this. Are you saying that if I type `https://www.example.com/señor` it would work, but if I create an HTML link with the same URL it won't? – Mihail Minkov Aug 06 '20 at 00:21
  • @MihailMinkov yes, that is what I'm saying. – Remy Lebeau Aug 06 '20 at 00:32
  • But the thing is that I AM generating links, so the link is something like `Spanish Link`, and when I click on it, the route I have defined for it works. I am using CodeIgniter 4. – Mihail Minkov Aug 06 '20 at 03:33
  • It is "legal", i.e. it is allowed. Many Wikipedia pages have such URLs and use such anchors. Browsers will simply escape the URL for HTTP, and often they do the reverse as well. There is surely a W3C document about that too. – Giacomo Catenazzi Aug 06 '20 at 06:57
  • In fact: https://www.w3.org/International/articles/idn-and-iri/ – Giacomo Catenazzi Aug 06 '20 at 06:59
  • So, @GiacomoCatenazzi supposedly it is usable and compatible? Is there a way to check compatibility? Something like caniuse.com? I checked there, but it doesn't appear. – Mihail Minkov Aug 06 '20 at 14:12
  • @MihailMinkov: That document tells us "The conversion process for parts of the IRI relating to the path is already supported natively in the latest versions of IE7, Firefox, Opera, Safari and Google Chrome.". The server side may have more problems. Wikipedia has been using it for a long time (so I assume it is safe). – Giacomo Catenazzi Aug 06 '20 at 14:34
  • Could you post this as an answer @GiacomoCatenazzi ? – Mihail Minkov Aug 06 '20 at 16:09