9

I've a program which generates XML sitemaps for Google Webmaster Tools (among other things).
GWT is giving me errors for some sitemaps because the URLs contain character sequences like ã¾, ã‹, ã€, etc.

Sitemap specification says:

Your Sitemap file must be UTF-8 encoded (you can generally do this when you save the file). As with all XML files, any data values (including URLs) must use entity escape codes for the characters listed: &, ', ", <, >.

The special characters are escaped in the XML files (with HTML entities). XML file snippet:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://domain/folder/listing-&#227;&#129;.shtml</loc>
        ...

Are my URLs UTF-8 encoded? If not, how do I do this in Java? The following is the line in my program where I add the URL to the sitemap:

    siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase()));

I'm not sure which ones are causing the error, probably the first two examples.

Braiam
Adam Lynch
  • Open your sitemap XML files in an editor that supports UTF-8 encoding (like Notepad++) for a quick test to determine whether your files are saved in the correct encoding. – Vineet Reynolds May 23 '11 at 11:40
  • @Vineet Done. Not certain where to look to see if the URLs are correctly UTF-8 encoded. I've supplied a snippet of the XML file. It looks like the characters have been escaped (with HTML entities). – Adam Lynch May 23 '11 at 11:50
  • The Encoding menu in Notepad++ will allow you to view the current encoding used. You could change the encoding of the file, but that is not the point; use the suggested approach to specify the encoding for the URL. Additionally, ensure that you write the sitemap file using UTF-8 encoding (when you use the FileOutputStream class or a different class). – Vineet Reynolds May 23 '11 at 12:01
  • I don't really understand your question. It seems as though you haven't HTML escaped your data (regardless of using UTF-8). Are you escaping or not? – Assaf Lavie May 23 '11 at 11:32
  • I edited the question a lot. – Adam Lynch May 23 '11 at 11:33
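On the point about writing the sitemap file as UTF-8: a minimal sketch, assuming the sitemap XML has already been built as a `String`. The class and method names (`Utf8SitemapWriter`, `write`) are hypothetical and are not part of the asker's program. The only point is that the charset is passed explicitly instead of relying on the platform default.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8SitemapWriter {
    private Utf8SitemapWriter() {}

    /**
     * Writes the XML text to the given stream as UTF-8 and closes the stream.
     * Without the explicit charset, OutputStreamWriter would fall back to the
     * platform default encoding, which may not be UTF-8.
     */
    static void write(OutputStream sink, String xml) {
        try (Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            out.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To write to disk, a `new FileOutputStream(file)` could be passed as the `sink` argument.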

4 Answers

17

Try using URLEncoder.encode(stringToBeEncoded, "UTF-8") to encode the url.
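A small sketch of what this call actually produces, which may help when judging the comments below. The wrapper class `FormEncodingDemo` and its method `formEncode` are made up for the example; only `URLEncoder.encode(String, String)` is the real API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper, for illustration only.
final class FormEncodingDemo {
    private FormEncodingDemo() {}

    /** Applies application/x-www-form-urlencoded encoding using UTF-8. */
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII characters become percent-encoded UTF-8 bytes.
        System.out.println(formEncode("\u00e3"));   // %C3%A3
        // A space becomes '+', and a slash becomes %2F.
        System.out.println(formEncode("a b/c"));    // a+b%2Fc
    }
}
```

The second call shows why passing a whole path through this method changes its structure: the path separator itself gets encoded.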

Jai
  • 4
    This will `application/x-www-form-urlencoded` encode the string. This is generally only acceptable for parameters used in the query part. It would not encode the path part segments correctly, for example. – McDowell May 23 '11 at 11:45
  • How sure are you this will work? Are you suggesting I change the line to `siteMap.addUrl(StringEscapeUtils.escapeXml(URLEncoder.encode(countryName+"/"+twoCharFile.getRelativeFileName().toLowerCase(), "UTF-8")));`? – Adam Lynch May 23 '11 at 11:46
  • @Adam - no, you can't just pass a path part through this method - forward slashes will be encoded and spaces will be encoded incorrectly. This method is only useful for URIs when encoding query parameters for servers that expect them. – McDowell May 23 '11 at 11:58
  • @McDowell hmm ok so `siteMap.addUrl(StringEscapeUtils.escapeXml(countryName+"/"+URLEncoder.encode(twoCharFile, "UTF-8").getRelativeFileName().toLowerCase()));` would be correct I take it? (`twoCharFile` would be the `ã¾` for example)
  • 1
    McDowell is correct. This is mostly for parameters. I still suggest you try a few combinations of both XML escaping and URL encoding. (I feel running one over the other might corrupt the entire string, so you may have to see which parts need XML escaping and which parts need this solution.) – Jai May 23 '11 at 12:02
  • @McDowell @Jai But does the `%` need to be escaped (for XML)? – Adam Lynch May 23 '11 at 13:24
  • Don't do this. Use `java.net.URI.create(url).toASCIIString()` – Lasf Oct 10 '22 at 15:20
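A minimal sketch of the `java.net.URI` approach from the last comment. The class name `AsciiUriDemo` and the sample URL are assumptions for illustration. `toASCIIString()` percent-encodes non-ASCII characters as UTF-8 bytes and leaves slashes and existing `%XX` escapes alone.

```java
import java.net.URI;

// Hypothetical wrapper, for illustration only.
final class AsciiUriDemo {
    private AsciiUriDemo() {}

    /** Returns the URL with every non-ASCII character percent-encoded. */
    static String toAscii(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("http://domain/folder/listing-\u00e3.shtml"));
        // http://domain/folder/listing-%C3%A3.shtml
    }
}
```

Note that `URI.create` throws `IllegalArgumentException` for input that is not a legal URI, for example a path containing a raw space, so such characters would need handling beforehand. The result still has to be XML-escaped before it goes into the sitemap.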
2

URLs must be percent-encoded as per the URI spec.

For example, the code point U+00e3 (ã) would become the encoded sequence %C3%A3.
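To make the arithmetic visible, here is a short sketch that derives the encoded sequence from the UTF-8 bytes of a code point. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class PercentEncodeCodePoint {
    private PercentEncodeCodePoint() {}

    /** Percent-encodes one Unicode code point as its UTF-8 byte sequence. */
    static String percentEncode(int codePoint) {
        byte[] utf8 = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            // Mask to 0..255 so negative bytes format as two hex digits.
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncode(0x00E3)); // %C3%A3
    }
}
```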

When a URI is emitted in an XML document, it must conform to the markup requirements for XML.

For example, the URI http://foo/bar?a=b&x=%C3%A3 becomes http://foo/bar?a=b&amp;x=%C3%A3. The ampersand is an escape character in XML.

You can find a detailed discussion of URI encoding here.
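As a sketch of the XML side, the JDK's StAX writer can emit the element and apply the markup escaping itself, so an already percent-encoded URI only needs to be handed over as character data. The class name `LocElementDemo` is hypothetical, and this assumes the default `XMLStreamWriter` implementation bundled with the JDK.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, for illustration only.
final class LocElementDemo {
    private LocElementDemo() {}

    /** Wraps an already percent-encoded URI in a <loc> element. */
    static String locElement(String percentEncodedUri) {
        StringWriter buffer = new StringWriter();
        try {
            XMLStreamWriter xml =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(buffer);
            xml.writeStartElement("loc");
            // writeCharacters escapes '&', '<' and '>' in the text content.
            xml.writeCharacters(percentEncodedUri);
            xml.writeEndElement();
            xml.flush();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toString();
    }
}
```

Calling `LocElementDemo.locElement("http://foo/bar?a=b&x=%C3%A3")` should yield a `<loc>` element in which the ampersand appears as `&amp;` while the `%C3%A3` sequence is unchanged.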

Community
McDowell
2

Don't confuse percentage encoding of non-ASCII characters in URLs with XML entity escapes of characters in URLs. You need to do both when creating XML sitemaps.

In honesty from reading your original post, it seems something funky is going on because the characters you mention remind me of when an unsuccessful conversion has taken place :)

Are you sure those characters truly are part of your URLs when using UTF-8?

Tom
  • `In honesty from reading your original post, it seems something funky is going on because the characters you mention remind me of when an unsuccessful conversion has taken place`. You are right. But I've a script ready to go through the DB and clean that up. But still there's a problem with the encoding too. So if I had those characters, do I need to percent-encode those characters alone and then escape the result for XML (with entities)? – Adam Lynch May 27 '11 at 16:22
  • 1) Convert document to UTF-8 2) Percent-encode all non-ASCII chars 3) Convert & to &amp;, < to &lt;, etc. – Tom May 28 '11 at 11:40
  • I've step one done. And I know how to do step 2 but does % need to be escaped? – Adam Lynch May 28 '11 at 16:04
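Steps 2 and 3 from the comment above can be sketched as follows. All names (`SitemapLoc`, `percentEncodeNonAscii`, `escapeXml`, `toLoc`) are hypothetical. The sketch only touches non-ASCII characters in step 2, so ASCII characters that are illegal in a URL, such as spaces, would need separate handling. On the `%` question: the sitemap text quoted in the question lists only &, ', ", < and > as needing entity escapes, so `%` passes through step 3 unchanged.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SitemapLoc {
    private static final String HEX = "0123456789ABCDEF";

    private SitemapLoc() {}

    /** Step 2: percent-encode every byte of a non-ASCII character. */
    static String percentEncodeNonAscii(String url) {
        StringBuilder sb = new StringBuilder();
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                sb.append((char) v); // plain ASCII, including '/' and '%'
            } else {
                // In UTF-8, every byte of a non-ASCII character is >= 0x80.
                sb.append('%').append(HEX.charAt(v >> 4)).append(HEX.charAt(v & 0x0F));
            }
        }
        return sb.toString();
    }

    /** Step 3: escape the five XML special characters; '&' goes first. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    /** Percent-encode first, then XML-escape the result. */
    static String toLoc(String url) {
        return escapeXml(percentEncodeNonAscii(url));
    }
}
```

The order matters in one direction only: percent-encoding introduces `%`, which XML escaping ignores, whereas escaping first would be harmless here but is harder to reason about once entity text is present.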
1

All non-ASCII characters in a URL have to be 'x-url-encoding' encoded.

Here is the wiki link that explains it: http://en.wikipedia.org/wiki/Percent-encoding.

In addition all XML special symbols (&, >, <, etc.) also have to be escaped.

Jai's answer shows the correct method to x-url-encode an arbitrary string. Note, however, that it does not do XML escaping.

Community
Alexander Pogrebnyak
  • Instead of percent-encoding, punycode is also a possibility: http://tools.ietf.org/html/rfc3492 – Residuum May 23 '11 at 11:43
  • I've added a snippet of the XML file. Are both of your answers still applicable? – Adam Lynch May 23 '11 at 11:48
  • @Adam. Still applies, as your resulting URL is not x-url-encoded. Also, because x-url-encoding is not a trivial operation, I highly recommend keeping URL parts in plain ASCII. I don't know what the requirements are for your system, but could you possibly rename the file to listing-20110523.shtml (or something along those lines)? This way you don't even have to bother with encoding your URLs. – Alexander Pogrebnyak May 23 '11 at 12:11
  • No not really possible. We have a big big system done this way. – Adam Lynch May 23 '11 at 12:17