What are the correct content-types for XML, HTML and XHTML documents?

I need to write a simple crawler that only fetches these kinds of files.

Nowadays a URL such as http://example.net/index.html can serve, for example, a JPEG file due to mod_rewrite, so I need to check the Content-Type response header and compare it against a list of allowed content types.

Where can I get such a list from?

Tomáš Zato
astropanic

1 Answer

HTML: text/html, full-stop.

XHTML: application/xhtml+xml, or text/html only if the document follows the HTML compatibility guidelines. See the W3 Media Types Note.

XML: text/xml, application/xml (RFC 2376).

There are also many other media types based around XML, for example application/rss+xml or image/svg+xml. It's a safe bet that any unrecognised but registered type ending in +xml is XML-based. See the IANA list for registered media types ending in +xml.

(For unregistered x- types, all bets are off, but you'd hope +xml would be respected.)
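For the crawler in the question, the check could look roughly like the sketch below. It is only an illustration under some assumptions: the names `ALLOWED_TYPES`, `media_type` and `is_markup` are made up for this example, and it treats every `+xml` suffix as acceptable, which also lets through things like image/svg+xml that you may prefer to exclude.

```python
# Media types named in this answer; the set and function names are hypothetical.
ALLOWED_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


def media_type(header_value):
    """Return the lowercased type/subtype, dropping parameters such as charset."""
    return header_value.split(";", 1)[0].strip().lower()


def is_markup(header_value):
    """True if the Content-Type value names HTML, XHTML, XML or a +xml type."""
    mt = media_type(header_value)
    if mt in ALLOWED_TYPES:
        return True
    # Assumption: any type/subtype ending in +xml is XML-based.
    return "/" in mt and mt.endswith("+xml")
```

With Python's `urllib.request`, the value to pass in would be something like `response.headers.get("Content-Type", "")`. Media type names are case-insensitive and may carry parameters (for example `text/html; charset=UTF-8`), which is why the sketch normalises the value before comparing.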

bobince
  • On differences between `text/xml` and `application/xml` see here http://stackoverflow.com/questions/4832357/whats-the-difference-between-text-xml-vs-application-xml-for-webservice-respons – sanmai Jul 10 '14 at 08:40
  • The same is valid for *fragments*, see http://w3.org/TR/xml-fragment or [this other question](http://stackoverflow.com/q/19303361/287948). – Peter Krauss May 10 '16 at 19:21