4

I recently got interested in web crawlers, but one thing isn't very clear to me. Imagine a simple crawler that gets a page, extracts the links from it, and queues them for later processing in the same way.
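A rough sketch of the kind of crawler I have in mind (Python, using requests and the standard library's html.parser purely for illustration; the exact libraries don't matter):

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    import requests


    class LinkExtractor(HTMLParser):
        # Collects href values from <a> tags.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(start_url, limit=100):
        queue = deque([start_url])
        seen = {start_url}
        while queue and limit > 0:
            url = queue.popleft()
            limit -= 1
            response = requests.get(url)   # but what if this URL points at a PNG or a huge binary?
            extractor = LinkExtractor()
            extractor.feed(response.text)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)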

How do crawlers handle the case where a link doesn't lead to another page but to an asset or some other kind of static file? How would the crawler know? It presumably doesn't want to download that kind of possibly large binary data, nor even XML or JSON files. How does content negotiation fit into this?

The way I think content negotiation should work is on the web server side: when I request example.com/foo.png with Accept: text/html, the server should either send me back an HTML response or a Bad Request status if it cannot satisfy my requirements, because nothing else is acceptable to me. But that's not how it works in real life. The server sends me the binary data anyway, with Content-Type: image/png, even though I told it I only accept text/html. Why do web servers behave like this instead of coercing the response into what I'm asking for?
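For example (foo.png is a made-up URL, and Python's requests is used just for illustration), a typical server serving such a file responds like this:

    import requests

    response = requests.get(
        "https://example.com/foo.png",
        headers={"Accept": "text/html"},
    )
    print(response.status_code)                  # typically 200, not an error status
    print(response.headers.get("Content-Type"))  # image/png -- the Accept header is ignored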

Is the implementation of content negotiation broken, or is it the application's responsibility to implement it correctly?

And how do real crawlers work? Sending a HEAD request ahead of time to check what's on the other side of a link seems like an impractical waste of resources.

Kreeki

2 Answers

5

Not 'Bad Request'; the correct response is 406 Not Acceptable.

The HTTP spec states that the server SHOULD send back this status [1], but most implementations don't. If you want to avoid downloading a content type you're not interested in, your only option is indeed to do a HEAD request first. Since you probably found these links while crawling another HTML page, you may also be able to make some intelligent guesses that a link in fact points to an image (for instance, it appeared in an <img> tag).
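A minimal sketch of the HEAD-first check, assuming Python with requests (the URL is hypothetical):

    import requests

    def looks_like_html(url):
        # Ask only for the headers; the body is never transferred.
        head = requests.head(url, allow_redirects=True)
        content_type = head.headers.get("Content-Type", "")
        return content_type.startswith("text/html")

    url = "https://example.com/foo.png"   # hypothetical link found while crawling
    if looks_like_html(url):
        page = requests.get(url)          # safe to fetch and parse for links
    else:
        pass                              # skip: probably an image or other binary asset

The obvious downside is the extra round trip per link, which is exactly the cost you were worried about.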

You could also just start the request as normal and, as soon as you notice that you're getting binary data back, cut the TCP connection short. But I'm not sure how good an idea that is.
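Sketched in the same illustrative style: with requests and stream=True the headers arrive first and the body is only downloaded if you ask for it, so you can bail out cheaply:

    import requests

    url = "https://example.com/foo.png"   # hypothetical link
    with requests.get(url, stream=True) as response:
        content_type = response.headers.get("Content-Type", "")
        if content_type.startswith("text/html"):
            html = response.text           # only now is the body actually read
        else:
            # Leaving the block closes the connection without downloading the body.
            print("skipping non-HTML response:", content_type)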

Evert
0

Crawlers always have to be on the lookout for bad data: some sites have a 10-megabyte movie named /robots.txt. Even if content negotiation were actually implemented in web servers, plenty of servers have incorrect content types configured, plenty of files have the wrong extension, and the start of a file looking reasonable doesn't mean it won't turn into binary further in, or that it isn't huge.
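One defensive measure is to cap how many bytes you are willing to read from any single URL, whatever its headers claim. A sketch, assuming Python's requests and an arbitrary 1 MB limit:

    import requests

    MAX_BYTES = 1 * 1024 * 1024   # arbitrary cap for this example

    def fetch_capped(url, max_bytes=MAX_BYTES):
        chunks = []
        total = 0
        with requests.get(url, stream=True) as response:
            for chunk in response.iter_content(chunk_size=8192):
                total += len(chunk)
                if total > max_bytes:
                    return None           # too big: give up and close the connection
                chunks.append(chunk)
        return b"".join(chunks)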

Greg Lindahl