I recently got interested in web crawlers, but one thing isn't very clear to me. Imagine a simple crawler that gets a page, extracts links from it, and queues them for later processing in the same way.
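Something like this toy sketch, say with Python's requests and BeautifulSoup (just to illustrate what I have in mind, not code I'm actually running):

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url: str, limit: int = 100) -> None:
    # Breadth-first crawl: fetch a page, extract its links, queue them
    queue = deque([start_url])
    seen = {start_url}
    while queue and limit > 0:
        url = queue.popleft()
        limit -= 1
        resp = requests.get(url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in seen:
                seen.add(link)
                queue.append(link)
```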
How do crawlers handle the case where a link doesn't lead to another page but to an asset or some other kind of static file? How would the crawler know? It probably doesn't want to download potentially large binary data, or even XML or JSON files. How does content negotiation fit into this?
The way I see it, content negotiation should work on the webserver side: when I issue a request to example.com/foo.png with Accept: text/html, the server should send back an HTML response, or an error status (406 Not Acceptable) if it cannot satisfy my requirements; nothing else should be acceptable. But that's not how it works in real life. The server sends back the binary data anyway, with Content-Type: image/png, even though I told it I only accept text/html. Why do webservers work like this instead of enforcing the response type I'm asking for?
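To show what I mean, here is roughly the experiment (a sketch with Python's requests library; the URL is just a placeholder):

```python
import requests

# Explicitly ask for HTML only when requesting what is clearly an image URL
resp = requests.get(
    "https://example.com/foo.png",
    headers={"Accept": "text/html"},
)

# I would expect the server to refuse, but in practice it typically
# returns the image bytes anyway with its real content type
print(resp.status_code)                   # e.g. 200
print(resp.headers.get("Content-Type"))   # e.g. image/png
```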
Is the implementation of content negotiation broken, or is it the application's responsibility to implement it correctly?
And how do real crawlers actually work? Sending a HEAD request ahead of time to check what's on the other side of a link seems like an impractical waste of resources.
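To be concrete, this is the HEAD-first approach I mean, which feels wasteful because it doubles the number of requests (again just a sketch with the requests library; looks_like_html is a name I made up):

```python
import requests


def looks_like_html(url: str) -> bool:
    # Probe the URL with a HEAD request first to inspect the Content-Type
    # without downloading the body; this doubles the request count
    head = requests.head(url, allow_redirects=True, timeout=10)
    content_type = head.headers.get("Content-Type", "")
    return content_type.startswith("text/html")


# Only fetch and parse URLs that report an HTML content type
if looks_like_html("https://example.com/foo.png"):
    page = requests.get("https://example.com/foo.png", timeout=10)
```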