0

I have written a C# application to crawl websites. Now I have a problem: I can't tell whether a given URL leads to a file or a web page. How can I determine this without sending a request to the URL?

Reza Aghaei
Amirali Eshghi
  • URLs serve content. What do you mean by "file" vs. "web page"? Are you actually asking for the `Content-Type` header for the response? – SLaks May 03 '16 at 17:18
  • Which client are you using? You should be able to make a HEAD request to the url and examine the content-type in the response headers. – Lee May 03 '16 at 17:23
  • `"How can I solve this problem without having to send the requested URL?"` - You can't. A URL by itself is just an address. It doesn't provide any information about what is *at* that address, it just tells you where to look for something. The web server at that address can return *anything*. You'd have to make *some* request (minimally a `HEAD` request) to get more information about the content at that address. – David May 03 '16 at 17:28

3 Answers

4

You can't, not without sending a request. A Uniform Resource Locator is not comparable to a file system path. For instance, although the following URL ends with .jpg, it clearly does not point to a picture:

google.com/search?q=asd.jpg

If you change your mind and decide to send a request, here is how:

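// Requires: using System; using System.Net;
public bool IsFileContent(string url)
{
    // A HEAD request asks for the headers only, so no body is downloaded.
    var request = WebRequest.Create(url);
    request.Method = "HEAD";

    // Dispose the response so the underlying connection is released.
    using (var response = request.GetResponse())
    {
        // Drop parameters such as "; charset=utf-8" before comparing.
        var mediaType = response.ContentType.Split(';')[0].Trim().ToLowerInvariant();

        switch (mediaType)
        {
            case "image/jpeg": return true;
            case "text/plain": return true;
            case "text/html": return false;

            default: // TODO: add more cases as needed
                throw new ArgumentOutOfRangeException(nameof(url), "Unhandled content type: " + mediaType);
        }
    }
}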
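Servers often append parameters to the `Content-Type` header (for example `text/html; charset=utf-8`), and a crawler usually cares only about "HTML or not" rather than an exhaustive list of types. The following is a minimal sketch of that idea, not part of the answer's original code. The class name `ContentTypeClassifier`, the method `IsWebPage`, and the choice of which media types count as a "web page" are all assumptions made for illustration.

```csharp
using System;

public static class ContentTypeClassifier
{
    // Assumption: only these media types are treated as "web pages".
    private static readonly string[] PageTypes =
    {
        "text/html",
        "application/xhtml+xml"
    };

    // Takes the raw Content-Type header value from a response and
    // reports whether it describes an HTML page.
    public static bool IsWebPage(string contentTypeHeader)
    {
        // A missing header tells us nothing, so do not claim it is a page.
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return false;
        }

        // Keep only the media type, dropping parameters such as "; charset=utf-8".
        string mediaType = contentTypeHeader.Split(';')[0].Trim();

        return Array.Exists(
            PageTypes,
            t => t.Equals(mediaType, StringComparison.OrdinalIgnoreCase));
    }
}
```

With a helper like this, the value of `response.ContentType` from the HEAD request can be passed in directly, and anything that is not HTML can be treated as a file without throwing on unknown types.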
Xiaoy312
2

What you are asking to do is literally impossible. URLs do not 'lead to files or web pages.' They are routed to request handlers. A request handler can return an HTML response or a file download or other types of responses. Some extensions such as ".html" or ".pdf" imply what the type of response should be. But a URL could have an extension that doesn't indicate the response type, or (as on this very page) no extension at all.

You cannot determine the response type of an HTTP request from the URL alone.

Scott Hannen
-1

Without sending any request, the only thing I can think of is to check for a file extension at the end of the URL. This won't give you a 100% success rate, because a server can send a file from a URL that doesn't end in an extension. That being said, it is common practice for a file URL to end with the file name and its extension.
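As a rough sketch of this heuristic, the extension can be read from the path part of the URL only, so that query strings are ignored. The class name `UrlHeuristics` and the method `GuessExtension` are hypothetical, and the result is only a guess: the server is free to return any content type regardless of what the path looks like.

```csharp
using System;
using System.IO;

public static class UrlHeuristics
{
    // Returns the lower-cased extension of the URL's path (for example ".pdf"),
    // or an empty string if the URL is not an absolute http/https URL
    // or its path has no extension.
    public static string GuessExtension(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
        {
            return string.Empty;
        }

        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return string.Empty;
        }

        // AbsolutePath excludes the query string, so "?q=asd.jpg" is not considered.
        return Path.GetExtension(uri.AbsolutePath).ToLowerInvariant();
    }
}
```

For `http://google.com/search?q=asd.jpg` the path is `/search`, so the guess is an empty string rather than `.jpg`. A path such as `/DownloadFile.aspx` yields `.aspx`, which says nothing about what is actually downloaded, so the guess is best used only to prioritize which URLs to check with a real request.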

Tom Droste
  • File extensions are entirely meaningless on URLs. HTTP isn't a file system. – David May 03 '16 at 17:24
  • @David No, but most of the uploads made to a web server, or files that exist on a web server and are available for download, do end in a file extension (.pdf, .jpg, etc). – Tom Droste May 03 '16 at 17:25
  • So what would be the type of: `/DownloadFile.aspx?fileID=123` Or: `/Files/123` Or: `/FindFiles.aspx?searchText=*.jpg` – David May 03 '16 at 17:39
  • Like I said, it won't give a 100% success rate... but it was the only thing I could think of that would give you an idea without sending any kind of request. – Tom Droste May 03 '16 at 17:53