
I want to build a tool that scans a website for all of its URLs, not the URLs linked within the pages but the URLs the site itself serves, but I don't know how. Could anyone give me an example of how to start?

Example: www.localhost.dev

     /upload
     /login
     /impress

Not every page is necessarily linked from another page on that domain, so scanning only the HTML would be useless. Another example: I want to generate a sitemap.xml.

Thanks

chunk0r

2 Answers


What are you really trying to accomplish?

You're simply not going to be able to do this via HTTP. Given the absence of vulnerabilities in the HTTP server, you're going to get what the content provider publishes unless you already know direct paths. The only option here is a content crawler.
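
For example, a minimal crawl-based approach might look like the sketch below. It's only an illustration in Python, using the third-party requests and beautifulsoup4 packages (my choice, not something required), with your example host as the start URL; it can still only discover pages that are linked from somewhere it has already fetched.

    # Minimal breadth-first crawler sketch: it can only find URLs that are
    # linked from pages it has already fetched, which is exactly the
    # limitation described above.
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url):
        host = urlparse(start_url).netloc
        seen = {start_url}
        queue = [start_url]
        while queue:
            url = queue.pop(0)
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            if "text/html" not in response.headers.get("Content-Type", ""):
                continue
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"]).split("#")[0]
                # Stay on the same host and skip anything already seen.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return sorted(seen)

    if __name__ == "__main__":
        for page in crawl("http://www.localhost.dev/"):
            print(page)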

With that fact in hand, your other option is to index the site at the file system level. You will have to do a lot of work analyzing the files, since there will most likely be a significant number of files that don't translate to a URL on the server.
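
A rough sketch of that file-system approach (assuming you have shell access to the box; the document root /var/www/localhost.dev and the base URL below are placeholders) is to walk the directory tree, turn each file path into a URL, and emit a basic sitemap.xml. Rewrite rules won't be reflected at all, and includes, templates and other files that are never served directly still have to be filtered out by hand.

    # Sketch: build a URL list (and a basic sitemap.xml) straight from the
    # document root on disk. Files that don't map 1:1 to URLs (includes,
    # rewrite targets, etc.) still need manual filtering.
    import os

    DOC_ROOT = "/var/www/localhost.dev"      # placeholder document root
    BASE_URL = "http://www.localhost.dev"    # placeholder site address

    def list_urls(doc_root, base_url):
        urls = []
        for dirpath, _dirnames, filenames in os.walk(doc_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
                urls.append(f"{base_url}/{rel}")
        return urls

    def write_sitemap(urls, filename="sitemap.xml"):
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls:
                f.write(f"  <url><loc>{url}</loc></url>\n")
            f.write("</urlset>\n")

    if __name__ == "__main__":
        write_sitemap(list_urls(DOC_ROOT, BASE_URL))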

squillman
  • ok, thanks, then my ideas are completely for the trash – chunk0r Apr 11 '14 at 12:58
  • @chunk0r: Don't forget that it's quite possible to have a practically unlimited number of valid URLs on a server, and you don't need more than one HTML file and a rewrite rule for this ... – Sven Apr 11 '14 at 13:07
  • @chunk0r And you're probably getting voted down to the cellar because, without a clear explanation of what you're trying to get done, this could be used for malicious purposes. I'm not saying this is your goal, it could just be interpreted that way. – squillman Apr 11 '14 at 13:11
  • @chunk0r I didn't downvote because I came in and read the comments, but that's how I read it at first. – Katherine Villyard Apr 11 '14 at 14:06

As far as I know, this is impossible. Sometimes admins turn on directory indexes, but any directory that contains an index.html page will just show the HTML page instead of the directory index.

mtak