
I want to build a tool that scans a website for all of its URLs, not the URLs linked within the pages but the URLs the site itself serves, but I don't know how. Could anyone give me an example of how to start?

Example: www.localhost.dev

     /upload
     /login
     /impress

Not every page is necessarily linked from another page on that domain, so scanning only the HTML would be useless. Another example: I want to generate a sitemap.xml.

Thanks

chunk0r

2 Answers


What are you really trying to accomplish?

You're simply not going to be able to do this via HTTP. Given the absence of vulnerabilities in the HTTP server, you're going to get what the content provider publishes unless you already know direct paths. The only option here is a content crawler.
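
For example, a minimal crawl-based approach might look like the sketch below. It's only an illustration in Python, using the third-party requests and beautifulsoup4 packages (my choice, not something required), with your example host as the start URL; it can still only discover pages that are linked from somewhere it has already fetched.

    # Minimal breadth-first crawler sketch: it can only find URLs that are
    # linked from pages it has already fetched, which is exactly the
    # limitation described above.
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def crawl(start_url):
        host = urlparse(start_url).netloc
        seen = {start_url}
        queue = [start_url]
        while queue:
            url = queue.pop(0)
            try:
                response = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            if "text/html" not in response.headers.get("Content-Type", ""):
                continue
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"]).split("#")[0]
                # Stay on the same host and skip anything already seen.
                if urlparse(link).netloc == host and link not in seen:
                    seen.add(link)
                    queue.append(link)
        return sorted(seen)

    if __name__ == "__main__":
        for page in crawl("http://www.localhost.dev/"):
            print(page)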

With that fact in hand, your other option is to index the site at the file system level. You will have to do a lot of work analyzing the files, since there will most likely be a significant number of files that don't translate to a URL on the server.
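
A rough sketch of that file-system approach (assuming you have shell access to the box; the document root /var/www/localhost.dev and the base URL below are placeholders) is to walk the directory tree, turn each file path into a URL, and emit a basic sitemap.xml. Rewrite rules won't be reflected at all, and includes, templates and other files that are never served directly still have to be filtered out by hand.

    # Sketch: build a URL list (and a basic sitemap.xml) straight from the
    # document root on disk. Files that don't map 1:1 to URLs (includes,
    # rewrite targets, etc.) still need manual filtering.
    import os

    DOC_ROOT = "/var/www/localhost.dev"      # placeholder document root
    BASE_URL = "http://www.localhost.dev"    # placeholder site address

    def list_urls(doc_root, base_url):
        urls = []
        for dirpath, _dirnames, filenames in os.walk(doc_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
                urls.append(f"{base_url}/{rel}")
        return urls

    def write_sitemap(urls, filename="sitemap.xml"):
        with open(filename, "w", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in urls:
                f.write(f"  <url><loc>{url}</loc></url>\n")
            f.write("</urlset>\n")

    if __name__ == "__main__":
        write_sitemap(list_urls(DOC_ROOT, BASE_URL))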

squillman
  • ok, thanks, then my ideas are completely for the trash – chunk0r Apr 11 '14 at 12:58
  • @chunk0r: Don't forget that it's quite possible to have a practically unlimited number of valid URLs on a server, and you don't need more than one HTML file and a rewrite rule for this ... – Sven Apr 11 '14 at 13:07
  • @chunk0r And you're probably getting voted down to the cellar because, without a clear explanation of what you're trying to get done, this could be used for malicious purposes. I'm not saying this is your goal, it could just be interpreted that way. – squillman Apr 11 '14 at 13:11
  • @chunk0r I didn't downvote because I came in and read the comments, but that's how I read it at first. – Katherine Villyard Apr 11 '14 at 14:06

As far as I know, this is impossible. Sometimes admins turn on directory indexes, but any directory that contains an index.html page will just show the HTML page instead of the directory index.

mtak