I'm looking for a web spider that will collect every link it sees, save those links to a file, and then crawl them once it has finished the pages it is already indexing. It doesn't need a pretty UI or really anything else, as long as it can jump from website to website. It can be in any language, but please don't suggest Nutch.
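Roughly the behaviour I'm after, as a quick Python sketch (the start URL, file name, and page limit are just placeholders to illustrate the idea, not any particular tool):

```python
# Minimal sketch: collect every link seen, append it to a file,
# and only crawl it after the current batch is finished (FIFO).
# Standard library only; example.com and links.txt are placeholders.
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(start_url, link_file="links.txt", max_pages=100):
    queue = deque([start_url])
    seen = {start_url}
    with open(link_file, "a") as out:
        while queue and max_pages > 0:
            url = queue.popleft()
            max_pages -= 1
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip pages that fail to download
            for link in LINK_RE.findall(html):
                if link not in seen:
                    seen.add(link)
                    out.write(link + "\n")  # save the link first...
                    queue.append(link)      # ...then crawl it later, in order

if __name__ == "__main__":
    crawl("http://example.com/")
```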
- Out of curiosity, what's wrong with Nutch? – vicTROLLA Mar 15 '11 at 22:52
- Nutch is built for Linux; I run Windows. I also don't have a cluster of servers. Other than those things, nothing. – Noah R Mar 15 '11 at 22:57
2 Answers
I like NCrawler, but it requires some .NET skills.
It's easy to start with and easy to extend. Have a look!

Mikael Östberg
`wget` will spider sites, is really configurable, and is open source. It is written in C.
I'm not sure it will spit out a list of links, but it saves every file it runs across, and those files can easily be converted to a list of links.
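If all you need is the flat list, something along these lines should pull the `href` targets out of whatever `wget -r` saved (the directory and output file names below are placeholders; point them at wherever wget mirrored the site):

```python
# Walk a directory produced by `wget -r` and collect the href targets
# from every saved HTML file into one flat list of links.
# example.com and links.txt are placeholders.
import os
import re

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def links_from_mirror(mirror_dir="example.com", out_file="links.txt"):
    found = set()
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                found.update(LINK_RE.findall(f.read()))
    with open(out_file, "w") as out:
        out.write("\n".join(sorted(found)))

if __name__ == "__main__":
    links_from_mirror()
```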

Matthew Scharley
- @Noah Yes, or you can definitely get it as part of Cygwin (along with a ton of other *NIX-type applications): http://gnuwin32.sourceforge.net/packages/wget.htm – Matthew Scharley Mar 15 '11 at 22:57
- @Noah You can read the manual here: http://www.gnu.org/software/wget/manual/wget.html. The short version, though, is that it's a console application, which means you'll need to run it from `cmd.exe`. The command line you are looking for is something along the lines of `wget -r http://example.com/`. – Matthew Scharley Mar 15 '11 at 23:01