
I'm looking for a web spider that will collect all the links it sees, save them to a file, and then index those links once it has finished with the ones it is already indexing. It doesn't need a pretty UI or really anything else, as long as it can jump from website to website. It can be in any language, but please don't suggest Nutch.

Qantas 94 Heavy
Noah R

2 Answers


I like NCrawler, but it requires some .NET skills.

It's easy to start with and easy to extend. Have a look!
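For reference, here is a rough sketch of what a minimal NCrawler setup tends to look like, based on the demo code that ships with the library; exact namespaces, class names, and signatures may differ between versions. The `LinkDumperStep` below is a hypothetical pipeline step that just appends every crawled URI to a file:

```csharp
using System;
using System.IO;
using NCrawler;
using NCrawler.HtmlProcessor;
using NCrawler.Interfaces;

// Hypothetical pipeline step: record every URI the crawler visits in links.txt
public class LinkDumperStep : IPipelineStep
{
    public void Process(Crawler crawler, PropertyBag propertyBag)
    {
        File.AppendAllText("links.txt", propertyBag.Step.Uri + Environment.NewLine);
    }
}

public class Program
{
    public static void Main()
    {
        // HtmlDocumentProcessor parses each page and feeds the discovered links
        // back into the crawl; LinkDumperStep just logs them.
        using (var crawler = new Crawler(new Uri("http://example.com/"),
            new HtmlDocumentProcessor(),
            new LinkDumperStep()))
        {
            crawler.MaximumThreadCount = 2;
            crawler.MaximumCrawlDepth = 3;
            crawler.Crawl();
        }
    }
}
```

The pipeline design is what makes it easy to extend: dropping in another `IPipelineStep` (indexing, filtering, storage) is a matter of adding one more argument to the constructor.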

Mikael Östberg

wget will spider sites, is highly configurable, and is open source. It's written in C.

I'm not sure it will spit out a list of links directly, but it will save every file it comes across, and those can easily be converted into a list of links afterwards.
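As a rough sketch (assuming a reasonably recent GNU wget and standard Unix text tools; the URL and depth are placeholders), you can run wget in spider mode so it follows links without keeping the files, then pull the visited URLs out of its log:

```
# Crawl recursively in spider mode (nothing is kept on disk) and log every URL visited
wget --spider -r -l 5 -o crawl.log http://example.com/

# Extract the visited URLs from the log into a de-duplicated list of links
grep -oE 'https?://[^ ]+' crawl.log | sort -u > links.txt
```

Dropping `--spider` gives you the "save everything it runs across" behaviour described above instead.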

Matthew Scharley
  • Is there a .exe of wget? I run Windows. – Noah R Mar 15 '11 at 22:56
  • @Noah Yes, or you can definitely get it as part of cygwin (along with a ton of other *NIX type applications). http://gnuwin32.sourceforge.net/packages/wget.htm – Matthew Scharley Mar 15 '11 at 22:57
  • Alright, I installed the .exe and so now how do I use it? – Noah R Mar 15 '11 at 22:59
  • @Noah You can read the manual here: http://www.gnu.org/software/wget/manual/wget.html . The short version though is that it's a console application which means you'll need to run it from `cmd.exe`. The commandline you are looking for is something along the lines of `wget -r http://example.com/`. – Matthew Scharley Mar 15 '11 at 23:01