I'm looking for a web spider that will collect every link it sees, save those links to a file, and then crawl them once it has finished the pages it is already indexing. It doesn't need a pretty UI or really anything else, as long as it can jump from website to website. It can be in any language, but please don't suggest Nutch.
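Roughly the behaviour I'm after, as a quick Python sketch (the start URL, file name, and page limit are just placeholders to illustrate the idea, not any particular tool):

```python
# Minimal sketch: collect every link seen, append it to a file,
# and only crawl it after the current batch is finished (FIFO).
# Standard library only; example.com and links.txt are placeholders.
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def crawl(start_url, link_file="links.txt", max_pages=100):
    queue = deque([start_url])
    seen = {start_url}
    with open(link_file, "a") as out:
        while queue and max_pages > 0:
            url = queue.popleft()
            max_pages -= 1
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue  # skip pages that fail to download
            for link in LINK_RE.findall(html):
                if link not in seen:
                    seen.add(link)
                    out.write(link + "\n")  # save the link first...
                    queue.append(link)      # ...then crawl it later, in order

if __name__ == "__main__":
    crawl("http://example.com/")
```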
- Out of curiosity, what's wrong with Nutch? – vicTROLLA Mar 15 '11 at 22:52
- Nutch is built for Linux; I run Windows. I also don't have a cluster of servers. Other than those things, nothing. – Noah R Mar 15 '11 at 22:57
2 Answers
I like NCrawler, but it requires some .NET skills.
It's easy to start with and easy to extend. Have a look!

Mikael Östberg
`wget` will spider sites, is really configurable, and is open source. It is written in C.
I'm not sure it will spit out a list of links, but it saves every file it runs across, and those files can easily be converted to a list of links.
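If all you need is the flat list, something along these lines should pull the `href` targets out of whatever `wget -r` saved (the directory and output file names below are placeholders; point them at wherever wget mirrored the site):

```python
# Walk a directory produced by `wget -r` and collect the href targets
# from every saved HTML file into one flat list of links.
# example.com and links.txt are placeholders.
import os
import re

LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def links_from_mirror(mirror_dir="example.com", out_file="links.txt"):
    found = set()
    for root, _dirs, files in os.walk(mirror_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                found.update(LINK_RE.findall(f.read()))
    with open(out_file, "w") as out:
        out.write("\n".join(sorted(found)))

if __name__ == "__main__":
    links_from_mirror()
```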

Matthew Scharley
- @Noah Yes, or you can definitely get it as part of Cygwin (along with a ton of other *NIX-type applications): http://gnuwin32.sourceforge.net/packages/wget.htm – Matthew Scharley Mar 15 '11 at 22:57
- @Noah You can read the manual here: http://www.gnu.org/software/wget/manual/wget.html. The short version, though, is that it's a console application, which means you'll need to run it from `cmd.exe`. The command line you are looking for is something along the lines of `wget -r http://example.com/`. – Matthew Scharley Mar 15 '11 at 23:01