
I've been using a sitemapping tool to get a simple count of links below a specific url. The free trial has ended, so I figure that rather than paying $70 for what is very simple functionality, I should just use wget.

Here's what I have so far:

wget --spider --recursive http://url.com/

I'm not sure, however, how to calculate the number of links found from this. I'm also slightly nervous about whether this is doing what I want it to do: will it only follow links below the domain of url.com?
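
In case it clarifies what I'm after, here's roughly what I was imagining for the counting side (spider.log is just an arbitrary filename, and I'm assuming wget logs a line for every URL it checks):

# spider the site without downloading anything, keeping the log
wget --spider --recursive --no-parent -o spider.log http://url.com/
# pull the URLs out of the log and count the unique ones
grep -oE 'https?://[^ ]+' spider.log | sort -u | wc -l

From what I've read, -r stays on the starting host unless you add --span-hosts, and --no-parent keeps it from climbing above the starting path, but I'd like confirmation that this actually counts what I think it counts.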

Any ideas on how to accomplish this?

Thanks.

rybosome
  • Did this get downvoted because this question is not appropriate for server fault? If so, which stack exchange site would it be appropriate for? I feel that it's a fair question, and I'm still looking for a satisfactory answer. – rybosome Dec 28 '11 at 15:05
  • You could try [Pro WebMasters](http://webmasters.stackexchange.com/), but DO NOT repost it there yet; it is likely that this will be migrated. – tombull89 Dec 28 '11 at 23:00

1 Answer

sudo apt-get install lynx-cur


lynx -dump -listonly http://serverfault.com | head
   1. http://serverfault.com/opensearch.xml
   2. http://serverfault.com/feeds
   3. http://stackexchange.com/
   4. http://serverfault.com/users/login
   5. http://careers.serverfault.com/
   6. http://blog.serverfault.com/
   7. http://meta.serverfault.com/
   8. http://serverfault.com/about
   9. http://serverfault.com/faq
  10. http://serverfault.com/

And so on.
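
If it's literally just the count you're after, you can get that in one line (assuming your lynx build supports -nonumbers, so each link is printed as a bare URL):

# count every line of the link list that contains a URL
lynx -dump -listonly -nonumbers http://serverfault.com | grep -c '://'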

Edit: For the lazy OP.

tom@altoid ~ $ lynx -dump -nonumbers -listonly http://serverfault.com | egrep -v "^$" | egrep -v "(Visible|Hidden) links" | while read link; do echo -n "$link   :"; curl -I -s "$link" | grep HTTP; done
Visible links   :HTTP/1.1 200 OK
HTTP/1.1 200 OK
http://serverfault.com/opensearch.xml   :HTTP/1.1 200 OK
http://serverfault.com/feeds   :HTTP/1.1 200 OK
http://stackexchange.com/   :HTTP/1.1 200 OK
http://serverfault.com/users/login   :HTTP/1.1 200 OK
http://careers.serverfault.com/   :HTTP/1.1 302 Found
http://blog.serverfault.com/   :HTTP/1.1 200 OK

Better?!
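
And if you only want to count the links that actually answer with a 200, the same pipeline will feed a final grep -c (again, only a sketch, same filtering as above):

# print just the status code for each link, then count the 200s
lynx -dump -nonumbers -listonly http://serverfault.com | egrep -v "^$" | egrep -v "(Visible|Hidden) links" | \
  while read link; do curl -I -s -o /dev/null -w '%{http_code}\n' "$link"; done | grep -c '^200$'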

Tom O'Connor
  • This won't recursively descend a site. `lynx -help` on the option `-dump` says the following: 'dump the first file to stdout and exit'. Even using the option `-traversal`, this will exit after dumping links from the homepage. – rybosome Dec 28 '11 at 15:14
  • If you can't figure out how to pipeline that list of links, then I'm afraid there's little anyone here can help you with. – Tom O'Connor Dec 28 '11 at 16:59
  • So, you're talking about piping the output of this into another run of lynx? Great, two levels of depth, not accounting for repeated links. So to fully spider the site, I'd end up needing to write a shell script to continually call this. At that point, I may as well just modify the Ruby crawler I've already written to only count links instead of scrape content. – rybosome Dec 28 '11 at 20:53
  • @Ryan See edited answer. It's really easy. – Tom O'Connor Dec 28 '11 at 23:08
  • I appreciate your help, but I'm not sure what benefit you get by being rude. – rybosome Dec 29 '11 at 14:58
  • It's fun. It's Christmas. I'm British, and snarky. – Tom O'Connor Dec 29 '11 at 22:40