
I need to get all URLs from all pages of a given domain. I think it makes sense to use background jobs, placing them on multiple queues. I tried the cobweb gem, but it seems very confusing, and Anemone takes a very long time when there are a lot of pages:

require 'anemone'

# Crawl the domain and print every link found on each page.
Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.links
  end
end
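
To combine this with background jobs, here is a minimal sketch of what I have in mind, assuming Sidekiq; CrawlPageWorker and the 'crawl' queue name are illustrative. The crawl loop only discovers URLs and hands each page off to a worker, so multiple Sidekiq processes could drain different queues in parallel:

require 'anemone'
require 'sidekiq'

# Hypothetical worker; 'crawl' is an illustrative queue name.
class CrawlPageWorker
  include Sidekiq::Worker
  sidekiq_options queue: 'crawl'

  def perform(url)
    # Process the page here; for now, just record the URL.
    puts url
  end
end

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    # Enqueue each discovered page instead of processing it inline.
    CrawlPageWorker.perform_async(page.url.to_s)
  end
end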

What do you think would fit me best?

Aydar Omurbekov

1 Answer


You can use Apache Nutch. It is a highly extensible and scalable open-source web crawler software project.
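
A typical crawl cycle with Nutch 1.x, sketched from the official tutorial (the crawl/ and urls/ paths and the output directory name are illustrative):

# Seed the crawl database with the start URLs listed under urls/
bin/nutch inject crawl/crawldb urls

# One fetch cycle: select URLs, fetch and parse them, feed new links back
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$segment"
bin/nutch parse "$segment"
bin/nutch updatedb crawl/crawldb "$segment"

# Repeat the cycle as needed, then dump all collected URLs
bin/nutch readdb crawl/crawldb -dump urls_out

Repeating the generate/fetch/parse/updatedb cycle walks deeper into the site; the readdb dump then contains every URL Nutch has seen.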

ajknzhol