Questions tagged [anemone]

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/
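
As a rough illustration of the DSL described above, here is a minimal sketch that skips some links and records the URL of every page visited. `Anemone.crawl`, `skip_links_like`, `on_every_page` and the `:depth_limit` option are part of the gem; the names `SKIP_PATTERN`, `skippable?` and `collect_urls`, and the login/logout rule itself, are assumptions made up for this example. The `skippable?` helper mirrors the pattern check against the URL path so the rule can be tried without crawling anything.

```ruby
require 'uri'

# Hypothetical rule: do not follow login/logout links (an assumption for this sketch).
SKIP_PATTERN = %r{/(login|logout)\b}

# Pure helper: true when the path of the given URL matches SKIP_PATTERN.
# It needs neither the gem nor network access.
def skippable?(url)
  URI(url.to_s).path.match?(SKIP_PATTERN)
end

# Crawl start_url and return the URLs of the pages visited, as strings.
def collect_urls(start_url, depth_limit: 2)
  require 'anemone' # loaded lazily, so the helper above works without the gem

  urls = []
  Anemone.crawl(start_url, depth_limit: depth_limit) do |anemone|
    # Links matching the pattern are not followed.
    anemone.skip_links_like(SKIP_PATTERN)

    # Runs once for each page fetched.
    anemone.on_every_page do |page|
      urls << page.url.to_s
    end
  end
  urls
end
```

Calling `collect_urls('http://www.example.com/')` would start a real crawl, so it needs the gem installed and network access; `skippable?('http://www.example.com/login')` can be checked on its own.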

38 questions
0 votes, 0 answers

break statement in loop not working

I am new to the anemone gem. I have written the following code: anemone.on_every_page do |page| if page.url.to_s.match(/\-ad$/) unless page.url.to_s.match("restaurant|hotel") p "not useful url: #{page.url}" count +=…
Joy
0 votes, 0 answers

Ruby open_uri always 404. (allow https redirects git version)

I'm using the open-uri module which allows https redirects. What I'm trying to do is open every page from a domain. I do this by first crawling it through anemone: require 'anemone' require "./open_uri" class Query def initialize() fs =…
Bula
0 votes, 1 answer

web crawler in rails, how to crawl all pages of the site

I need to get all URLs from all pages of a given domain. I think it makes sense to use background jobs, placing them on multiple queues. I tried cobweb, but it seems a very confusing gem, and anemone; anemone runs for a long time if…
Aydar Omurbekov
0 votes, 2 answers

Getting all URLs using anemone gem (very large site)

The site I want to index is fairly big, 1.x million pages. I really just want a JSON file of all the URLs so I can run some operations on them (sorting, grouping, etc.). The basic Anemone loop worked well: require…
mustacheMcGee
0 votes, 1 answer

How to handle NILs with Anemone / Nokogiri web scraper?

def scrape!(url) Anemone.crawl(url) do |anemone| anemone.on_pages_like %[/events/detail/.*] do |page| show = { headliner: page.doc.at_css('h1.summary').text, openers: page.doc.at_css('.details h2').text …
GN.
0 votes, 1 answer

anemone print links on first page

I wanted to see what I was doing wrong here. I need to print the links on the parent page, even if they are for another domain, and then exit. require 'anemone' url = ARGV[0] Anemone.crawl(url, :depth_limit => 1) do |anemone| anemone.on_every_page do…
tven
-1 votes, 1 answer

How to scrape products from site with ruby/anemone/nokogiri

Is it possible to scrape the products from an e-commerce site using the anemone and nokogiri libs in Ruby? I understand how to pull the data I need from each product page using nokogiri, but I can't figure out how to make anemone/nokogiri crawl the…
Dan
-2 votes, 1 answer

Anemone - NoMethodError: undefined method `xpath' for nil:NilClass

I'm just starting to learn more about writing a web crawler in Ruby which is designed to crawl my blog and find broken external links using the Anemone gem and the rake task below... task :testing_this => :environment do require 'anemone' …