Questions tagged [anemone]

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/
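A minimal sketch of the DSL the description refers to (the URL and pattern here are placeholders, not from any question below):

    require 'anemone'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.skip_links_like(%r{/private/})   # skip certain URLs
      anemone.on_every_page do |page|          # run on every page crawled
        puts page.url
      end
    end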

38 questions
2 votes • 2 answers

How to handle 500 Internal Server Error and 404 Page Not Found with Anemone, Boilerpipe and Nokogiri

I'm implementing a tool that needs to crawl a website. I'm using Anemone to crawl, and on each of Anemone's pages I'm using Boilerpipe and Nokogiri to handle the HTML format, etc. My problem is: if I get a 500 Internal Server Error, it makes Nokogiri fail…
Hugo Sousa • 906 • 2 • 9 • 27
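A minimal sketch of one way to guard the parse, using Anemone's page.code (the HTTP status) to skip error responses before Nokogiri ever sees them:

    require 'anemone'
    require 'nokogiri'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page do |page|
        # Skip 500s, 404s and empty bodies before handing anything to Nokogiri
        next unless page.code == 200 && page.body

        doc = Nokogiri::HTML(page.body)
        # ... Boilerpipe / text extraction on doc here ...
      end
    end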
2 votes • 1 answer

HTTP Basic Authentication with Anemone Web Spider

I need to collect every "title" from every page of a site. The site has HTTP Basic Auth configured. Without auth I do the following: require 'anemone' Anemone.crawl("http://example.com/") do |anemone| anemone.on_every_page do |page| puts…
Sergey Blohin • 600 • 1 • 4 • 31
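One commonly suggested workaround is embedding the credentials in the start URL; whether Anemone passes the userinfo through to Net::HTTP depends on the version, so treat this as a sketch to verify rather than a guaranteed fix:

    require 'anemone'

    # user:password are placeholders for the site's Basic Auth credentials
    Anemone.crawl("http://user:password@example.com/") do |anemone|
      anemone.on_every_page do |page|
        title = page.doc && page.doc.at('title')
        puts title.text if title
      end
    end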
2 votes • 1 answer

How to deserialize BSON::Binary back into a Ruby hash?

I'm using Anemone to store crawled pages into MongoDB. It mostly works, except for accessing the page headers when I retrieve a page from MongoDB. When I call collection.find_one("http://stackoverflow.com") I'll get the correct object from the data…
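Assuming the headers were written with Marshal.dump, as Anemone's MongoDB storage does, a hedged sketch of the unwrap (the field name and the BSON::Binary accessor vary by driver version):

    # collection is the MongoDB collection Anemone writes to
    doc = collection.find_one("url" => "http://stackoverflow.com")

    # BSON::Binary#to_s returns the raw bytes in the legacy mongo driver;
    # newer drivers expose them as #data instead.
    headers = Marshal.load(doc['headers'].to_s)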
2 votes • 1 answer

Rails, Anemone and Postgres: storing just the URL

I want to save the URLs of pages that match an on_pages_like pattern. Anemone is doing its thing, and records are being created that store the URLs, but I want to use something like find_or_create_by_url instead of create!, so I'm not duplicating records each…
Michael Emond • 131 • 10
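A sketch of the swap, assuming a hypothetical CrawledPage ActiveRecord model and a placeholder URL pattern:

    require 'anemone'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_pages_like(%r{/articles/}) do |page|
        # Rails 3 dynamic finder; in Rails 4+ write
        # CrawledPage.find_or_create_by(url: page.url.to_s)
        CrawledPage.find_or_create_by_url(page.url.to_s)
      end
    end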
2 votes • 1 answer

How to crawl only a subfolder with Anemone

We can crawl a whole website with Anemone (e.g. https://stackoverflow.com/), but what if I only want to focus on a certain folder (e.g. https://stackoverflow.com/questions)? How can I do this? Maybe with the "focus_crawl" method?
Ghilas BELHADJ • 13,412 • 10 • 59 • 99
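Yes, focus_crawl fits: its block returns the subset of each page's links that Anemone is allowed to follow. A sketch:

    require 'anemone'

    Anemone.crawl("https://stackoverflow.com/questions") do |anemone|
      # Only follow links whose path stays inside /questions
      anemone.focus_crawl do |page|
        page.links.select { |uri| uri.path.to_s.start_with?('/questions') }
      end
      anemone.on_every_page { |page| puts page.url }
    end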
2 votes • 1 answer

Can Anemone crawl HTML files stored locally on my hard drive?

I'm hoping to scrape together several tens of thousands of pages of government data (in several thousand folders) that are online and put them all into a single file. To speed up the process, I figured I'd download the site to my hard drive first before…
jengman cd • 25 • 2
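Anemone speaks HTTP rather than file://, so the usual answer is to serve the mirrored folder through a throwaway local web server and crawl that; a sketch:

    # In a shell, from the downloaded mirror:
    #   ruby -run -e httpd /path/to/mirror -p 8000
    require 'anemone'

    Anemone.crawl("http://localhost:8000/") do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end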
1 vote • 1 answer

Anemone with Rails and MongoDB

I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results,…
Micah Alcorn • 2,363 • 2 • 22 • 45
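A sketch of wiring up Anemone's built-in MongoDB page store; reusing the app's existing Mongoid connection and choosing a collection name are details to adapt:

    require 'anemone'
    require 'mongo'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.storage = Anemone::Storage.MongoDB   # defaults to a 'pages' collection
      anemone.on_every_page { |page| puts page.url }
    end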
1 vote • 2 answers

Using Ruby's Anemone Gem to Scrape All Email Addresses From a Site

I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hardcoded test case using a URL that has an email address listed on that specific page (so it should find an email…
HMLDude • 1,547 • 7 • 27 • 47
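A single-file sketch of the shape this usually takes; the email pattern is deliberately simplified and will not cover every valid address:

    require 'anemone'

    EMAIL = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i
    emails = []

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page do |page|
        # Scan the raw HTML so addresses inside mailto: links are caught too
        emails.concat(page.body.to_s.scan(EMAIL))
      end
    end

    puts emails.uniq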
1 vote • 1 answer

Writing the output of a loop into a text file from a Ruby web crawler gem

I'm a complete Ruby noob, currently going through the Treehouse tutorials, but I need some quick help outputting the content of an Anemone crawl into a text file for my job (I'm an SEO). How do I get the following to dump its output into a text…
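A sketch of the usual shape: open the file once around the crawl and write a line per page (the filename is a placeholder):

    require 'anemone'

    # The block form of File.open closes the file automatically
    File.open('crawl_output.txt', 'w') do |file|
      Anemone.crawl("http://example.com/") do |anemone|
        anemone.on_every_page { |page| file.puts page.url }
      end
    end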
1 vote • 1 answer

Not able to access page data, using anemone with socksify gem and Tor

I've written a Ruby script using the anemone gem to crawl a website. The script runs fine when used directly, but I would like to use the socksify gem so that all TCP calls from the script are routed over SOCKS5. I did the following: Installed…
buddy • 189 • 2 • 16
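A sketch of the usual socksify wiring; requiring socksify/http before the crawl matters because it patches Net::HTTP, which Anemone uses underneath (Tor's default SOCKS port assumed):

    require 'socksify/http'   # patches Net::HTTP to tunnel through SOCKS
    require 'anemone'

    TCPSocket.socks_server = "127.0.0.1"
    TCPSocket.socks_port   = 9050   # Tor's default SOCKS5 port

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page { |page| puts "#{page.code} #{page.url}" }
    end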
1 vote • 1 answer

Heroku H12 Request timeout when running Ruby Anemone

I have a Ruby app hosted on Heroku that runs Anemone (Ruby web spider / crawler) on user-specified domains. When the user picks a medium-to-large sized domain, it crashes and the logs show an H12 error (Request timeout). This is because Anemone…
dbuss1 • 82 • 4
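Heroku kills web requests after 30 seconds, so the standard fix is to push the crawl into a background worker and have the request return immediately; a hypothetical Sidekiq sketch (CrawlWorker and Result are made-up names):

    require 'sidekiq'
    require 'anemone'

    class CrawlWorker
      include Sidekiq::Worker

      def perform(domain)
        Anemone.crawl(domain) do |anemone|
          anemone.on_every_page { |page| Result.create!(url: page.url.to_s) }
        end
      end
    end

    # In the controller: enqueue and respond right away
    CrawlWorker.perform_async(params[:domain])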
1 vote • 2 answers

When to use 'http://' or 'http://www.' when scraping?

I am scraping a small number of sites with the Ruby anemone gem. Anemone.crawl("http://www.somesite.com") do |anemone| anemone.on_every_page do |page| ... end end Depending on the site, some require 'www' to be present…
Jackson Henley • 1,531 • 2 • 15 • 27
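One hedged approach: let the site itself tell you the canonical host by following its redirect before starting the crawl (canonical_url is a made-up helper, and a relative Location header would need extra handling):

    require 'net/http'
    require 'anemone'

    def canonical_url(url)
      response = Net::HTTP.get_response(URI(url))
      # A site that insists on (or rejects) www answers with a redirect
      response.is_a?(Net::HTTPRedirection) ? response['location'] : url
    end

    Anemone.crawl(canonical_url("http://somesite.com")) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end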
1 vote • 3 answers

Ruby scraper. How to export to CSV?

I wrote this Ruby script to scrape product info from the manufacturer website. The scraping and storage of the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is being…
Dan • 641 • 9 • 25
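A sketch of the export step with Ruby's stdlib CSV; the product field names here are assumptions:

    require 'csv'

    # products is the array built up during the crawl
    CSV.open('products.csv', 'w') do |csv|
      csv << %w[name price url]   # header row
      products.each { |product| csv << [product.name, product.price, product.url] }
    end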
0 votes • 1 answer

Anemone Crawler skip_links_like not obeyed

I am using Anemone to crawl a massive site that, to make things worse, serves the same content under several language versions. There is domain.com/ for the main language and domain.com/de/, domain.com/es/ for the other languages, so I decided to…
Killerpixler • 4,200 • 11 • 42 • 82
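In the Anemone versions I have seen, skip_links_like matches its patterns against the link's path rather than the full URL, which is the usual reason patterns seem to be ignored; a sketch with path-anchored patterns:

    require 'anemone'

    Anemone.crawl("http://domain.com/") do |anemone|
      # Anchor on the path: /de/... and /es/... never get enqueued
      anemone.skip_links_like(%r{^/de/}, %r{^/es/})
      anemone.on_every_page { |page| puts page.url }
    end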
0 votes • 1 answer

"character class has duplicated range" warning in an email regular expression

Result of xmpfilter:

    doc.search('.noimage p:nth-child(5)') do |kaipan|
      x = kaipan.to_s
      x.scan(/[\w\d_-]+@[\w\d_-]+\.[\w\d._-]+/)
      # !> character class has duplicated range: /[\w\d_-]+@[\w\d_-]+\.[\w\d._-]+/
    end

If I don't use…
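The warning itself is benign: \w already covers letters, digits and the underscore, so the \d and _ inside the same character class duplicate ranges. A deduplicated sketch (doc is the question's Nokogiri document):

    # Same intent, without the duplicated ranges inside the classes
    EMAIL = /[\w\-]+@[\w\-]+\.[\w.\-]+/

    doc.search('.noimage p:nth-child(5)').each do |kaipan|
      kaipan.to_s.scan(EMAIL)
    end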