Questions tagged [anemone]

Anemone is a Ruby library that makes it quick and painless to write programs that spider a website. It provides a simple DSL for performing actions on every page of a site, skipping certain URLs, and calculating the shortest path to a given page on a site. The multi-threaded design makes Anemone fast. The API makes it simple. And the expressiveness of Ruby makes it powerful.

http://anemone.rubyforge.org/
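A minimal sketch of the DSL the description refers to (the URL and pattern here are placeholders, not from any question below):

    require 'anemone'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.skip_links_like(%r{/private/})   # skip certain URLs
      anemone.on_every_page do |page|          # run on every page crawled
        puts page.url
      end
    end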

38 questions
2 votes • 2 answers

How to handle 500 Internal Server Error and 404 Page Not Found with Anemone, Boilerpipe and Nokogiri

I'm implementing a tool that needs to crawl a website. I'm using Anemone to crawl, and on each of Anemone's pages I'm using Boilerpipe and Nokogiri to handle the HTML format, etc. My problem is: if I get a 500 Internal Server Error, it makes Nokogiri fail…
Hugo Sousa • 906 • 2 • 9 • 27
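A minimal sketch of one way to guard the parse, using Anemone's page.code (the HTTP status) to skip error responses before Nokogiri ever sees them:

    require 'anemone'
    require 'nokogiri'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page do |page|
        # Skip 500s, 404s and empty bodies before handing anything to Nokogiri
        next unless page.code == 200 && page.body

        doc = Nokogiri::HTML(page.body)
        # ... Boilerpipe / text extraction on doc here ...
      end
    end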
2 votes • 1 answer

HTTP Basic Authentication with Anemone Web Spider

I need to collect every "title" from every page of a site. The site has HTTP Basic Auth configured. Without auth I do the following: require 'anemone' Anemone.crawl("http://example.com/") do |anemone| anemone.on_every_page do |page| puts…
Sergey Blohin • 600 • 1 • 4 • 31
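One commonly suggested workaround is embedding the credentials in the start URL; whether Anemone passes the userinfo through to Net::HTTP depends on the version, so treat this as a sketch to verify rather than a guaranteed fix:

    require 'anemone'

    # user:password are placeholders for the site's Basic Auth credentials
    Anemone.crawl("http://user:password@example.com/") do |anemone|
      anemone.on_every_page do |page|
        title = page.doc && page.doc.at('title')
        puts title.text if title
      end
    end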
2 votes • 1 answer

How to deserialize BSON::Binary back into a Ruby hash?

I'm using Anemone to store crawled pages into MongoDB. It mostly works, except for accessing the page headers when I retrieve a page from MongoDB. When I call collection.find_one("http://stackoverflow.com") I'll get the correct object from the data…
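Assuming the headers were written with Marshal.dump, as Anemone's MongoDB storage does, a hedged sketch of the unwrap (the field name and the BSON::Binary accessor vary by driver version):

    # collection is the MongoDB collection Anemone writes to
    doc = collection.find_one("url" => "http://stackoverflow.com")

    # BSON::Binary#to_s returns the raw bytes in the legacy mongo driver;
    # newer drivers expose them as #data instead.
    headers = Marshal.load(doc['headers'].to_s)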
2 votes • 1 answer

Rails, Anemone and Postgres: storing just the URL

I want to save the URLs of pages that match an on_pages_like pattern. Anemone is doing its thing, and records are being created that store the URLs, but I want to use something like find_or_create_by_url instead of create!, so I'm not duplicating records each…
Michael Emond • 131 • 10
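A sketch of the swap, assuming a hypothetical CrawledPage ActiveRecord model and a placeholder URL pattern:

    require 'anemone'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_pages_like(%r{/articles/}) do |page|
        # Rails 3 dynamic finder; in Rails 4+ write
        # CrawledPage.find_or_create_by(url: page.url.to_s)
        CrawledPage.find_or_create_by_url(page.url.to_s)
      end
    end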
2 votes • 1 answer

How to crawl only a subfolder with Anemone

We can crawl a whole website with Anemone (e.g. https://stackoverflow.com/), but what if I only want to focus on a certain folder (e.g. https://stackoverflow.com/questions)? How can I do this? Maybe with the "focus_crawl" method?
Ghilas BELHADJ • 13,412 • 10 • 59 • 99
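Yes, focus_crawl fits: its block returns the subset of each page's links that Anemone is allowed to follow. A sketch:

    require 'anemone'

    Anemone.crawl("https://stackoverflow.com/questions") do |anemone|
      # Only follow links whose path stays inside /questions
      anemone.focus_crawl do |page|
        page.links.select { |uri| uri.path.to_s.start_with?('/questions') }
      end
      anemone.on_every_page { |page| puts page.url }
    end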
2 votes • 1 answer

Can Anemone crawl HTML files stored locally on my hard drive?

I'm hoping to scrape together several tens of thousands of pages of government data (in several thousand folders) that are online and put them all into a single file. To speed up the process, I figured I'd download the site to my hard drive first before…
jengman cd • 25 • 2
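Anemone speaks HTTP rather than file://, so the usual answer is to serve the mirrored folder through a throwaway local web server and crawl that; a sketch:

    # In a shell, from the downloaded mirror:
    #   ruby -run -e httpd /path/to/mirror -p 8000
    require 'anemone'

    Anemone.crawl("http://localhost:8000/") do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end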
1 vote • 1 answer

Anemone with Rails and MongoDB

I am preparing to write my first web crawler, and it looks like Anemone makes the most sense. There is built in support for MongoDB storage, and I am already using MongoDB via Mongoid in my Rails application. My goal is to store the crawled results,…
Micah Alcorn • 2,363 • 2 • 22 • 45
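A sketch of wiring up Anemone's built-in MongoDB page store; reusing the app's existing Mongoid connection and choosing a collection name are details to adapt:

    require 'anemone'
    require 'mongo'

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.storage = Anemone::Storage.MongoDB   # defaults to a 'pages' collection
      anemone.on_every_page { |page| puts page.url }
    end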
1 vote • 2 answers

Using Ruby's Anemone Gem to Scrape All Email Addresses From a Site

I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hardcoded test case using a URL that has an email address listed on that specific page (so it should find an email…
HMLDude • 1,547 • 7 • 27 • 47
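A single-file sketch of the shape this usually takes; the email pattern is deliberately simplified and will not cover every valid address:

    require 'anemone'

    EMAIL = /[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+/i
    emails = []

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page do |page|
        # Scan the raw HTML so addresses inside mailto: links are caught too
        emails.concat(page.body.to_s.scan(EMAIL))
      end
    end

    puts emails.uniq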
1 vote • 1 answer

Writing the output of a loop into a text file from a Ruby web crawler gem

I'm a complete Ruby noob, currently going through the Treehouse tutorials, but I need some quick help outputting the content of an Anemone crawl into a text file for my job (I'm an SEO). How do I get the following to dump its output into a text…
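A sketch of the usual shape: open the file once around the crawl and write a line per page (the filename is a placeholder):

    require 'anemone'

    # The block form of File.open closes the file automatically
    File.open('crawl_output.txt', 'w') do |file|
      Anemone.crawl("http://example.com/") do |anemone|
        anemone.on_every_page { |page| file.puts page.url }
      end
    end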
1 vote • 1 answer

Not able to access page data, using anemone with socksify gem and Tor

I've written a Ruby script using the anemone gem to crawl a website. The script runs fine when used directly, but I would like to use the socksify gem so that all TCP calls from the script are routed over SOCKS5. I did the following: Installed…
buddy • 189 • 2 • 16
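A sketch of the usual socksify wiring; requiring socksify/http before the crawl matters because it patches Net::HTTP, which Anemone uses underneath (Tor's default SOCKS port assumed):

    require 'socksify/http'   # patches Net::HTTP to tunnel through SOCKS
    require 'anemone'

    TCPSocket.socks_server = "127.0.0.1"
    TCPSocket.socks_port   = 9050   # Tor's default SOCKS5 port

    Anemone.crawl("http://example.com/") do |anemone|
      anemone.on_every_page { |page| puts "#{page.code} #{page.url}" }
    end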
1 vote • 1 answer

Heroku H12 Request timeout when running Ruby Anemone

I have a Ruby app hosted on Heroku that runs Anemone (Ruby web spider / crawler) on user-specified domains. When the user picks a medium-to-large sized domain, it crashes and the logs show an H12 error (Request timeout). This is because Anemone…
dbuss1 • 82 • 4
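Heroku kills web requests after 30 seconds, so the standard fix is to push the crawl into a background worker and have the request return immediately; a hypothetical Sidekiq sketch (CrawlWorker and Result are made-up names):

    require 'sidekiq'
    require 'anemone'

    class CrawlWorker
      include Sidekiq::Worker

      def perform(domain)
        Anemone.crawl(domain) do |anemone|
          anemone.on_every_page { |page| Result.create!(url: page.url.to_s) }
        end
      end
    end

    # In the controller: enqueue and respond right away
    CrawlWorker.perform_async(params[:domain])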
1 vote • 2 answers

When to use 'http://' or 'http://www.' when scraping?

I am scraping a small number of sites with the Ruby anemone gem. Anemone.crawl("http://www.somesite.com") do |anemone| anemone.on_every_page do |page| ... end end Depending on the site, some require 'www' to be present…
Jackson Henley • 1,531 • 2 • 15 • 27
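One hedged approach: let the site itself tell you the canonical host by following its redirect before starting the crawl (canonical_url is a made-up helper, and a relative Location header would need extra handling):

    require 'net/http'
    require 'anemone'

    def canonical_url(url)
      response = Net::HTTP.get_response(URI(url))
      # A site that insists on (or rejects) www answers with a redirect
      response.is_a?(Net::HTTPRedirection) ? response['location'] : url
    end

    Anemone.crawl(canonical_url("http://somesite.com")) do |anemone|
      anemone.on_every_page { |page| puts page.url }
    end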
1 vote • 3 answers

Ruby scraper. How to export to CSV?

I wrote this Ruby script to scrape product info from the manufacturer website. The scraping and storage of the product objects in an array works, but I can't figure out how to export the array data to a CSV file. This error is being…
Dan • 641 • 9 • 25
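A sketch of the export step with Ruby's stdlib CSV; the product field names here are assumptions:

    require 'csv'

    # products is the array built up during the crawl
    CSV.open('products.csv', 'w') do |csv|
      csv << %w[name price url]   # header row
      products.each { |product| csv << [product.name, product.price, product.url] }
    end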
0 votes • 1 answer

Anemone Crawler skip_links_like not obeyed

I am using Anemone to crawl a massive site that, to make things worse, serves the same content under several language versions. There is domain.com/ for the main language and domain.com/de/, domain.com/es/ for the other languages, so I decided to…
Killerpixler • 4,200 • 11 • 42 • 82
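In the Anemone versions I have seen, skip_links_like matches its patterns against the link's path rather than the full URL, which is the usual reason patterns seem to be ignored; a sketch with path-anchored patterns:

    require 'anemone'

    Anemone.crawl("http://domain.com/") do |anemone|
      # Anchor on the path: /de/... and /es/... never get enqueued
      anemone.skip_links_like(%r{^/de/}, %r{^/es/})
      anemone.on_every_page { |page| puts page.url }
    end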
0 votes • 1 answer

"character class has duplicated range" warning in an email regular expression

Result of xmpfilter:

    doc.search('.noimage p:nth-child(5)') do |kaipan|
      x = kaipan.to_s
      x.scan(/[\w\d_-]+@[\w\d_-]+\.[\w\d._-]+/)
      # !> character class has duplicated range: /[\w\d_-]+@[\w\d_-]+\.[\w\d._-]+/
    end

If I don't use…
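The warning itself is benign: \w already covers letters, digits and the underscore, so the \d and _ inside the same character class duplicate ranges. A deduplicated sketch (doc is the question's Nokogiri document):

    # Same intent, without the duplicated ranges inside the classes
    EMAIL = /[\w\-]+@[\w\-]+\.[\w.\-]+/

    doc.search('.noimage p:nth-child(5)').each do |kaipan|
      kaipan.to_s.scan(EMAIL)
    end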