
I have to develop a Ruby on Rails application that fetches all the image, PDF, CGI, etc. file links from a web page.

Aniruddhsinh

3 Answers


The easiest way to grab links from pages is to use URI.extract. From the docs:

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

Usage

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test@example.com"]

Looking at this page:

require 'open-uri'
require 'uri'

# Read the page's HTML, then keep only the extracted URLs that look like images.
html = open('http://stackoverflow.com/questions/8722693/how-to-get-all-image-pdf-and-other-files-links-from-a-web-page/8724632#8724632').read

puts URI.extract(html).select{ |l| l[/\.(?:gif|png|jpe?g)\b/] }

which returns:

http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://sstatic.net/stackoverflow/img/apple-touch-icon.png
http://foobar.com/path/to/file.gif?some_query=1
http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif
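
The question also asks about PDFs, CGI scripts, and so on; assuming the same approach, the only change needed is a wider extension list, for example:

# Same idea, broader extension list (pdf/cgi/css added here as an assumption based on the question).
puts URI.extract(html).select{ |l| l[/\.(?:gif|png|jpe?g|pdf|cgi|css)\b/i] }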
the Tin Man
  • Thanks for such a clear and lucid explanation, but when I try [http://testasp.vulnweb.com](http://testasp.vulnweb.com) it doesn't return anything :( – Aniruddhsinh Jan 04 '12 at 12:31
  • There is one image on the page and it uses a relative `src`, which is why `URI.extract` can't find it; it expects a true URL. Nokogiri makes it easy to extract only the image tags: `doc.search('img')` will return a list of `img` nodes. Extract the `src` attribute and you're done (see the sketch after these comments). – the Tin Man Jan 04 '12 at 18:35
  • Can you please help me correct the above code, i.e. how can I make that relative URL absolute? I also have to search for all file extensions like pdf, css, etc., so I am not using Nokogiri's `doc.search('img')`. – Aniruddhsinh Jan 06 '12 at 06:29
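
For what it's worth, here is a rough sketch of what those last two comments describe, using the page mentioned in the comments and an extension list guessed from the question; treat it as a starting point rather than a finished answer:

require 'nokogiri'
require 'open-uri'

url = 'http://testasp.vulnweb.com'
doc = Nokogiri::HTML(open(url))

# Collect src/href values from tags that can reference files, keep the
# interesting extensions, and resolve relative paths against the page URL.
links = doc.search('img', 'a', 'link', 'script')
           .map { |node| node[:src] || node[:href] }
           .compact
           .grep(/\.(?:gif|jpe?g|png|pdf|css|cgi)\b/i)
           .map { |path| URI.join(url, path).to_s }

puts links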

Have you first worked through a tutorial on how to parse a web page?

Also, just as a note, be careful which sites you parse. Grabbing all those PDFs, images, etc. might be noticed by the site you are scraping; I learned that the hard way.

Sometimes you might be able to get info from feeds. Try this:

Feed Parsing
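
For example, Ruby's standard rss library can pull item and enclosure links out of a feed. A minimal sketch, with a placeholder feed URL:

require 'rss'
require 'open-uri'

# Placeholder URL; substitute the site's real RSS feed.
feed = RSS::Parser.parse(open('http://example.com/feed.rss').read, false)

feed.items.each do |item|
  puts item.link
  # RSS 2.0 enclosures often point directly at media files (images, PDFs, ...).
  puts item.enclosure.url if item.respond_to?(:enclosure) && item.enclosure
end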

Hishalv
  • Thanks :) Do you have any idea how a site admin/webmaster can detect/notice that such files are being scraped? – Aniruddhsinh Jan 04 '12 at 09:11
  • @user1027702 Not sure; I think they blacklist your IP or something. A while back I tried to scrape info from a site and the next thing I knew I could not access the site. It's always best to check with the webmaster to see if it's OK to scrape info. – Hishalv Jan 04 '12 at 09:18

Forget Net::HTTP; OpenURI is much easier. Here's some code to get you started:

require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))

# Walk every node, collect src/href values with an interesting extension,
# resolve them against the page URL, and download each file.
doc.traverse do |el|
  [el[:src], el[:href]].grep(/\.(gif|jpg|png|pdf)$/i).map { |l| URI.join(url, l).to_s }.each do |link|
    File.open(File.basename(link), 'wb') { |f| f << open(link, 'rb').read }
  end
end
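
If you'd rather not walk every node, a selector-based variant of the same idea (just a sketch) only visits elements that can actually carry a file reference:

require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))   # on Ruby 3.0+ use URI.open(url) instead

doc.css('img[src], a[href], link[href]').each do |el|
  link = el[:src] || el[:href]
  next unless link =~ /\.(?:gif|jpg|png|pdf)(?:\?.*)?$/i
  link = URI.join(url, link).to_s
  # Name the local file after the URL path, ignoring any query string.
  File.open(File.basename(URI(link).path), 'wb') { |f| f << open(link, 'rb').read }
end

Restricting the search to img, a, and link tags skips text nodes entirely, and taking the basename of the URL path keeps query strings out of the local file names.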
pguardiario