I have to develop a Ruby on Rails application that fetches all the links to image, PDF, CGI, etc. files from a web page.
-
Begin by reading the `Net::HTTP` docs. – Sergio Tulentsev Jan 04 '12 at 06:08
-
@SergioTulentsev Thanks :) If you could let us know which method or function in Net::HTTP would be most useful, that would be a big help for me :) – Aniruddhsinh Jan 04 '12 at 06:40
3 Answers
7
The easiest way to grab links from pages is to use `URI.extract`. From the docs:
Description
Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.
Usage
require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test@example.com"]
Looking at this page:
require 'open-uri'
require 'uri'
html = open('http://stackoverflow.com/questions/8722693/how-to-get-all-image-pdf-and-other-files-links-from-a-web-page/8724632#8724632').read
puts URI.extract(html).select{ |l| l[/\.(?:gif|png|jpe?g)\b/]}
which returns:
http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://sstatic.net/stackoverflow/img/apple-touch-icon.png
http://foobar.com/path/to/file.gif?some_query=1
http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif

– the Tin Man
-
Thanks for such a clear and lucid explanation, but when I try [http://testasp.vulnweb.com](http://testasp.vulnweb.com) it doesn't return anything :( – Aniruddhsinh Jan 04 '12 at 12:31
-
There is one image in the page and it uses a relative `src`, which is why `URI.extract` can't find it. It expects a true URL. Nokogiri makes it easy to extract only the image tags: `doc.search('img')` will return a list of `img` nodes. Extract the `src` attribute and you're done. – the Tin Man Jan 04 '12 at 18:35
-
Can you please help me correct the above code? That is, how can I make that relative URL absolute? I also have to search for all the file extensions like pdf, css, etc., so I am not using Nokogiri's `doc.search('img')`. – Aniruddhsinh Jan 06 '12 at 06:29
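Following up on those comments, here is a rough sketch of the Nokogiri approach: resolving relative URLs against the page URL with URI.join and filtering on a wider extension list (the list itself is just an example, and the page URL is the one from the comment above):
require 'nokogiri'
require 'open-uri'
require 'uri'

url = 'http://testasp.vulnweb.com/'   # example page from the comment
doc = Nokogiri::HTML(open(url))

# Gather every src/href attribute, resolve relative paths against the page URL,
# then keep only the extensions of interest.
raw = doc.search('[src], [href]').map { |el| el[:src] || el[:href] }.compact
abs = raw.map { |l| URI.join(url, l).to_s rescue nil }.compact
puts abs.select { |l| l[/\.(?:gif|jpe?g|png|pdf|css|cgi)(\?|$)/i] }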
4
Have you worked through a tutorial on how to parse a web page first?
Also, just as a note, be careful which sites you parse. Grabbing all those PDFs, images, etc. might well be noticed by the site you are scraping. I learned that the hard way.
Sometimes you might be able to get the info from the site's feeds instead.
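If the site does publish an RSS feed, Ruby's standard rss library can parse it; a rough sketch for an RSS 2.0 feed (the feed URL is just a placeholder):
require 'rss'
require 'open-uri'

feed_url = 'http://example.com/feed.xml'   # placeholder; use the site's real feed
feed = RSS::Parser.parse(open(feed_url).read, false)

feed.items.each do |item|
  puts item.link    # each entry's URL; item.title and item.description are also available
end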

– Hishalv
-
Thanks :) Do you have any idea how a site admin/webmaster can detect or notice that such files are being scraped? – Aniruddhsinh Jan 04 '12 at 09:11
-
@user1027702 Not sure; I think they blacklist your IP or something. A while back I tried to scrape info from a site and the next thing I knew I could not access it. It's always best to check with the webmaster to see if it is OK to scrape their info. – Hishalv Jan 04 '12 at 09:18
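On the detection point: what usually stands out is a burst of rapid, anonymous requests. A sketch of a slightly politer fetch loop (the URL list, User-Agent string, and delay are made up for illustration):
require 'open-uri'

# Hypothetical list of files to download
urls = ['http://example.com/a.pdf', 'http://example.com/b.gif']

urls.each do |u|
  # Identify the script and avoid hammering the server
  data = open(u, 'User-Agent' => 'MyFetcher/1.0 (me@example.com)').read
  File.open(File.basename(u), 'wb') { |f| f << data }
  sleep 2   # small pause between requests
end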
3
Forget Net::HTTP, OpenURI is much easier. Here's some code to get you started:
require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))

doc.traverse do |el|
  # Check every element's src and href, keep links ending in the wanted extensions,
  # resolve them against the page URL, and save each file into the current directory.
  [el[:src], el[:href]].grep(/\.(gif|jpg|png|pdf)$/i).map { |l| URI.join(url, l).to_s }.each do |link|
    File.open(File.basename(link), 'wb') { |f| f << open(link, 'rb').read }
  end
end

– pguardiario
-
Thanks :) I'm a newbie with Ruby/Rails; can you please explain where the above code stores its results? – Aniruddhsinh Jan 04 '12 at 07:56
-
@pguardiario Your code above works fine, but instead of only `gif|jpg|png|pdf` files, can I grep for any file extension? I need help getting every type of file on a webpage :) – Aniruddhsinh Jan 06 '12 at 09:36
-
Yes, you would just adjust the regex for whatever file extensions you are interested in (see the sketch after these comments). – pguardiario Jan 06 '12 at 23:20
-
This was just what I was looking for. I just added a simple `pages.each` around it (where `pages` is an array of 17 urls I need scanned) and it's working like a charm this very moment! – lyonsinbeta Aug 02 '12 at 14:13
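A sketch of that regex tweak against pguardiario's example above, swapping the fixed extension list for a generic pattern (the pattern here is just one way to do it, not battle-tested):
require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))

doc.traverse do |el|
  # \.\w+ matches any extension (gif, pdf, cgi, ...); (\?.*)? tolerates a trailing query string
  [el[:src], el[:href]].grep(/\.\w+(\?.*)?$/i).map { |l| URI.join(url, l).to_s }.each do |link|
    File.open(File.basename(link), 'wb') { |f| f << open(link, 'rb').read }
  end
end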