
I have to develop a Ruby on Rails application that fetches all the image, PDF, CGI, etc. file links from a web page.

Aniruddhsinh

3 Answers


The easiest way to grab links from pages is to use URI.extract. From the docs:

Description

Extracts URIs from a string. If block given, iterates through all matched URIs. Returns nil if block given or array with matches.

Usage

require "uri"

URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.com/bla", "mailto:test@example.com"]

Looking at this page:

require 'open-uri'
require 'uri'

# Read the page's HTML, then keep only the extracted URLs that look like images.
html = open('http://stackoverflow.com/questions/8722693/how-to-get-all-image-pdf-and-other-files-links-from-a-web-page/8724632#8724632').read

puts URI.extract(html).select{ |l| l[/\.(?:gif|png|jpe?g)\b/] }

which returns:

http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
http://sstatic.net/stackoverflow/img/apple-touch-icon.png
http://foobar.com/path/to/file.gif?some_query=1
http://pixel.quantserve.com/pixel/p-c1rF4kxgLUzNc.gif
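
The question also asks about PDFs, CGI scripts, and so on; assuming the same approach, the only change needed is a wider extension list, for example:

# Same idea, broader extension list (pdf/cgi/css added here as an assumption based on the question).
puts URI.extract(html).select{ |l| l[/\.(?:gif|png|jpe?g|pdf|cgi|css)\b/i] }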
the Tin Man
  • Thanks for such a clear and lucid explanation, but when I try [http://testasp.vulnweb.com](http://testasp.vulnweb.com) it doesn't return anything :( – Aniruddhsinh Jan 04 '12 at 12:31
  • There is one image on the page and it uses a relative `src`, which is why `URI.extract` can't find it; it expects a true URL. Nokogiri makes it easy to extract only the image tags: `doc.search('img')` will return a list of `img` nodes. Extract the `src` attribute and you're done (see the sketch after these comments). – the Tin Man Jan 04 '12 at 18:35
  • Can you please help me correct the above code, i.e. how can I make that relative URL absolute? I also have to search for all file extensions like pdf, css, etc., so I am not using Nokogiri's `doc.search('img')`. – Aniruddhsinh Jan 06 '12 at 06:29
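
For what it's worth, here is a rough sketch of what those last two comments describe, using the page mentioned in the comments and an extension list guessed from the question; treat it as a starting point rather than a finished answer:

require 'nokogiri'
require 'open-uri'

url = 'http://testasp.vulnweb.com'
doc = Nokogiri::HTML(open(url))

# Collect src/href values from tags that can reference files, keep the
# interesting extensions, and resolve relative paths against the page URL.
links = doc.search('img', 'a', 'link', 'script')
           .map { |node| node[:src] || node[:href] }
           .compact
           .grep(/\.(?:gif|jpe?g|png|pdf|css|cgi)\b/i)
           .map { |path| URI.join(url, path).to_s }

puts links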

Have you first worked through a tutorial on how to parse a web page?

Also, just as a note, be careful which sites you parse. Grabbing all those PDFs, images, etc. might be noticed by the site you are scraping; I learned that the hard way.

Sometimes you might be able to get info from feeds. Try this:

Feed Parsing
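
For example, Ruby's standard rss library can pull item and enclosure links out of a feed. A minimal sketch, with a placeholder feed URL:

require 'rss'
require 'open-uri'

# Placeholder URL; substitute the site's real RSS feed.
feed = RSS::Parser.parse(open('http://example.com/feed.rss').read, false)

feed.items.each do |item|
  puts item.link
  # RSS 2.0 enclosures often point directly at media files (images, PDFs, ...).
  puts item.enclosure.url if item.respond_to?(:enclosure) && item.enclosure
end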

Hishalv
  • Thanks :) Do you have any idea how a site admin/webmaster can detect/notice that such files are being scraped? – Aniruddhsinh Jan 04 '12 at 09:11
  • @user1027702 Not sure; I think they blacklist your IP or something. A while back I tried to scrape info from a site and the next thing I knew I could not access the site. It's always best to check with the webmaster to see if it's OK to scrape info. – Hishalv Jan 04 '12 at 09:18

Forget Net::HTTP; OpenURI is much easier. Here's some code to get you started:

require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))

# Walk every node, collect src/href values with an interesting extension,
# resolve them against the page URL, and download each file.
doc.traverse do |el|
  [el[:src], el[:href]].grep(/\.(gif|jpg|png|pdf)$/i).map { |l| URI.join(url, l).to_s }.each do |link|
    File.open(File.basename(link), 'wb') { |f| f << open(link, 'rb').read }
  end
end
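
If you'd rather not walk every node, a selector-based variant of the same idea (just a sketch) only visits elements that can actually carry a file reference:

require 'nokogiri'
require 'open-uri'

url = 'http://www.google.com/'
doc = Nokogiri::HTML(open(url))   # on Ruby 3.0+ use URI.open(url) instead

doc.css('img[src], a[href], link[href]').each do |el|
  link = el[:src] || el[:href]
  next unless link =~ /\.(?:gif|jpg|png|pdf)(?:\?.*)?$/i
  link = URI.join(url, link).to_s
  # Name the local file after the URL path, ignoring any query string.
  File.open(File.basename(URI(link).path), 'wb') { |f| f << open(link, 'rb').read }
end

Restricting the search to img, a, and link tags skips text nodes entirely, and taking the basename of the URL path keeps query strings out of the local file names.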
pguardiario