-1

I need to find the distance between two websites using Ruby's open-uri. I am using:

def check(url)
    site = open(url.base_url)
    link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$}
    site.each_line {|line| puts $&,$1,$2,$3,$4 if (line=~link)}
    p url.links
end

Finding the links is not working properly. Any ideas why?

Torianin
    None at all, without knowing what kind of structure `url` has, or what your error is. – Thilo Nov 12 '12 at 23:19

2 Answers

4

If you want to find the `a` tags' `href` attributes, use the right tool, which is rarely a regex. You are more likely to want an HTML/XML parser.

Nokogiri is the parser of choice with Ruby:

require 'nokogiri'
require 'open-uri'

# URI.open is required on Ruby 3.0+, where open-uri no longer patches Kernel#open.
doc = Nokogiri.HTML(URI.open('http://www.example.org/index.html'))
doc.search('a').map{ |a| a['href'] }

pp doc.search('a').map{ |a| a['href'] }
# => [
# =>  "/",
# =>  "/domains/",
# =>  "/numbers/",
# =>  "/protocols/",
# =>  "/about/",
# =>  "/go/rfc2606",
# =>  "/about/",
# =>  "/about/presentations/",
# =>  "/about/performance/",
# =>  "/reports/",
# =>  "/domains/",
# =>  "/domains/root/",
# =>  "/domains/int/",
# =>  "/domains/arpa/",
# =>  "/domains/idn-tables/",
# =>  "/protocols/",
# =>  "/numbers/",
# =>  "/abuse/",
# =>  "http://www.icann.org/",
# =>  "mailto:iana@iana.org?subject=General%20website%20feedback"
# => ]
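The list above mixes site-relative paths with absolute URLs. If the goal is to follow links from one site to another, a reasonable next step is to resolve each href against the page's own URL. Here is a minimal sketch using only the standard library's `URI.join`; the `base` and `hrefs` values are sample data picked from the output above, not part of the original code:

```ruby
require 'uri'

# Assumed sample data: the page URL and a few hrefs from the output above.
base  = 'http://www.example.org/index.html'
hrefs = ['/', '/domains/', 'http://www.icann.org/']

# URI.join resolves a relative reference against the base URL;
# an href that is already absolute is returned unchanged.
absolute = hrefs.map { |href| URI.join(base, href).to_s }

puts absolute
```

Non-HTTP links such as `mailto:` would probably need to be filtered out before following anything.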
the Tin Man
1

I see several issues with this regular expression:

  • It is not necessarily the case that a space must come before the trailing slash in an empty tag, yet your regexp requires it

  • Your regexp is very verbose and redundant

Try the following instead; it will extract the URL from <a> tags:

link = /<a \s   # Start of tag
    [^>]*       # Some whitespace, other attributes, ...
    href="      # Start of URL
    ([^"]*)     # The URL, everything up to the closing quote
    "           # The closing quotes
    /x          # We stop here, as regular expressions wouldn't be able to
                # correctly match nested tags anyway
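As a rough sketch of how this pattern could be applied, `String#scan` collects every captured URL from a block of markup. The `html` string below is made-up sample input, and the pattern is the one above with its comments removed. Note that it only handles double-quoted `href` values:

```ruby
# Hypothetical sample markup, used here instead of a live page.
html = <<~HTML
  <p><a href="/about/">About</a> and
  <a class="ext" href="http://www.icann.org/">ICANN</a></p>
  <a name="top">no href here</a>
HTML

# Same pattern as above; in /x mode the literal spaces are ignored.
link = /<a \s [^>]* href=" ([^"]*) "/x

# scan returns one array of capture groups per match, so flatten
# the result into a plain list of URLs.
hrefs = html.scan(link).flatten

p hrefs
```

The third tag has no `href`, so it is skipped rather than producing an empty entry.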
Sjlver