Scanning a webpage for URLs with Ruby and regex

Question

I'm trying to create an array of all links found at the URL below. Using page.scan(URI.regexp) or URI.extract(page) returns more than just URLs.

How do I get just the URLs?

require 'net/http'
require 'uri'

uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7")
page = Net::HTTP.get(uri)
p page.scan(URI.regexp)
p URI.extract(page)

Casper · Accepted Answer · 2016-10-09T13:42:32.873

If you only want the links (<a href="..."> elements) from that text file, it is more reliable to parse it as real HTML with Nokogiri and then extract the links like this:

require 'nokogiri'
require 'open-uri'

# Parse the raw HTML text (URI.open is needed on Ruby 3.0+, where plain open no longer fetches URLs)
doc = Nokogiri::HTML(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))

# Extract all a-elements (HTML links)
all_links = doc.css('a')

# Drop empty links, remove duplicates, and sort
links = all_links.map { |link| link['href'].to_s }.reject(&:empty?).uniq.sort

# Print out some of them
puts links.grep(/store/)

http://store.steampowered.com/app/214590/
http://store.steampowered.com/app/218090/
http://store.steampowered.com/app/220780/
http://store.steampowered.com/app/226720/
...
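
If you would rather stay with a regex and avoid the Nokogiri dependency, one option is to scan only for http/https URLs instead of relying on URI.regexp, which matches anything that looks like scheme:something. The following is a minimal sketch, not a drop-in solution: the sample page string and the HTTP_URL constant are made up for illustration, and the pattern simply stops at whitespace, quotes, or angle brackets, so it will miss relative links and may need tuning for your page.

```ruby
# Hypothetical sample standing in for the downloaded page text.
page = <<~HTML
  <p style="color:red">See <a href="http://store.steampowered.com/app/214590/">one</a>
  and <a href="http://store.steampowered.com/app/218090/">two</a>,
  then <a href="http://store.steampowered.com/app/214590/">one again</a>.</p>
HTML

# Assumed pattern: match only http/https URLs and stop at
# whitespace, quotes, or angle brackets.
HTTP_URL = %r{https?://[^\s"'<>]+}

# scan returns every match as a string; then dedupe and sort.
urls = page.scan(HTTP_URL).uniq.sort
puts urls
```

Note that the style="color:red" attribute is not picked up here, because the pattern requires an explicit http:// or https:// prefix. For anything beyond a quick extraction, parsing the HTML as shown above is still the safer route.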
