Finding all links from ten URLs while reading a file

Question

How can I extract the href attribute of every <a> tag on a page, for each URL read from a file?

If I have a text file that contains the target URLs:

http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html

Here's the code I have:

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    # set the page_url to the current line 
    page = Nokogiri::HTML(open(line))
    links = page.css("a")
    puts links[0]["href"]
  end
end

Cody Caughlan · Accepted Answer · 2015-10-21T19:11:04.230

I'd flip it around. I would first parse the text file and load each line into memory (assuming it's a small enough data set). Then create one Nokogiri document for your HTML and extract all the href attributes (as you are already doing).

Something like this untested code:

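require 'nokogiri'

links = []
hrefs = []

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    # strip the trailing newline so the comparison below can match
    links << line.chomp
  end
end

# html is assumed to already hold the page's HTML as a string
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
  hrefs << tag['href']
end

links.each do |link|
  if hrefs.include?(link)
    puts "it's here"
  end
end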
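
Each line read from the file keeps its trailing newline unless it is stripped, which makes exact comparisons against href values fail. A minimal sketch of the loading step on its own, assuming a hypothetical helper name load_urls (not part of the original answer) and that blank lines should be skipped:

```ruby
# Hypothetical helper: read one URL per line into an array.
# chomp: true removes each trailing newline; strip and reject
# drop stray whitespace and blank lines.
def load_urls(path)
  File.readlines(path, chomp: true)
      .map(&:strip)
      .reject(&:empty?)
end
```

Calling load_urls("myfile.txt") would then return an array of clean URL strings that can be compared directly or passed to an HTTP client.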

score 0 · Answer 2 · answered Oct 31 '15 at 00:07

If all I wanted to do was output the 'href' for each <a>, I'd write something like:

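require 'nokogiri'
require 'open-uri'

File.foreach('myfile.txt') do |url|
  # chomp removes the trailing newline; URI.open is required on Ruby 3.0+
  page = Nokogiri::HTML(URI.open(url.chomp))
  puts page.search('a').map { |link| link['href'] }
end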

Of course, <a> tags don't have to have an 'href', but puts won't care: it prints an empty line for each nil.
