How can I extract every href from the <a> tags on a page while reading URLs from a file?

If I have a text file that contains the target URLs:

http://mypage.com/1.html
http://mypage.com/2.html
http://mypage.com/3.html
http://mypage.com/4.html

Here's the code I have:

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    # set the page_url to the current line 
    page = Nokogiri::HTML(open(line))
    links = page.css("a")
    puts links[0]["href"]
  end
end
the Tin Man
user3610137

2 Answers

I'd flip it around. I would first parse the text file and load each line into memory (assuming it's a small enough data set). Then create one Nokogiri document for your HTML and extract all of the href attributes (as you are doing).

Something like this untested code:

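require 'nokogiri'

links = []
hrefs = []

File.open("myfile.txt", "r") do |f|
  f.each_line do |line|
    # Strip the trailing newline so the comparison below can match.
    links << line.chomp
  end
end

# `html` is assumed to already hold the page source as a String.
page = Nokogiri::HTML(html)
page.css("a").each do |tag|
  hrefs << tag['href']
end

links.each do |link|
  if hrefs.include?(link)
    puts "it's here"
  end
end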
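
Each line read from a file normally ends in a newline, so an exact comparison against an href only matches once that newline is stripped. Here is a minimal sketch of the load-then-compare idea using only the standard library. The helper names `read_urls` and `urls_present` are made up for illustration, and it assumes the hrefs have already been collected into an array:

```ruby
require 'set'

# Hypothetical helper: read one URL per line, dropping the trailing
# newline and surrounding whitespace, and skipping blank lines.
def read_urls(path)
  File.foreach(path, chomp: true).map(&:strip).reject(&:empty?)
end

# Hypothetical helper: return the URLs from the list that also appear
# among the hrefs. nil entries (tags without an href) are ignored.
def urls_present(urls, hrefs)
  seen = hrefs.compact.to_set
  urls.select { |url| seen.include?(url) }
end
```

Something like `urls_present(read_urls("myfile.txt"), hrefs)` would then return the matching URLs instead of printing inside the loop. The Set only matters if the lists get large.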
Cody Caughlan

If all I wanted to do was output the 'href' for each <a>, I'd write something like:

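require 'nokogiri'
require 'open-uri'

File.foreach('myfile.txt') do |url|
  # Strip the trailing newline before opening the URL.
  page = Nokogiri::HTML(URI.open(url.chomp))
  puts page.search('a').map { |link| link['href'] }
end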

Of course, <a> tags don't have to have an 'href', but puts won't care: it prints a blank line for each nil.

the Tin Man