0

I have a problem where I need to list every country in an Excel list, using Open-URI. Everything works, but I can't figure out how to get my regexp to match single-word country names (like "Sweden") as well as names such as "South Africa" that contain whitespace. I hope I've explained the problem clearly; the relevant pieces are below.

The text I want to match looks like this, for example:

<a href="wf.html">Wallis and Futuna</a>
<a href="ym.html">Yemen</a>

I am currently stuck with this regexp:

/a.+="\w{2}.html">(\w*)<.+{1}/

As you can see, matching "Yemen" is no problem, but I want the pattern to match both "Wallis and Futuna" and "Yemen". Perhaps there is a way to capture everything inside the given ">blabla bla<"? Any thoughts? I would be very grateful!

user1937198
Fjurg

2 Answers

5

It is generally a bad idea to use a regex to extract elements from HTML. An HTML parser such as Nokogiri handles this more reliably:

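require 'nokogiri'

# your_html is the page source as a String (for example, fetched with Open-URI)
parser = Nokogiri::HTML.parse(your_html)
country_links = parser.css("a")
country_links.each do |link|
  puts link['href']  # the link target, e.g. "wf.html"
  puts link.text     # the link text, e.g. "Wallis and Futuna"
end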
Michael Papile
  • I agree. Using regexes for HTML „parsing“ has proven to be a bad idea. – Patrick Oscity Mar 25 '13 at 18:29
  • +1 This is the only really bullet-proof solution. HTML is too irregular for a regex pattern to handle. – the Tin Man Mar 25 '13 at 19:06
  • Thank you for your answer. I tried your solution and it worked. I might add that the program is only for educational purposes; nevertheless, I appreciate your input! – Fjurg Mar 25 '13 at 20:19
1

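For your test sample, the pattern below works. The character class [\w\s] accepts both word characters and whitespace, so multi-word names are captured as well:

/<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/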
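As a quick check, the pattern can be run against the two sample lines from the question. This is only a minimal sketch: the variable names `html`, `pattern`, and `countries` are made up for illustration, the sample is hard-coded rather than fetched with Open-URI, and the dot before `html` is escaped so that it matches a literal dot.

```ruby
# The two sample links from the question, hard-coded for the demonstration.
html = <<~HTML
  <a href="wf.html">Wallis and Futuna</a>
  <a href="ym.html">Yemen</a>
HTML

# [\w\s]+ allows word characters and whitespace inside the link text.
pattern = /<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/

# String#scan returns one array of captures per match; flatten yields the names.
countries = html.scan(pattern).flatten

puts countries
```

If the link text may contain other characters, swapping `[\w\s]+` for `[^<]+` would capture everything up to the next `<`, which is closer to the "everything inside" idea in the question.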
Arie Xiao
  • A regex-based solution is fragile and will fail if the HTML changes. Imagine what will happen if the `href` was missing a trailing `"`, had ` = ` instead of `=`, had punctuation inside the link text, or the tag was missing a closing `</a>`. – the Tin Man Mar 25 '13 at 19:09
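
Two of the failure modes mentioned in that comment can be reproduced with a short sketch. The input strings here are invented for illustration (the `xx.html` link and its text are hypothetical, not taken from the real page), and the pattern is the one from the answer with the dot escaped.

```ruby
pattern = /<a[^>]+href="\w{2}\.html">([\w\s]+)<\/a>/

# Hypothetical inputs: one well-formed link and two small variations of it.
samples = {
  "plain"       => '<a href="ym.html">Yemen</a>',
  "punctuation" => '<a href="xx.html">Some-Name (Test)</a>',
  "spaced ="    => '<a href = "ym.html">Yemen</a>',
}

# Record whether the pattern matches each variation.
results = samples.transform_values { |html| html.match?(pattern) }

results.each do |label, matched|
  puts "#{label}: #{matched ? 'matched' : 'missed'}"
end
```

Only the plain link matches. The hyphen and parentheses fall outside `[\w\s]`, and the spaces around `=` break the literal `href="`.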