I'm trying to use Mechanize to scrape some tags from a page. I've scraped them successfully with Nokogiri before, but now I'm trying to fold that code into a wider Mechanize class. Here is the Nokogiri statement:

page = Nokogiri::HTML(open(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT']))
@model.icons = page.css("link[rel='apple-touch-icon']").to_s

And here is what I thought would be the Mechanize equivalent but it's not working:

agent = Mechanize.new
page = agent.get(@model.url, "User-Agent" => request.env['HTTP_USER_AGENT'])
@model.icons = page.search("link[rel='apple-touch-icon']").to_s

The first one returns a link tag as expected: <link rel="apple-touch-icon" etc etc..></link>. The second statement returns a blank string. If I take the to_s off the end, I get a very long output, which I assume is either an error or the Mechanize object itself.

Link to long output when not converting to string: https://gist.github.com/eadam/5583541

the Tin Man
Adam
  • Define "not working". What do you get as the return value of the `search` method? What do you expect to get? It would also be helpful if you pointed us to the page or included the appropriate snippet. – Mark Thomas May 14 '13 at 00:42
  • I've updated the question with the full statements and a definition of "not working". Thanks. – Adam May 14 '13 at 22:37
  • Can you post the "super long output" you get? – Mark Thomas May 15 '13 at 02:27
  • Please don't put a link to your "long output". *WHEN* that link breaks your question will be pretty useless for others who are looking for the same answer in the future. Instead, summarize what you are linking to if it's truly that long, and provide the link to the complete information. If it's not really that long, append it to your question. – the Tin Man May 15 '13 at 19:10

1 Answer

Without sample HTML it's difficult to recreate the problem, so this is some general information that might help you.

That "long output" is the inspect output of the Nokogiri::XML::NodeSet you got back from the search method. If search returns multiple nodes, or the nodes have lots of children, the inspect output can run long, but that is what it is supposed to do.

css and search are very similar in that both return a NodeSet. css assumes the string passed in is a CSS selector, while search is more generic and tries to work out whether it was given a CSS or an XPath expression. If it guesses wrong, the pattern is unlikely to match anything. You can use at or search to be generic and let Nokogiri figure it out, or be explicit with at_css and at_xpath in place of at, and css and xpath in place of search. The at variants all return the first matching Node, similar to using search('some_path').first.

to_s turns the NodeSet back into a representation of the source that was passed in. I prefer to be more explicit, using either to_xml, to_xhtml or to_html.

Why don't you get output for search like you do for css? I don't know because I can't test against the HTML you're parsing. Answering questions, like data-processing, is a GIGO situation.

the Tin Man