
My text file contains a list of URLs. Using Mechanize, I'm parsing out the title and meta description from each one. However, some of those pages don't have a meta description, which stops my script with a nil error:

undefined method `[]' for nil:NilClass (NoMethodError)

I've read up and seen solutions for Rails, but for plain Ruby I've only seen reject and compact suggested as ways to ignore nil values. I added compact at the end of the loop, but it doesn't seem to do anything.

require 'rubygems'
require 'mechanize'

File.readlines('parsethis.txt').each do |line|
    page = Mechanize.new.get(line)
    title = page.title
    metadesc = page.at("head meta[name='description']")[:content]
    puts "%s, %s, %s" % [line.chomp, title, metadesc] 
end.compact!

It's just a list of URLs in a text file, like this:

http://www.a.com
http://www.b.com

This is what will output in the console for example:

 http://www.a.com, Title, This is a description.

If a page in the list has no description or title, the script throws the nil error. I don't want it to skip any URLs; I want it to go through the whole list.

  • One way is `page.at("head meta[name='description']").attributes[:content]`. – Arup Rakshit May 08 '15 at 16:12
  • What do you want to have happen when a page doesn't have the description content? Skip that entry entirely or return a string without two ending commas? Why do you `puts` the generated string? Do you want the screen output and then rely on the side-effect of `puts` returning the value that was output? In general this isn't how we'd write this code so a better description of the data that you want would help. Also, it'd REALLY help if you gave some sample URLs and an example of the output. This seems like an XY problem and you asked about Y but need to ask about X. – the Tin Man May 08 '15 at 16:34
  • @theTinMan It's not really that complex. I don't want any of the entries to be skipped; I just don't want it to throw an error when it prints to the console (via puts) because a page has no description. The mechanize gem lets you get data from web pages as long as you have the URL. So I have a text file with a list of URLs, line by line. The console outputs each URL along with its associated HTML title and meta description, like this: http://www.a.com, Title, This is a description. – mr. greybox tester May 08 '15 at 16:42
  • Please edit your question and add the information there where everyone can easily find it. And show an input and output example. – the Tin Man May 08 '15 at 16:43
  • @theTinMan Added the info to the question. – mr. greybox tester May 08 '15 at 16:49

2 Answers


Here is one way to do it:

Edit (for the added requirement to not skip any URLs):

metadesc = page.at("head meta[name='description']")
puts "%s, %s, %s" % [line.chomp, title, metadesc ? metadesc[:content] : "N/A"]
seph
  • Thanks, but it seemed to throw the nil error immediately, rather than working through the list before erroring out once it hit a URL without a description or title. – mr. greybox tester May 08 '15 at 16:53
  • Using the ternary operator to handle the nil value works. I'll soon adjust it to handle titles also. Thanks for your help! – mr. greybox tester May 08 '15 at 19:25

This is untested but I'd do something like this:

require 'open-uri'
require 'nokogiri'

page_info = {}
File.foreach('parsethis.txt') { |line|
  url = line.chomp # foreach keeps the trailing newline; strip it before fetching
  page = Nokogiri::HTML(open(url))
  title = page.title
  meta_desc = page.at("head meta[name='description']")
  meta_desc_content = meta_desc ? meta_desc[:content] : nil
  page_info[url] = {:title => title, :meta_desc => meta_desc_content} 
}

page_info.each do |url, info|
  puts [
    url,
    info[:title],
    info[:meta_desc]
  ].join(', ')
end
  • File.foreach iteratively reads a file, returning each line individually.
  • page.title could return a nil if a page doesn't have a title; titles are optional in pages.
  • I break down accessing the meta-description into two steps. Meta tags are optional in HTML so they might not exist, at which point a nil would be returned. Trying to access a content= parameter would result in an exception. I think that's what you're seeing.

    Instead, in my code, meta_desc_content is conditionally assigned a value if the meta-description tag was found, or nil.

The code populates the page_info hash with key/value pairs of the URL and its associated title and meta-description. I did it this way because a hash-of-hashes, or possibly an array-of-hashes, is a very convenient structure for all sorts of secondary manipulations, such as returning the information as JSON or inserting into a database.

As a second step the code iterates over that hash, retrieving each key/value pair. It then joins the values into a string and prints them.
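
For example, serializing that same hash as JSON takes one extra step. A hypothetical follow-on using Ruby's standard json library:

require 'json'

puts JSON.pretty_generate(page_info)
# {
#   "http://www.a.com": {
#     "title": "Title",
#     "meta_desc": "This is a description."
#   },
#   ...
# }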

There are lots of things in your code that are either wrong, or not how I'd do them:

  • File.readlines('parsethis.txt') reads the whole file into an array, which you then iterate over. That isn't scalable or memory-efficient. File.foreach reads one line at a time and is faster than File.readlines(...).each, so get in the habit of using it unless you are absolutely sure you know why you should use readlines.
  • You use Mechanize for something that Nokogiri and OpenURI can do faster. Mechanize is a great tool if you are working with forms and need to navigate a site, but you're not doing that, so instead you're dragging around additional code-weight that isn't necessary. Don't do that; it leads to slow programs, among other things.
  • page.at("head meta[name='description']")[:content] is an exception waiting to happen. As I said above, meta-descriptions won't necessarily exist in a page. If one doesn't, you're trying to do nil[:content], which will raise exactly the exception you're seeing. Instead, work your way down to the data you want so you can make sure the meta-description exists before you try to get at its content.
  • You can't use compact or compact! the way you were. each returns its receiver, so you're calling compact! on the array of lines from readlines, which never contains a nil; there's nothing to remove, and the return value is discarded anyway (see the snippet below). You could have used map, but the logic would have been messy, and puts inside map is rarely used. (Probably shouldn't be used is more likely, but that's a different subject.)
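
To illustrate that last point, a quick demonstration of why chaining compact! after each is a no-op here, assuming a file of plain URL lines:

lines = File.readlines('parsethis.txt') # => ["http://www.a.com\n", "http://www.b.com\n"]
lines.each { |line| line }.compact!     # => nil; each returned lines, which has no nils to strip
[1, nil, 2, nil].compact                # => [1, 2] -- compact only removes nil elements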
the Tin Man
  • Well...OpenURI couldn't handle the load and timed out when running it. Thanks for your opinions and explanation of them though. – mr. greybox tester May 08 '15 at 19:37
  • A timeout is rarely a case of OpenURI having a problem. It's more likely the server refused to send a file and closed the connection. That can happen if you're repeatedly hitting a server and exceed their limits for connections, which can happen very easily if you're not being a good network-citizen. – the Tin Man May 08 '15 at 22:52