I have some 'generic' methods that extract data using CSS selectors that are usually the same across many websites. However, I also have another method that accepts the CSS selectors for a given website as arguments.

I need to call the get_title method if the title_selector argument is not passed. How can I do that?

Scrape method that accepts CSS selectors as arguments

  require 'nokogiri'
  require 'open-uri'

  def scrape(urls, item_selector, title_selector, price_selector, image_selector)
    collection = []
    urls.each do |url|
      doc = Nokogiri::HTML(URI.open(url).read) # Fetch and parse the listing page
      @items = doc.css(item_selector)[0..1].map { |item| item['href'] } # Links of the first two items
      @items.each do |item| # Download each link and parse it
        page = Nokogiri::HTML(URI.open(item).read)
        collection << {
          :title => page.css(title_selector).text, # I guess I need a conditional here
          :price => page.css(price_selector).text
        }
      end
      @collection = collection
    end
  end

Generic title extractor

  def get_title(doc)
    if doc.at_css("meta[property='og:title']")
      title = doc.css("meta[property='og:title']") # Note: this is a NodeSet, not text
    else
      title = doc.at_css('title').text
    end
  end
Community

1 Answer

Use the or operator (||) inside your page.css call. It will call get_title if title_selector is falsey (nil or false).

:title => page.css(title_selector || get_title(doc)).text,

I'm not sure what doc should actually be in this context, though.
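
To illustrate only the fallback behaviour of `||`, here is a minimal sketch. The helper name `selector_or_default` and the default selector `'title'` are hypothetical, chosen for illustration and not taken from the question. Note that both operands need to be the same kind of value (here, selector strings) for the result to make sense as an argument to `page.css`.

```ruby
# Assumed default selector, for illustration only.
DEFAULT_TITLE_SELECTOR = 'title'

# Hypothetical helper: returns the given selector, or the default one
# when the caller passed nil (or false).
def selector_or_default(title_selector = nil)
  title_selector || DEFAULT_TITLE_SELECTOR
end

selector_or_default('#title > .title-class > h1') # => "#title > .title-class > h1"
selector_or_default(nil)                          # => "title"
selector_or_default('')                           # => "" (an empty string is truthy in Ruby)
```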

EDIT

Given your comment below, I think you can just refactor get_title to handle all of the logic. Allow get_title to take an optional title_selector parameter and add this line to the top of your method:

return doc.css(title_selector).text if title_selector

Then, my original line becomes:

:title => get_title(page, title_selector)
Johnson
  • Careful. `.text` is only called on one of those, and one is `css`, the other `at_css`. – tadman Jul 12 '16 at 16:54
  • Mmm, I guess I didn't explain it well. The argument inside page.css() is a CSS selector like '#title > .title-class > h1'. So get_title(doc) inside it won't work, because get_title returns a scraped page title ('Amazon.com ...'), not a CSS selector. FYI, doc is the full HTML source as a Nokogiri document. Thanks. – Francisco Campaña Jul 12 '16 at 20:09
  • Ah, your explanation was fine. My reading comprehension was lacking. I *think* my edit should cover it now. – Johnson Jul 12 '16 at 20:26
  • Thanks. Your solution is perfect. – Francisco Campaña Jul 12 '16 at 21:58