I am working on a JavaScript-capable screen scraper using capybara/dsl, the Selenium WebDriver, and the Spreadsheet gem. I am very close to the desired output, but two major problems remain:
I have not been able to figure out an XPath selector that yields only the elements I'm looking for. To ensure that none are missed, I am using a broad selector that I know will produce duplicate elements. I was planning to just call .uniq on that result, but this throws undefined method 'uniq'. What is the proper way to get the desired filtering? Maybe I'm not using it properly:
results = all("//a[contains(@onclick, 'analyticsLog')]").uniq
I know that the XPath I have chosen to extract hrefs, //a[contains(@onclick, 'analyticsLog')], matches more nodes than I intend: inspecting the page elements with find shows 144 matches rather than the 72 results that actually make up the page. I have looked for a more specific selector, but every candidate filters out some desired links because of the business logic used on the site.
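To make that concrete, this is roughly the deduplication I'm after (a sketch only; since the duplicate a nodes are distinct elements, deduplicating the extracted hrefs seems more promising than deduplicating the nodes themselves, though I'm not sure whether calling Enumerable methods like map on a Capybara result set is the right approach):

# Sketch: collect the hrefs as plain strings first, then call uniq on the
# resulting Array, since uniq apparently isn't available on the result set.
links = all("//a[contains(@onclick, 'analyticsLog')]").map { |a| a[:href] }.uniq
links.each do |link|
  visit link
  save_item
end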
My save_item method also has two selectors that are not always found within the info results. I would like the script to simply skip the ones that aren't found and save only the ones that are, but my current iteration throws Capybara::ElementNotFound and exits. How could I configure this to work the intended way? (A sketch of the behavior I'm after follows the code.)
Code below:
require "capybara/dsl"
require "spreadsheet"
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
  include Capybara::DSL

  def initialize
    @excel = Spreadsheet::Workbook.new
    @work_list = @excel.create_worksheet
    @row = 0
  end

  def go
    visit_main_link
  end

  def visit_main_link
    visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
    # I would like to use .uniq here to filter out the duplicates that I know
    # this selector will deliver
    results = all("//a[contains(@onclick, 'analyticsLog')]")
    item = []
    results.each do |a|
      item << a[:href]
    end
    item.each do |link|
      visit link
      save_item
    end
    @excel.write "inventory.csv"
  end

  def save_item
    data = all("//*[@id='content-wrapper']/div[2]/div/div")
    data.each do |info|
      @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
      @work_list[@row, 1] = info.find("//div[contains(@class, 'price font left')]").text
      @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
      @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
      # These two selectors will not always be found, depending on the item in
      # question; how do I ensure a miss doesn't crash the program?
      @work_list[@row, 4] = info.find("//select[contains(@name, 'options[747]')]//*[@price='0']").text
      @work_list[@row, 5] = info.find("//select[contains(@name, 'options[748]')]//*[@price='0']").text
      @row = @row + 1
    end
  end
end
tomtop = Tomtop.new
tomtop.go
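For the second problem, this is the skip behavior I'd like save_item to have (again just a sketch; the option_a/option_b locals are placeholder names, and I don't know whether taking .first from all, which I believe returns an empty result rather than raising, is cleaner than rescuing Capybara::ElementNotFound):

# Sketch: all() returns an empty result set instead of raising when nothing
# matches, so .first yields either the element or nil, and a nil simply
# leaves that cell blank.
option_a = info.all("//select[contains(@name, 'options[747]')]//*[@price='0']").first
@work_list[@row, 4] = option_a.text if option_a
option_b = info.all("//select[contains(@name, 'options[748]')]//*[@price='0']").first
@work_list[@row, 5] = option_b.text if option_b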