I am working on a JavaScript-capable screen scraper using capybara/dsl, the Selenium WebDriver, and the Spreadsheet gem. I am very close to the desired output, but two major problems remain:
I have not been able to figure out an XPath selector that yields only the elements I'm looking for. To ensure that none are missed, I am using a broad selector that I know will produce duplicate elements. I was planning to just call .uniq on that result, but this throws undefined method 'uniq'. What is the proper way to get the desired filtering? Maybe I'm not using it properly:
results = all("//a[contains(@onclick, 'analyticsLog')]").uniq
I know that the XPath I have chosen to extract hrefs, //a[contains(@onclick, 'analyticsLog')], matches more nodes than I intend: inspecting the page elements with find shows 144 matches rather than the 72 results that actually make up the page. I have looked for a more specific selector, but every candidate filters out some desired links because of the business logic used on the site.
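To make that concrete, this is roughly the deduplication I'm after (a sketch only; since the duplicate a nodes are distinct elements, deduplicating the extracted hrefs seems more promising than deduplicating the nodes themselves, though I'm not sure whether calling Enumerable methods like map on a Capybara result set is the right approach):

# Sketch: collect the hrefs as plain strings first, then call uniq on the
# resulting Array, since uniq apparently isn't available on the result set.
links = all("//a[contains(@onclick, 'analyticsLog')]").map { |a| a[:href] }.uniq
links.each do |link|
  visit link
  save_item
end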
My save_item method also has two selectors that are not always found within the info results. I would like the script to simply skip the ones that aren't found and save only the ones that are, but my current iteration throws Capybara::ElementNotFound and exits. How could I configure this to work the intended way? (A sketch of the behavior I'm after follows the code.)
Code below:
require "capybara/dsl"
require "spreadsheet"
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
  include Capybara::DSL

  def initialize
    @excel = Spreadsheet::Workbook.new
    @work_list = @excel.create_worksheet
    @row = 0
  end

  def go
    visit_main_link
  end

  def visit_main_link
    visit "http://www.some.com/clothing-accessories?dir=asc&limit=72&order=position"
    # I would like to use .uniq here to filter out the duplicates that I know
    # this selector will deliver
    results = all("//a[contains(@onclick, 'analyticsLog')]")
    item = []
    results.each do |a|
      item << a[:href]
    end
    item.each do |link|
      visit link
      save_item
    end
    @excel.write "inventory.csv"
  end

  def save_item
    data = all("//*[@id='content-wrapper']/div[2]/div/div")
    data.each do |info|
      @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
      @work_list[@row, 1] = info.find("//div[contains(@class, 'price font left')]").text
      @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
      @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
      # These two selectors will not always be found, depending on the item in
      # question; how do I ensure a miss doesn't crash the program?
      @work_list[@row, 4] = info.find("//select[contains(@name, 'options[747]')]//*[@price='0']").text
      @work_list[@row, 5] = info.find("//select[contains(@name, 'options[748]')]//*[@price='0']").text
      @row = @row + 1
    end
  end
end
tomtop = Tomtop.new
tomtop.go
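For the second problem, this is the skip behavior I'd like save_item to have (again just a sketch; the option_a/option_b locals are placeholder names, and I don't know whether taking .first from all, which I believe returns an empty result rather than raising, is cleaner than rescuing Capybara::ElementNotFound):

# Sketch: all() returns an empty result set instead of raising when nothing
# matches, so .first yields either the element or nil, and a nil simply
# leaves that cell blank.
option_a = info.all("//select[contains(@name, 'options[747]')]//*[@price='0']").first
@work_list[@row, 4] = option_a.text if option_a
option_b = info.all("//select[contains(@name, 'options[748]')]//*[@price='0']").first
@work_list[@row, 5] = option_b.text if option_b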