
I am trying to scrape a website and save the information, and I have two issues at the moment.

For one, when I am using Selenium to click buttons (in this case, a "load more results" button), it does not keep clicking until the end, and I can't seem to figure out why.

The other issue is that it is not saving to a CSV file in the parse_article function.

Here is my code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from selenium import webdriver
from selenium.webdriver.common.by import By
import csv


class ProductSpider(scrapy.Spider):
    name = "Southwestern"
    allowed_domains = ['www.reuters.com']
    start_urls = [
        'https://www.reuters.com/search/news?blob=National+Health+Investors%2c+Inc.']

    def __init__(self):
        self.driver = webdriver.Chrome()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            next = self.driver.find_element_by_class_name(
                "search-result-more-txt")
            # next = self.driver.find_element_by_xpath('//*[@id="content"]/section[2]/div/div[1]/div[4]/div/div[4]/div[1]')
            # maybe do it with this:
            # button2 = driver.find_element_by_xpath("//*[contains(text(), 'Super')]")
            try:
                next.click()
            # get the data and write it to scrapy items
            except:
                break

        SET_SELECTOR = '.search-result-content'
        for articles in self.driver.find_elements(By.CSS_SELECTOR, SET_SELECTOR):
            item = {}
            # get the date
            item["date"] = articles.find_element_by_css_selector('h5').text
            # title
            item["title"] = articles.find_element_by_css_selector('h3 a').text

            item["link"] = articles.find_element_by_css_selector(
                'a').get_attribute('href')

            print(item["link"])

            yield scrapy.Request(url=item["link"], callback=self.parse_article, meta={'item': item})
        self.driver.close()

    def parse_article(self, response):
        item = response.meta['item']

        texts = response.xpath(
            "//div[contains(@class, 'StandardArticleBody')]//text()").extract()
        if "National Health Investors" in texts:
            item = response.meta['item']
            row = [item["date"], item["title"], item["link"]]
            with open('Websites.csv', 'w') as outcsv:
                writer = csv.writer(outcsv)
                writer.writerow(row)
zamir
  • In your `parse_article` method you always rewrite your `Websites.csv` file; you should append to it instead. – Szabolcs Feb 07 '18 at 14:40

2 Answers

  1. Try to wait a bit after the click so that the new data has time to load. I suppose sometimes your script searches for the button before the new data and a new button have been displayed.

Try using an implicit or explicit wait:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# An implicit wait tells WebDriver to poll the DOM for a certain amount of time
# when trying to find any element (or elements) not immediately available.
self.driver.implicitly_wait(10)  # time in seconds

# An explicit wait is code you define to wait for a certain condition to occur
# before proceeding further in the code.
wait = WebDriverWait(self.driver, 10)
wait.until(EC.presence_of_element_located(
    (By.CLASS_NAME, "search-result-more-txt")))
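For example, a minimal sketch of your click loop built on this explicit wait (assuming the load-more button keeps its search-result-more-txt class):

from selenium.common.exceptions import TimeoutException

while True:
    try:
        # wait up to 10 seconds for the "load more" button to become clickable
        next_button = wait.until(EC.element_to_be_clickable(
            (By.CLASS_NAME, "search-result-more-txt")))
        next_button.click()
    except TimeoutException:
        # the button no longer appears, so all results have been loaded
        break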
  2. 'w' is for writing only (an existing file with the same name will be erased). Try the 'a' (append) mode instead, as sketched below. Though I would recommend using item pipelines for this (see the Scrapy documentation on Item Pipeline).
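A minimal sketch of the append variant (newline='' avoids blank rows in the CSV on Windows):

with open('Websites.csv', 'a', newline='') as outcsv:
    writer = csv.writer(outcsv)
    writer.writerow(row)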
Alex K.

The first issue looks like the button hasn't appeared yet when you try to click it. Waiting for it explicitly, as shown in the other answer, may aid you.

One more thing: try to close the driver when Scrapy is shutting down rather than at the end of parse. The spider's closed() hook can help you, as sketched below.
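A minimal sketch (Scrapy calls closed() on the spider once the crawl finishes):

    def closed(self, reason):
        # quit the browser when the spider shuts down
        self.driver.quit()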

The second issue looks like you are going to open and write the file many times, and that is not good, since you will be overwriting the existing contents. Even with the 'a' flag, e.g. open(FILE_NAME, 'a'), this is not good practice in Scrapy.

Try to create an Item, populate it, and then use the pipelines mechanism for saving items to a CSV file. Something like the sketch below.
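A minimal pipeline sketch (the class name is just an example; enable it via ITEM_PIPELINES in settings.py):

import csv

class CsvWriterPipeline:
    def open_spider(self, spider):
        # open the file once for the whole crawl instead of once per item
        self.file = open('Websites.csv', 'a', newline='')
        self.writer = csv.writer(self.file)

    def process_item(self, item, spider):
        self.writer.writerow([item["date"], item["title"], item["link"]])
        return item

    def close_spider(self, spider):
        self.file.close()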

Ilija