
I'm trying to scrape a dynamic website and I need Selenium.

The links that I want to scrape only open if I click on that specific element. They are opened by jQuery, so my only option is to click on them, because there is no href attribute or anything else that would give me a URL.

My approach is this:

# -*- coding: utf-8 -*-
import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "tableRepeat2"))
            )
        finally:
            html = driver.page_source
            response_obj = Selector(text=html)
            
            links = response_obj.xpath("//tbody[@id='tableRepeat2']")
            for link in links:
                driver.execute_script("arguments[0].click();", link)
                
                yield {
                    'Ocupatia': response_obj.xpath("//div[@id='print']/p/text()[1]")
                }

but it won't work.

On the line where I want to click on that element, I get this error:

TypeError: Object of type Selector is not JSON serializable

I kind of understand this error, but I have no idea how to solve it. I somehow need to transform that object from a Selector into a clickable element.

I checked online for solutions and also the docs, but I couldn't find anything useful.

Can anybody help me better understand this error and how I should fix it?

Thanks.

  • always put the full error message (starting at the word "Traceback") in the question (not a comment) as text (not a screenshot, not a link to an external portal). There is other useful information in it. – furas Sep 17 '21 at 16:37
  • did you try to do `link.click()` directly? – furas Sep 17 '21 at 16:39
  • maybe first you could use `print()` to see what exactly you have in the variables on the line that makes the problem. OR maybe you should use JavaScript directly to get your xpath and click the elements. – furas Sep 17 '21 at 16:40
  • I checked the HTML in DevTools in Firefox and I see an `event` assigned to `<tr>`, so maybe you should search for all `tr` instead of `tbody`, and maybe they will be clickable directly with Selenium's `.click()`. – furas Sep 17 '21 at 16:42
  • you should use `links = driver.find_elements_by_xpath(".../tr")` and then you can even use `link.click()`. You can't mix objects from Selenium with objects from Scrapy. – furas Sep 17 '21 at 17:04

2 Answers


Actually, the data is also generated from an API call's JSON response, so you can easily scrape it from the API. Here is a working solution along with pagination. Each page contains 8 items, with 32 items in total.

CODE:

import scrapy
import json

class AnofmSpider(scrapy.Spider):

    name = 'anofm'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit=8&localitate=',
            method='GET',
            callback=self.parse,
            meta={'limit': 8}
        )


    def parse(self, response):
        resp = json.loads(response.body)
        hits = resp.get('lmv').get('data')
        for h in hits:
            yield {
                'Ocupatia': h.get('OCCUPATION')
            }


        total_limit = resp.get('lmv').get('total')
        next_limit = response.meta['limit'] + 8
        if next_limit <= total_limit:
            yield scrapy.Request(
                url=f'https://www.anofm.ro/dmxConnect/api/oferte_bos/oferte_bos_query2L_Test.php?offset=8&cauta=&select=Covasna&limit={next_limit}&localitate=',
                method='GET',
                callback=self.parse,
                meta={'limit': next_limit}
            )
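
For reference, the parse callback above assumes the API returns JSON shaped roughly like this (the keys lmv, total, data, and OCCUPATION are taken from the code; the values here are made up for illustration):

# assumed response shape (keys from the code above; values illustrative)
resp = {
    'lmv': {
        'total': 32,                  # total number of items
        'data': [                     # up to 8 items per page
            {'OCCUPATION': '...'},
            # ...
        ],
    },
}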
Md. Fazlul Hoque
    Thanks a lot. It did work. I have to look into the code for a bit, but I didn't even know about this approach. It has definitely helped me. :) – Jonathan Simpson Sep 17 '21 at 22:06
  • Can you tell me how you found that API link? I'd like to know myself for future websites, but I have no idea how to find it. – Jonathan Simpson Sep 18 '21 at 09:16
  • Open the url (https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1), right-click on the page and a dialogue box will pop up; click on Inspect and the HTML DOM will appear. Then click on the NETWORK tab, click on FETCH/XHR and press ctrl+R; from the "Name" column you have to find the request, then right-click and click on Preview to see the data (see the sketch after these comments). Thanks – Md. Fazlul Hoque Sep 18 '21 at 09:46
  • Actually, it's a little bit complex; take more online help and one day you will see that this is a very efficient and easy way to scrape. Remember, most dynamic sites work this way, and a web scraping developer needs to know this approach to scrape data. – Md. Fazlul Hoque Sep 18 '21 at 09:54
  • I've found many APIs in the network tab just before you answered, and I saw it myself. Thanks for your answer anyway – Jonathan Simpson Sep 18 '21 at 10:28
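
Once you spot such a request in the Network tab, you can also check it outside the browser before writing the spider. A minimal sketch, assuming the `requests` library is installed (the URL is the endpoint from the answer above):

import requests

url = ('https://www.anofm.ro/dmxConnect/api/oferte_bos/'
       'oferte_bos_query2L_Test.php'
       '?offset=8&cauta=&select=Covasna&limit=8&localitate=')

data = requests.get(url).json()
print(data['lmv']['total'])        # total number of items
print(len(data['lmv']['data']))    # items returned on this page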

You mix Scrapy objects with Selenium functions, and this makes the problem. I don't know how to convert the objects, but I would simply use only Selenium for this:

        finally:

            links = driver.find_elements_by_xpath("//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
                # clicking directly doesn't work for me, even with scrollIntoView first:
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element_by_xpath(".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element_by_xpath('//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }
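
To illustrate the two splits above on a made-up value (the exact label text is an assumption):

# hypothetical paragraph text from the popup
text = 'Ocupatia: sudor\nAlte detalii: ...'
first_line = text.split('\n', 1)[0]           # 'Ocupatia: sudor'
value = first_line.split(':', 1)[1].strip()   # 'sudor'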

Full working code.

You can put it all in one file and run `python script.py` without creating a Scrapy project.

You have to change SELENIUM_DRIVER_EXECUTABLE_PATH to the correct path.

import scrapy

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
import time

class AnofmSpider(scrapy.Spider):
    name = 'anofm'
    
    def start_requests(self):
        yield SeleniumRequest(
            url='https://www.anofm.ro/lmvw.html?agentie=Covasna&categ=3&subcateg=1',
            #callback=self.parse
        )

    def parse(self, response):  
        driver = response.meta['driver'] 
        try:
            print("try")
            element = WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.XPATH, "//tbody[@id='tableRepeat2']/tr/td"))
            )
        finally:
            print("finally")

            links = driver.find_elements_by_xpath("//tbody[@id='tableRepeat2']/tr")
            print('len(links):', len(links))
            
            for link in links:
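                # clicking directly doesn't work for me, even with scrollIntoView first: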
                #driver.execute_script("arguments[0].scrollIntoView();", link)
                #link.click()
                
                # open information
                driver.execute_script("arguments[0].click();", link)
                
                # javascript may need some time to display it
                time.sleep(1)
                
                # get data
                ocupatia = driver.find_element_by_xpath(".//div[@id='print']/p").text
                ocupatia = ocupatia.split('\n', 1)[0]        # first line
                ocupatia = ocupatia.split(':', 1)[1].strip() # text after first `:`
                print('Ocupatia -->', ocupatia)

                # close information
                driver.find_element_by_xpath('//button[text()="Inchide"]').click()

                yield {
                    'Ocupatia': ocupatia
                }

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1

    'DOWNLOADER_MIDDLEWARES': {'scrapy_selenium.SeleniumMiddleware': 800},

    'SELENIUM_DRIVER_NAME': 'firefox',
    'SELENIUM_DRIVER_EXECUTABLE_PATH': '/home/furas/bin/geckodriver',
    'SELENIUM_DRIVER_ARGUMENTS': [], # ['-headless']
})
c.crawl(AnofmSpider)
c.start() 
furas
  • I accepted the other answer just because I felt like the solution was easier; yours works too. Thanks for your time, I will also look into your approach just to learn something new :) – Jonathan Simpson Sep 17 '21 at 22:11
  • If I had to write code for this page then I would also use the method with the API :) It's easier to get the data and it doesn't need Selenium, so it should run much faster. I always first check in DevTools in Firefox/Chrome if there are XHR requests, like @Fazlul described in the comments. So I also upvoted the other answer. – furas Sep 18 '21 at 12:25