
I am currently trying to crawl the Company Overview from alibaba.com.

For instance: https://www.alibaba.com/product-detail/T14-series-original-air-pro-TWS_1600273931389.html?spm=a2700.galleryofferlist.normal_offer.d_title.4aa778f2ahtuBx&s=p

For getting the information like company name I did:

response.xpath("//a[@class='company-name company-name-lite-vb']/text()").extract()

Which works fine.

When entering "Company Overview" > "Company Profile" and then trying to crawl information from the table with:

response.xpath("//div/div[@class='content-value']").extract()

I get an empty array.
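A quick way to check whether that table is rendered client-side (which would explain the empty array) is to inspect the raw HTML Scrapy actually receives, e.g. in `scrapy shell`:

scrapy shell "https://www.alibaba.com/product-detail/T14-series-original-air-pro-TWS_1600273931389.html"

# then, inside the shell:
"content-value" in response.text  # False would mean the div is injected by JavaScript
view(response)  # opens the HTML Scrapy received in a browser for comparison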

resources/search_results_searchpage.yml:

products:
    css: 'div[data-content="productItem"]'
    multiple: true
    type: Text
    children:
        link:
            css: a.elements-title-normal 
            type: Link

crawler.py:

import scrapy
import csv
#from scrapy_selenium import SeleniumRequest # only needed when using selenium
import os
from selectorlib import Extractor

class Spider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']
    link_extractor = Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__), "../resources/search_results_searchpage.yml"))

    def start_requests(self):
        search_text="Headphones"
        url="https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={0}&viewtype=G".format(search_text)

        yield scrapy.Request(url, callback = self.parse, meta = {"search_text": search_text})


    def parse(self, response):
        data = self.link_extractor.extract(response.text, base_url=response.url)
        for product in data['products']:
            parsed_url=product["link"]

            yield scrapy.Request(parsed_url, callback=self.crawl_mainpage)
            #yield SeleniumRequest(url=parsed_url, callback=self.crawl_mainpage)
    
    def crawl_mainpage(self, response):
        yield {
            'name': response.xpath("//h1[@class='module-pdp-title']/text()").extract(),
            'Year of Establishment': response.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
         }
        

Does anybody have an idea what I could do to populate Year of Est.? I tried to use scrapy_selenium and configured it correctly, because I suspect that the object is generated dynamically, but still no luck; possibly I am using it wrong.
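For reference, a sketch of how an explicit wait could be attached to the commented-out SeleniumRequest in parse() (the XPath for the profile tab is my assumption):

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# inside parse(), replacing the plain scrapy.Request: wait up to 10 seconds
# for the (assumed) Company Profile tab before the callback runs
yield SeleniumRequest(
    url=parsed_url,
    callback=self.crawl_mainpage,
    wait_time=10,
    wait_until=EC.presence_of_element_located((By.XPATH, "//span[@title='Company Profile']")),
)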

run with:

scrapy crawl alibaba_crawler -o out.csv -t csv   

1 Answer

Your XPath selector is not correct. Try this:

'Year of Est.': response.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
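This selector assumes the value sits three divs deep inside the cell following the "Year Established" label. A quick sanity check with a scrapy Selector (the sample HTML below is an assumption mimicking that structure, not the actual page source):

from scrapy import Selector

# hypothetical markup mimicking the table structure the XPath expects
sample = "<table><tr><td>Year Established</td><td><div><div><div>2010</div></div></div></td></tr></table>"
print(Selector(text=sample).xpath(
    "//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()"
).get())  # prints: 2010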

I also note some errors in your code, such as the line below, which will raise an error. You may want to recheck how you extract links from the search page.

data = self.link_extractor.extract(response.text, base_url=response.url)
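If the failure mode is that selectorlib returns `None` for a selector that matched nothing (an assumption on my part; iterating over `None` in the loop that follows would raise a TypeError), a guarded version of the loop would be:

for product in data.get("products") or []:
    parsed_url = product["link"]
    yield scrapy.Request(parsed_url, callback=self.crawl_mainpage)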

Edit: The year of establishment is loaded once the company tab is clicked. You have to simulate the click using Selenium or scrapy-playwright. My simple implementation using scrapy-playwright is below.

import scrapy
from scrapy.crawler import CrawlerProcess
import os
from selectorlib import Extractor
from scrapy_playwright.page import PageCoroutine


class Spider(scrapy.Spider):
    name = 'alibaba_crawler'
    allowed_domains = ['alibaba.com']
    start_urls = ['http://alibaba.com/']
    link_extractor = Extractor.from_yaml_file(os.path.join(os.path.dirname(__file__), "../resources/search_results_searchpage.yml"))

    def start_requests(self):
        search_text = "Headphones"
        url = "https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText={0}&viewtype=G".format(
            search_text)
        yield scrapy.Request(url, callback=self.parse, meta={"search_text": search_text})

    def parse(self, response):
        data = self.link_extractor.extract(
            response.text, base_url=response.url)
        for product in data['products']:
            parsed_url = product["link"]

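            # playwright=True routes this request through the Playwright browser;
            # the page coroutine clicks the "Company Profile" tab so the profile
            # table is rendered before Scrapy receives the response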
            yield scrapy.Request(
                parsed_url,
                callback=self.crawl_mainpage,
                meta={
                    "playwright": True,
                    "playwright_page_coroutines": {
                        "click": PageCoroutine("click", selector="//span[@title='Company Profile']"),
                    },
                },
            )

    def crawl_mainpage(self, response):
        yield {
            'name': response.xpath("//h1[@class='module-pdp-title']/text()").extract(),
            'Year of Establishment': response.xpath("//td[contains(text(), 'Year Established')]/following-sibling::td/div/div/div/text()").extract()
        }


if __name__ == "__main__":
    process = CrawlerProcess(settings={
        'DOWNLOAD_HANDLERS': {
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        'TWISTED_REACTOR' :"twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    })
    process.crawl(Spider)
    process.start()
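Note: this assumes scrapy-playwright and its browser binaries are installed (`pip install scrapy-playwright`, then `playwright install`). Newer releases of scrapy-playwright rename `PageCoroutine` to `PageMethod` and the meta key to `playwright_page_methods`, so the import may need adjusting depending on your version.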

Below is a sample log from running the scraper with `python crawler.py`. The year 2010 is shown in the output.

Scrapy log

msenior_
  • Hello @msenior, I built in the changes you suggested and also tried it in the `scrapy shell`, but it didn't work =/ – TheGoldBerg Oct 20 '21 at 13:16