4

I'm using Scrapy to scrape data from the Amazon books section, but I found out that some of the data is dynamic. I want to know how dynamic data can be extracted from the website. Here's what I've tried so far:

import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items =  AmazonsItem()
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

I was using SelectorGadget to find the class I need to scrape, but it doesn't work on a dynamic website.

  1. So how do I scrape a website which has dynamic content?
  2. What exactly is the difference between dynamic and static content?
  3. How do I extract other information, like price and image, from the website? And how do I find particular classes, for example the one for a price?
  4. How would I know that data is dynamically created?
Srikant Singh
  • Dynamic data is injected into the page, so you need something like Selenium to wait until the whole page is loaded and then apply your XPaths. Alternatively, you can "simulate" the page load: make the requests yourself to get the data, parse it, and put it all together. – Hugo Sousa Apr 16 '19 at 13:43

4 Answers

8

So how do I scrape a website which has dynamic content?

There are a few options:

  1. Use Selenium, which lets you drive a real browser, let the page render, and then pull the HTML source.
  2. Sometimes you can look at the XHR requests and fetch the data directly (for example, from an API).
  3. Sometimes the data is within the <script> tags of the HTML source. You can search through those and call json.loads() once you have trimmed the text down to valid JSON.
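
The third option can be sketched roughly as follows. This is only an illustration under assumptions: the sample HTML, the variable name window.__INITIAL_STATE__, and the helper extract_embedded_json are all made up, and a real page will name and structure its embedded data differently.

```python
import json
import re

# Hypothetical page source: the product list is embedded as JSON in a <script> tag.
SAMPLE_HTML = """
<html><body>
<div id="results"></div>
<script>
  window.__INITIAL_STATE__ = {"books": [{"title": "Murder on the Orient Express", "price": 199}]};
</script>
</body></html>
"""


def extract_embedded_json(html, variable_name):
    """Return the JSON value assigned to `variable_name` in an inline script, or None."""
    # Find the assignment, e.g. "window.__INITIAL_STATE__ = ", and stop right before the value.
    match = re.search(re.escape(variable_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the ";" and the closing </script>).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


state = extract_embedded_json(SAMPLE_HTML, "window.__INITIAL_STATE__")
for book in state["books"]:
    print(book["title"], book["price"])
```

Using raw_decode avoids having to cut the string at exactly the right closing brace, which is the fiddly part of "manipulating the text into a JSON format".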

what exactly is the difference between dynamic and static content?

Dynamic means the data is loaded by further requests made after the initial page request. Static means all the data is already there in the response to the original request to the site.

How do I extract other information like price and image from the website? and how to get particular classes for example like a price?

Refer to the answer to your first question.

how would I know that data is dynamically created?

You'll know it's dynamically created if you see it in the rendered page shown in the browser's dev tools, but not in the HTML source returned by your first request. You can also check whether the data is fetched by additional requests by looking at Network -> XHR in the dev tools.
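
As a rough illustration of that check, the sketch below takes the raw HTML of a first response and reports where a piece of text you can see in the browser actually lives. The helper locate_text, its return labels, and the three sample pages are assumptions made up for this example; it uses only Python's built-in html.parser.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Collects text inside <script> tags separately from ordinary markup text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.markup_text = []
        self.script_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        target = self.script_text if self._in_script else self.markup_text
        target.append(data)


def locate_text(raw_html, needle):
    """Say where `needle` appears in the HTML of the first response."""
    parser = _TextSplitter()
    parser.feed(raw_html)
    parser.close()
    if needle in "".join(parser.markup_text):
        return "static"              # plain selectors on the response will work
    if needle in "".join(parser.script_text):
        return "embedded in script"  # present, but only as data inside a <script>
    return "not in raw HTML"         # loaded later, e.g. by an XHR request


# Hypothetical first responses for three different kinds of page.
raw_static = '<html><body><span class="title">The ABC Murders</span></body></html>'
raw_script = '<html><body><div id="app"></div><script>var data = {"title": "The ABC Murders"};</script></body></html>'
raw_empty = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'

for page in (raw_static, raw_script, raw_empty):
    print(locate_text(page, "The ABC Murders"))
```

In practice the raw HTML would come from the same request your scraper makes (for example, response.text in a Scrapy callback), not from the browser's rendered view.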

Lastly

Amazon does offer an API to access the data. Try looking into that as well.

chitown88
0

If you want to load dynamic content, you will need to simulate a web browser. When you make an HTTP request, you will only get the text returned by that request, and nothing more. To simulate a web browser, and interact with data on the browser, use the selenium package for Python:

https://selenium-python.readthedocs.io/

joedeandev
0

So how do I scrape a website which has dynamic content?

Websites with dynamic content have their own APIs from which they pull their data. That data is not fixed; it may be different if you check it again after some time. That does not mean you can't scrape a dynamic website, though. You can use automated testing frameworks like Selenium or Puppeteer.

what exactly is the difference between dynamic and static content?

As explained in the answer to your first question, static data is fixed and stays the same, whereas dynamic data is updated periodically or changes asynchronously.

How do I extract other information like price and image from the website? and how to get particular classes for example like a price?

For that, you can use libraries like BeautifulSoup in Python and cheerio in Node.js. Their docs are quite easy to understand, and I highly recommend reading them thoroughly. You can also follow this tutorial.
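
To give a feel for what pulling a price and an image out of HTML looks like, here is a minimal sketch. Note the swap: it uses Python's built-in html.parser rather than BeautifulSoup so that it runs without extra installs. The sample markup and the class names price and cover are made up; on a real site you would look up the actual class names in the dev tools.

```python
from html.parser import HTMLParser

# Hypothetical product markup; real pages use different tags and class names.
SAMPLE_HTML = """
<div class="product">
  <img class="cover" src="https://example.com/covers/abc.jpg" alt="The ABC Murders">
  <span class="price">199</span>
</div>
"""


class ProductParser(HTMLParser):
    """Picks out the image URL and the price text by their class names."""

    def __init__(self):
        super().__init__()
        self.price = None
        self.image_url = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "img" and "cover" in classes:
            # The image URL lives in an attribute, not in the element's text.
            self.image_url = attrs.get("src")
        elif tag == "span" and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # The price is the text between <span class="price"> and </span>.
        if self._in_price and data.strip():
            self.price = data.strip()


parser = ProductParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.price, parser.image_url)
```

Libraries like BeautifulSoup reduce this to a couple of selector calls, but the idea is the same: text comes from inside an element, while images come from an attribute such as src.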

how would I know that data is dynamically created?

Open the Network tab in Chrome dev tools while reloading the page. You will see many API requests working behind the scenes to provide the data for the page you are trying to access. In that case, the website is dynamic.

0

So how do I scrape a website which has dynamic content?

To scrape dynamic content from a website, we have to let the web page load completely, so that the data has been injected into the page before we read it.

What exactly is the difference between dynamic and static content?

Content on static websites is fixed: it is not processed on the server and is returned directly from prebuilt source files. Dynamic websites generate their content by processing it on the server side at runtime. These sites can have different data every time you load the page, or whenever the data is updated.

How would I know that data is dynamically created?

You can open the dev tools and go to the Network tab. Once you refresh the page there, look for XHR requests or requests to APIs. If requests like those exist, the site is dynamic; otherwise it is static.
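
Once such a request shows up in the Network tab, it can often be replayed directly from a script. The sketch below is only an outline under assumptions: the endpoint URL, the query parameters k and page, and the results/title response layout are all hypothetical, so copy the real ones from the request you see in the dev tools.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace with the URL copied from the Network tab.
API_URL = "https://example.com/api/search"


def build_request(query, page=1):
    """Build the same GET request the page's JavaScript would send."""
    params = urlencode({"k": query, "page": page})
    return Request(f"{API_URL}?{params}", headers={"Accept": "application/json"})


def parse_titles(body):
    """Pull the titles out of a JSON response body (assumed layout)."""
    payload = json.loads(body)
    return [item["title"] for item in payload.get("results", [])]


def fetch_titles(query, page=1):
    """Send the request and parse the response; needs network access."""
    with urlopen(build_request(query, page), timeout=10) as response:
        return parse_titles(response.read().decode("utf-8"))


# Offline check with a sample body shaped like the assumed response.
sample_body = '{"results": [{"title": "And Then There Were None"}]}'
titles = parse_titles(sample_body)
print(build_request("agatha christie").full_url)
print(titles)
```

Some endpoints also expect cookies or extra headers that the browser sends, so the request may need more than the URL alone.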

How do I extract other information like price and image from the website? and how to get particular classes for example like a price?

To extract dynamic content from websites, we can use Selenium with Python, which is one of the best options:

  • Selenium is a browser automation framework. You can load the page and use a CSS selector to match the data on the page. The following is an example of how you can use it.
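import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6")
    # Crude fixed pause so the page's JavaScript can finish rendering
    time.sleep(4)
    # Selenium 4 locator API (replaces the older find_elements_by_css_selector)
    titles = driver.find_elements(
        By.CSS_SELECTOR, ".a-size-medium.a-color-base.a-text-normal")
    if titles:
        print(titles[0].text)
finally:
    # Always close the browser, even if something above fails
    driver.quit()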

In case you don't want to use Python, there are other open-source options like Puppeteer and Playwright, as well as complete scraping platforms such as Bright Data that have built-in capabilities to extract dynamic content automatically.

Gidoneli