
Let's say I want to scrape the data here.

I can do it nicely using urlopen and BeautifulSoup in Python 2.7.
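
For reference, my current code for the first page is roughly this (a simplified sketch; the exact selector is just an example of what worked for me):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/'

# fetch the first page and parse it
soup = BeautifulSoup(urlopen(url).read())

# print the book titles
for title in soup.select("div.zg_itemImmersion div.zg_title a"):
    print title.get_text(strip=True)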

Now I want to scrape the data from the second page, at this address.

What I get is the data from the first page! I looked at the page source of the second page with Chrome's "view page source", and the content belongs to the first page!

How can I scrape the data from the second page?

1 Answer


The page is quite asynchronous in nature: the search results are loaded by XHR requests, which you can simulate in your code using requests. Sample code as a starting point for you:

from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/#2'
ajax_url = "http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"

def get_books(data):
    # parse the returned HTML fragment and print the book titles
    soup = BeautifulSoup(data)

    for title in soup.select("div.zg_itemImmersion div.zg_title a"):
        print title.get_text(strip=True)


with requests.Session() as session:
    # visit the regular page first so the session picks up the necessary cookies
    session.get(url)

    # make the follow-up requests look like the browser's XHR calls
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'X-Requested-With': 'XMLHttpRequest'
    }

    for page in range(1, 10):
        print "Page #%d" % page

        # each page of results is delivered in two AJAX responses,
        # so request it once with the default parameters...
        params = {
            "_encoding": "UTF8",
            "pg": str(page),
            "ajax": "1"
        }
        response = session.get(ajax_url, params=params)
        get_books(response.content)

        # ...and once more with isAboveTheFold=0 for the remaining items
        params["isAboveTheFold"] = "0"
        response = session.get(ajax_url, params=params)
        get_books(response.content)

And don't forget to be a good web-scraping citizen and follow the Terms of Use.
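
As a rough, hypothetical sketch of what that can look like in practice (Python 2's robotparser plus a simple pause between requests; the page_url name is just for illustration):

import time
import robotparser  # urllib.robotparser in Python 3

# ask robots.txt whether the page may be fetched at all
rp = robotparser.RobotFileParser()
rp.set_url("http://www.amazon.com/robots.txt")
rp.read()

page_url = "http://www.amazon.com/Best-Sellers-Books-Architecture/zgbs/books/173508/"
if rp.can_fetch("*", page_url):
    print "Allowed to fetch %s" % page_url
else:
    print "robots.txt disallows fetching %s" % page_url

# and pause between requests so you don't hammer the server
time.sleep(1)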

  • Wow, great answer, and it works beautifully. Two questions: 1. How did you find out about the XHR requests? 2. Why did you use the address of the second page (.../#2) in `url` and `ajax_url` instead of the first page? – TJ1 May 14 '15 at 12:50
  • One more question: in the above `get_books` function, if I add a line to the for loop so that, in addition to printing each book title to the screen, it also writes it to a file, things get messed up and not all titles make it to the screen or the file. Is there something time-sensitive in this code? – TJ1 May 15 '15 at 12:30
  • @TJ1 1. I've used the browser developer tools (network tab). 2. It's probably a typo, though `ajax_url` is what matters - the data is loaded via ajax. – alecxe May 15 '15 at 12:44
  • @TJ1 3. I have no idea at the moment. It would be better to ask a separate question so that more people can help if you have difficulties. Thanks. – alecxe May 15 '15 at 12:45
  • Alecxe: can you please try this yourself? Just see what happens when you write the data to a file. – TJ1 May 15 '15 at 12:46
  • Also, what can I use as the title of the new question? Thanks so much for the help. – TJ1 May 15 '15 at 12:47
  • I have posted a question about this here: http://stackoverflow.com/questions/30260110/python-scrapping-a-website-and-printing-to-a-file-conflict I would appreciate it if you kindly took a look. – TJ1 May 15 '15 at 12:57