urlopen always retrieves the same webpage

Question

I am trying to parse webpages using urllib2, BeautifulSoup and Python 2.7.

The problem lies upstream: each time I try to retrieve a new webpage, I get the one I already retrieved. However, the pages are different in my web browser: see page 1 and page 2. Is there something wrong with the loop over page numbers?

Here is a code sample:

def main(page_number_max):
    import urllib2 as ul
    from BeautifulSoup import BeautifulSoup as bs

    base_url = 'http://www.senscritique.com/clement/collection/#page='

    for page_number in range(1, 1+page_number_max):
        url = base_url + str(page_number) + '/'
        html = ul.urlopen(url)
        bt = bs(html)

        for item in bt.findAll('div', 'c_listing-products-content xl'):
            item_name = item.findAll('h2', 'c_heading c_heading-5 c_bold')
            print str(item_name[0].contents[1]).split('\t')[11]

        print('End of page ' + str(page_number) + '\n')

if __name__ == '__main__':
    page_number_max = 2
    main(page_number_max)

You're setting the page with the hash parameter `page`, but that only works with JavaScript, and in your case I think you're using a curl-like library to load the pages. Look at which URLs the website uses to load page 1 or 2 via AJAX, etc. – AdrienBrault, Jul 08 '12 at 12:33

score 2 · Accepted Answer · answered Jul 08 '12 at 13:17

When you send an HTTP request, everything after the "#" character is dropped from the URL and never reaches the server. That part (the URL fragment) is only available to the browser.
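
A minimal sketch of this, using only the standard library: `urldefrag` splits a URL at "#", which is enough to see that the URLs built in the question's loop all reduce to the same request. `BASE_URL` is copied from the question; `request_url` is a hypothetical helper name used for illustration only. Nothing is fetched over the network.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2.7

# Base URL taken from the question.
BASE_URL = 'http://www.senscritique.com/clement/collection/#page='


def request_url(url):
    """Return `url` with its fragment (the part after '#') removed.

    Hypothetical helper: this is the URL the server is actually asked for.
    """
    without_fragment, _fragment = urldefrag(url)
    return without_fragment


# Build the same URLs as the loop in the question, then strip the fragment.
page_urls = [BASE_URL + str(n) + '/' for n in range(1, 3)]
requested = [request_url(u) for u in page_urls]

for original, sent in zip(page_urls, requested):
    print('%s -> %s' % (original, sent))
```

Both page URLs map to `http://www.senscritique.com/clement/collection/`, which is consistent with `urlopen` returning the same page on every iteration.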

If you open the developer tools in Chrome (or Firebug in Firefox), you will see that every time you change page on senscritique.com, a request is sent to the server. That is where the data you are looking for comes from.

I'm not going into detail about exactly what to send in order to retrieve data from this page, because I think it would not be consistent with their TOS.

Qrees

I remember I could use similar code in the past. In the meantime, I lost the code, and I believe the website was updated. Which raises the question: why did it work back then? – Wok Jul 08 '12 at 13:30
About the TOS: I just want to save my ratings in case the website goes down. – Wok Jul 08 '12 at 13:33
Thanks. I got it working thanks to your pointing out the "#" (hash) character. – Wok Jul 08 '12 at 13:39

score 1 · Answer 2 · answered Mar 26 '13 at 13:43

"#" introduces the anchor (fragment) used to identify and jump to a specific part of the document. The browser handles that part itself, so when you send the request, the whole web page is loaded and everything after "#" is ignored.

devsaw
