Python urlopen and httplib both are unable to return the actual html of the page

Question

I am trying to read information from this page: http://movie.douban.com/subject/20645098/comments

and use the following to find all the comment items.

comment_item = soup.find_all("div", {"id":"comment"})

However, nothing was returned, and I realized that the HTML my script reads differs from the HTML of the actual page. Below is what I have tried.

I first tried urlopen with BeautifulSoup:

comment_html = urlopen(section_url).read()
soup = BeautifulSoup(comment_html, "html.parser")

The HTML in soup is not the same as the actual HTML. Then I tried an httplib2 request, as follows:

h = httplib2.Http()
resp, content = h.request(section_url, "GET")
soup = BeautifulSoup(content, "html.parser")

And it is still the same. :(

You should add all the HTTP request headers your browser sends to your HTTP request in Python. That should solve the problem. – Amey Jadiye Oct 19 '15 at 17:47
What do you consider the "actual html"? If the site makes heavy use of JavaScript, the DOM can be completely different from the basic HTML that you get with a simple GET request. – rkrzr Oct 19 '15 at 17:50
@rkrzr I am looking for the main content that users would see on the webpage. For instance, I am unable to find the div with comment as its id in the returned HTML. – YAL Oct 19 '15 at 17:52
@AmeyJadiye: I am new to scraping data. Could you give an example of how to do it? I am unclear about what other HTTP requests I should be making. Thanks. – YAL Oct 19 '15 at 17:53
But why screen-scrape it? They [have an API](https://translate.googleusercontent.com/translate_c?depth=1&hl=en&rurl=translate.google.com&sl=zh-CN&tl=en&u=http://developers.douban.com/wiki/%3Ftitle%3Dmovie_v2) which includes movie comments and appears to be free to apply for, both for personal use at 40 requests per minute and for non-competing, non-commercial use. – TessellatingHeckler Oct 19 '15 at 18:13
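
The suggestion above about sending browser-like request headers can be sketched as follows. This is a minimal sketch using only the standard library, not a confirmed fix for this site: the names `BROWSER_HEADERS`, `build_request`, and `fetch_html` are hypothetical, and the header values are placeholders that you would replace with the ones your own browser sends.

```python
from urllib.request import Request, urlopen

# Hypothetical header values for illustration only. Copy the real ones from
# your browser's developer tools (Network tab) if the site requires them.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Return a urllib Request carrying the given headers (no network access)."""
    return Request(url, headers=dict(headers))


def fetch_html(url, timeout=10):
    """Fetch the page with the browser-like headers and return it as text."""
    with urlopen(build_request(url), timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling `fetch_html(section_url)` would then give a string that can be passed to `BeautifulSoup(..., "html.parser")` exactly as in the question.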

score 1 · Accepted Answer · answered Oct 19 '15 at 18:05

Here is a working example:

import requests
from bs4 import BeautifulSoup

url = 'http://movie.douban.com/subject/20645098/comments'
resp = requests.get(url)
b = BeautifulSoup(resp.text, 'html.parser')
# Each comment block is marked with class "comment", not id "comment"
comments = b.findAll('div', {'class': 'comment'})

print(comments)

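I used the requests library here, which I highly recommend, but it is not what fixes your problem. The problems with your code are the method name (find_all is the BeautifulSoup 4 spelling; BeautifulSoup 3 only has findAll, which BeautifulSoup 4 still accepts as an alias) and, more importantly, that the page marks each comment with a class, not an id, so you need to search by class.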
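
When it is unclear whether a page uses an id or a class, one quick check is to tally the id and class values on its div tags. The sketch below does that with the standard library's `HTMLParser` instead of BeautifulSoup. The names `DivAttrTally`, `tally_divs`, and `SAMPLE` are hypothetical, and the sample markup is made up for illustration rather than copied from the real page.

```python
from collections import Counter
from html.parser import HTMLParser


class DivAttrTally(HTMLParser):
    """Count the id and class values seen on <div> start tags."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids[value] += 1
            elif name == "class":
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def tally_divs(html):
    """Return (id counts, class counts) for all <div> tags in html."""
    parser = DivAttrTally()
    parser.feed(html)
    parser.close()
    return parser.ids, parser.classes


# Hypothetical sample markup, not copied from the real page.
SAMPLE = """
<div id="comments">
  <div class="comment-item"><div class="comment">first</div></div>
  <div class="comment-item"><div class="comment">second</div></div>
</div>
"""

ids, classes = tally_divs(SAMPLE)
print("ids:", dict(ids))
print("classes:", dict(classes))
```

In this sample no div has the id `comment`, while two divs have the class `comment`, which is the situation the answer describes: a search by id finds nothing, and a search by class finds the comments.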

rkrzr

THANK YOU GUYS SO MUCH! :D Wow Stackoverflow is the best, I didn't expect to get responses so fast! Thanks guys! – YAL Oct 19 '15 at 18:19
@rkrzr: For some reason your code works for the link in this post, but it doesn't work for other links like this one: http://movie.douban.com/subject/2303845/comments Any idea why that is? – YAL Oct 19 '15 at 18:34
@AmeyJadiye For some reason your code works for the link in this post, but it doesn't work for other links like this one: movie.douban.com/subject/2303845/comments Any idea why that is? – YAL Oct 19 '15 at 18:50
@YAL that link just redirects (302) back to http://movie.douban.com/. You probably want a different link. – rkrzr Oct 19 '15 at 18:51
Yeah, if you get a 302 you can just request `response.url` again to get the redirected content. – Amey Jadiye Oct 19 '15 at 18:54
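
Both urlopen and requests follow redirects by default, so a redirected fetch quietly returns the landing page's HTML instead of the page that was asked for. A minimal sketch for spotting that case with the standard library is shown here. The helper names `final_url` and `was_redirected` are hypothetical.

```python
from urllib.request import urlopen


def final_url(url, timeout=10):
    """Open url (following redirects) and return (status, final URL)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.geturl()


def was_redirected(requested, final):
    """True when the final URL differs from the one that was requested."""
    # Ignore a trailing slash so "http://host" and "http://host/" compare equal.
    return requested.rstrip("/") != final.rstrip("/")
```

A typical use would be `status, landed = final_url(section_url)` followed by `was_redirected(section_url, landed)` before parsing. With requests, `resp.url` and `resp.history` carry the same information.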
