
When I want to get the page using urllib2, I don't get the full page.

Here is the code in Python:

import urllib2
import urllib
import socket
from bs4 import BeautifulSoup
# give up on requests that take longer than 5 seconds
socket.setdefaulttimeout(5)

# getting the page
def get_page(url):
    """ loads a webpage into a string """
    src = ''

    req = urllib2.Request(url)

    try:
        response = urllib2.urlopen(req)
        src = response.read()
        response.close()
    except IOError:
        print "can't open", url

    return src

def write_to_file(soup):
    """I know that I should use try/except here."""
    # write to a file so you can check whether you got the full page
    f = open('output', 'w')
    f.write(str(soup))
    f.close()



if __name__ == "__main__":
    # this is the page that I'm trying to get
    url = 'http://www.imdb.com/title/tt0118799/'
    src = get_page(url)

    soup = BeautifulSoup(src)

    write_to_file(soup)    # open the file and see what you get
    print "end"

I have been struggling to find the problem the whole week! Why don't I get the full page?

Thanks for the help.

  • I strongly recommend using the fantastic [python-requests](http://docs.python-requests.org) library instead of urllib/urllib2; see the sketch after these comments. – Danilo Bargen Apr 11 '12 at 08:49
  • What do you mean by not getting the full page? What did you get? – Kien Truong Apr 11 '12 at 09:20
  • Do you "get the full page" if you write `src` to a file before feeding it into `BeautifulSoup`? If so, `BeautifulSoup` might be omitting parts of the HTML source in order to be able to parse it correctly. – Simon Apr 11 '12 at 09:59
  • @Simon you're right. How can I get the whole page despite using bs4? –  Apr 11 '12 at 12:16
  • Why are you using BeautifulSoup in the first place? Right now, your code just sticks the source in and immediately serializes it out again. That doesn't make much sense... – Simon Apr 11 '12 at 14:25
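
Following the first comment's suggestion, here is a minimal sketch of fetching the page with python-requests and writing the raw source to a file before any BeautifulSoup parsing, so you can check whether the truncation happens during the download or in the parser. The filename `raw_output.html` is an illustrative choice, not from the original code:

import requests  # pip install requests

def get_page(url):
    """ loads a webpage into a unicode string using requests """
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return response.text

if __name__ == "__main__":
    src = get_page('http://www.imdb.com/title/tt0118799/')
    # write the raw source before BeautifulSoup touches it
    with open('raw_output.html', 'w') as f:
        f.write(src.encode('utf-8'))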

2 Answers


You might have to call `read` multiple times, as long as it does not return an empty string indicating EOF:

def get_page(url):
    """ loads a webpage into a string """
    src = ''

    req = urllib2.Request(url)

    try:
        response = urllib2.urlopen(req)
        chunk = True
        while chunk:
            # keep reading 1024-byte chunks until read() returns ''
            chunk = response.read(1024)
            src += chunk
        response.close()
    except IOError:
        print "can't open", url

    return src
– mensi
  • @aminonsh does it change anything if you specify an explicit chunk size? (I modified my answer) – mensi Apr 11 '12 at 11:52
  • @aminonsh and you are 100% sure that src is incomplete before any of the Beautiful Soup parsing? Have you tried doing wget on the same URL and comparing the downloaded file with the contents of src? You should not compare with the source shown in your browser, since the site might do browser detection or modify the code with JavaScript – mensi Apr 11 '12 at 12:02
  • I was comparing it using my browser. Let me be clear: 1. if I look at the source from the browser, I can see the (Known For) tag 2. but after opening the data received from Python with gedit, I don't get the (Known For) tag 3. that means I'm not getting the full page!!! –  Apr 11 '12 at 12:06
  • So if you do `write_to_file(get_page(url))` and compare the resulting file, is it the first N bytes of the file you get with `wget URL` on the console? How big is N? – mensi Apr 11 '12 at 12:09
  • OK, my fault. bs4 is making problems. I get the whole response, but when I use `soup = BeautifulSoup(src)` I get half of the page!! Why? –  Apr 11 '12 at 12:13
  • @aminonsh Beautiful Soup tries to correct mistakes in the XML tree and might cut off stuff because it runs into errors. BTW, there is an API for IMDb if you just want to retrieve data about movies and such; there is even a Python binding – mensi Apr 11 '12 at 12:17
  • @aminonsh I think I tried out this one and it worked fine: [imdbpy](http://imdbpy.sourceforge.net/) – mensi Apr 11 '12 at 12:21
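
For completeness, here is a minimal sketch of the IMDbPy approach mentioned in the last two comments, assuming the package is installed (pip install imdbpy); the movie ID is the numeric part of the URL, without the `tt` prefix:

from imdb import IMDb  # pip install imdbpy

ia = IMDb()
# '0118799' is the numeric ID from http://www.imdb.com/title/tt0118799/
movie = ia.get_movie('0118799')
print movie['title']

This avoids scraping and parsing the HTML altogether.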

I had the same problem; I thought it was urllib, but it was bs4.

Instead of using

BeautifulSoup(src)

or

soup = bs4.BeautifulSoup(html, 'html.parser')

try using

soup = bs4.BeautifulSoup(html, 'html5lib')
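
A minimal sketch of how this looks in context, assuming the html5lib package is installed (pip install html5lib); html5lib parses the page the way a browser would, so it tends to cope better with imperfect markup than the default parser:

import urllib2
import bs4  # pip install beautifulsoup4 html5lib

src = urllib2.urlopen('http://www.imdb.com/title/tt0118799/').read()
soup = bs4.BeautifulSoup(src, 'html5lib')
print soup.title.get_text()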
– titusfx