
Hi, I'm using Python mechanize to scrape data from web pages. I'm trying to extract the imgurl values from a Google Images search results page so that I can download the result images.

Here's my code. It fills in the search form with 'dog' and submits it:

import mechanize
import cookielib

# Set up a browser that keeps cookies and follows redirects.
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
# Send browser-like request headers.
br.addheaders = [('User-agent', 'Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'), ('Accept', '*/*'), ('Accept-Language', 'ko-KR')]

# Open the image search page, fill in the query, and submit the form.
br.open('http://www.google.com/imghp?hl=en')
br.select_form(nr=0)
br.form['q'] = 'dog'
a = br.submit()
searched_url = br.geturl()

# Save the raw response body for inspection.
with open("1.html", "wb") as file0:
    file0.write(a.read())

When I view the page source in Chrome, it contains 'imgurl' entries, but the data I read with mechanize contains none. Also, 1.html (the file my script writes) is much smaller than the HTML file downloaded from Chrome. How can I get exactly the same HTML as a web browser from Python?

Do I have to send the same request headers as a web browser? Thanks.

p4vi4n
  • What you see in the "regular" page is loaded with JavaScript. Look at the AJAX requests sent out by the browser while browsing the page and you'll see how to get the images. – Blender Dec 28 '13 at 08:10
  • You could use Selenium to load the full page. – 4d4c Dec 28 '13 at 10:12
  • Thanks. I used Selenium and it works. – p4vi4n Dec 29 '13 at 10:24
  • I have the same problem, and using Selenium is not an option for me. Can someone point out how to scrape the page source after the JS does its magic? – gixxer Jul 19 '16 at 23:36
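
For anyone landing here, the Selenium route from the comments might look roughly like the sketch below. It is written for Python 3 rather than the Python 2 of the question, and it rests on assumptions: Firefox and geckodriver are installed, and the rendered page still embeds links carrying an `imgurl=` query parameter, as the question describes for 2013. Google's markup may differ today. The helper names `extract_imgurls` and `fetch_rendered_source` are made up for illustration.

```python
import re
from urllib.parse import unquote


def extract_imgurls(html):
    """Return the decoded value of every imgurl=... parameter found in html."""
    # Capture everything after 'imgurl=' up to the next '&', quote, angle
    # bracket, or whitespace. An HTML-escaped '&amp;' also starts with '&',
    # so the value ends at the right place either way.
    raw_values = re.findall(r'imgurl=([^&"\'\s<>]+)', html)
    return [unquote(value) for value in raw_values]


def fetch_rendered_source(url):
    """Load url in a real browser and return the HTML after JavaScript ran."""
    # Imported here so extract_imgurls() is usable without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox + geckodriver available
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

A call such as `extract_imgurls(fetch_rendered_source(searched_url))` would then give a list of candidate image URLs, provided the assumption about `imgurl=` links holds for the page you load. Note that `page_source` is read right after the initial load, so results added later by scrolling would not be included.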
