I opened the link "http://www.amazon.com/s?rh=n%3A1" with urllib2 and tried to fetch the next-page link (href="/s/ref=lp_1_pg_2?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1&page=2&ie=UTF8&qid=1376769633") available in the HTML text. However, read() keeps returning this part as (href="/s?rh=n%3A1&page=2"), which doesn't work. Is there any way to make read() return the link correctly?
0
-
Could you post your code please? – rlms Aug 17 '13 at 20:06
-
I just did this and printed the result: response = urllib2.urlopen(link); html = response.read(). I got the next-page link by looking at the source code of the page. – Findios Aug 17 '13 at 20:08
-
Do you want the next page URL? – mccakici Aug 17 '13 at 20:16
-
Yeah, the one in the source code associated with the "next page" button. When urllib reads it, it reads it wrong :/ – Findios Aug 17 '13 at 20:17
-
I couldn't find either href="/s/ref=lp_1_pg_2?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1&page=2&ie=UTF8&qid=1376769633" or href="/s?rh=n%3A1&page=2" in the HTML when I got it with urllib.request, or when I looked at the source with my browser. – rlms Aug 17 '13 at 20:24
2 Answers
2
This happens because your request doesn't send browser-like headers such as User-agent. I tried:
from mechanize import Browser
from bs4 import BeautifulSoup
browser = Browser()
html_page = browser.open("http://www.amazon.com/s?rh=n%3A1")
soup = BeautifulSoup(html_page)
link = soup.find("a", {"title" : "Next Page"})
print link
Output:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s?rh=n%3A1&page=2">
<span id="pagnNextString">Next Page</span>
<span class="srSprite pagnNextArrow"></span>
</a>
Then I added headers:
from mechanize import Browser
from bs4 import BeautifulSoup
browser = Browser()
# Adjacent string literals keep the spaces between the User-agent parts
browser.addheaders = [('User-agent',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) '
    'AppleWebKit/537.11 (KHTML, like Gecko) '
    'Chrome/23.0.1271.97 Safari/537.11')]
html_page = browser.open("http://www.amazon.com/s?rh=n%3A1")
soup = BeautifulSoup(html_page)
link = soup.find("a", {"title" : "Next Page"})
print link
Output:
<a title="Next Page" id="pagnNextLink" class="pagnNext" href="/s/ref=lp_1_pg_2/177-4872792-4084836?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1&page=2&ie=UTF8&qid=1376771097">
<span id="pagnNextString">Next Page</span>
<span class="srSprite pagnNextArrow"></span>
</a>
So just add the header information to your urllib2 request, like this:
from bs4 import BeautifulSoup
import urllib2
req = urllib2.Request("http://www.amazon.com/s?rh=n%3A1")
# Adjacent string literals keep the spaces between the User-agent parts
req.add_header('User-agent',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) '
    'AppleWebKit/537.11 (KHTML, like Gecko) '
    'Chrome/23.0.1271.97 Safari/537.11')
html_page = urllib2.urlopen(req)
if html_page.getcode() == 200:
    soup = BeautifulSoup(html_page)
    link = soup.find("a", {"title" : "Next Page"})
    print link['href']
else:
    print "Error loading page"
Output:
/s/ref=lp_1_pg_2/176-2670743-2970243?rh=n%3A283155%2Cn%3A%211000%2Cn%3A1&page=2&ie=UTF8&qid=1376771750
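For anyone on Python 3, where urllib2 no longer exists, the same idea can be sketched with only the standard library. This is a rough equivalent under stated assumptions, not a tested Amazon scraper: the names USER_AGENT, NextPageParser, build_request, find_next_href and fetch_next_href are made up for illustration, the User-agent string is simply the one used above, and the title="Next Page" attribute is taken from the output shown earlier, so it may not match the current page markup.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumption: any browser-like User-Agent will do; this one is copied from the answer.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.2; WOW64) "
              "AppleWebKit/537.11 (KHTML, like Gecko) "
              "Chrome/23.0.1271.97 Safari/537.11")


class NextPageParser(HTMLParser):
    """Remember the href of the first <a title="Next Page"> tag seen."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        is_next = tag == "a" and attributes.get("title") == "Next Page"
        if is_next and self.next_href is None:
            self.next_href = attributes.get("href")


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def find_next_href(html):
    """Return the next-page href found in html, or None if there is none."""
    parser = NextPageParser()
    parser.feed(html)
    return parser.next_href


def fetch_next_href(url):
    """Download url with the header set and return its next-page href."""
    with urlopen(build_request(url)) as response:
        html = response.read().decode("utf-8", errors="replace")
    return find_next_href(html)
```

Calling fetch_next_href("http://www.amazon.com/s?rh=n%3A1") would perform the download; find_next_href can be tried on a saved copy of the page without touching the network.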
0
Try this:
import urllib2
from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    """Collect href values of <a> tags that look like pagination links."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.href = []  # per-instance list, not shared between parsers

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value
                if name == "href" and value and 'page' in value and 'rh' in value:
                    self.href.append(value)


def _get(url):
    response = urllib2.urlopen(url)
    html = response.read()
    parser = MyHTMLParser()
    parser.feed(html.decode('utf-8'))
    print parser.href


_get('http://www.amazon.com/s?rh=n%3A1')
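The href values collected this way are relative to the site root, so they need to be joined with the page URL before they can be opened again. A minimal Python 3 sketch of that step follows; absolute_links and BASE_URL are hypothetical names chosen for illustration, and the sample href is the short one quoted in the question.

```python
from urllib.parse import urljoin

# Assumption: the page the links were scraped from is the URL in the question.
BASE_URL = "http://www.amazon.com/s?rh=n%3A1"


def absolute_links(base_url, hrefs):
    """Resolve each relative href against the URL of the page it came from."""
    return [urljoin(base_url, href) for href in hrefs]


print(absolute_links(BASE_URL, ["/s?rh=n%3A1&page=2"]))
# prints ['http://www.amazon.com/s?rh=n%3A1&page=2']
```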

mccakici