
EDIT: Yes, when I print the links in the second for loop, I see the correct links in the Python console.

I am trying to scrape links from a search engine and then follow those links to scrape additional data. However, my code will not follow the scraped links and returns "not found." What am I doing wrong?

import urllib.request
import bs4 as bs


#iterate the pages in the search result
for page in range(0,count, 100):
    source = urllib.request.urlopen('www.mywebsite.com').read()
    soup = bs.BeautifulSoup(source, 'lxml')

    #data to be captured
    url = ['www.mywebsite.com'+a.get('href') for a in soup.find_all('a', {'class' : ['result-title hdrlnk']})]
    postdate = [pd.get('title') for pd in soup.find_all('time', {'class' : ['result-date']})]
    price = [span.string for span in soup.find_all('span', {'class' : ['result-price']})]
    bedroom = [span.get_text(strip=True).strip() for span in soup.find_all('span', {'class' : ['housing']})]

#follow the links returned in the search result to scrape additional data

for link in url:  
    print(link) #this displays each link properly in the console
    source2 = urllib.request.urlopen(link).read()
    soup2 = bs.BeautifulSoup(source2,'lxml')
Martin Gergov
skellyboy
  • If you print `url` before the for loop, do you see correct URLs? Some unsolicited advice: if possible, try using [requests](http://docs.python-requests.org/en/master/) instead of urllib – Aditya Nov 21 '16 at 03:32
  • Related to @Aditya's suggestion, if you print `link` each time through your for loop, are you seeing the correct URLs? – elethan Nov 21 '16 at 03:35
  • ^ Second `requests`. The only thing that `urllib` is better for imo is the nifty `urlretrieve` method. – Pythonista Nov 21 '16 at 03:36
  • @Aditya Yes, the correct link prints in the Python console; however, it returns an HTTP "not found" error when I try to follow it with BS – skellyboy Nov 21 '16 at 03:36
  • Possible duplicate of http://stackoverflow.com/questions/12302304/urllib2-returns-404-for-a-website-which-displays-fine-in-browsers – Aditya Nov 21 '16 at 03:46
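
Following up on the comments above, here is a minimal sketch of two things that might be worth checking; it is not a confirmed fix. It assumes (a) the scraped `href` values may need to be resolved into absolute URLs that include a scheme, and (b) the site may reject urllib's default User-Agent, as the linked duplicate suggests. `BASE_URL`, `USER_AGENT`, `build_request`, `fetch` and the sample path are hypothetical names and values chosen for illustration, not taken from the question.

```python
import urllib.request
from urllib.parse import urljoin

# Hypothetical values for illustration only; not from the original post.
BASE_URL = "https://www.mywebsite.com/search"
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def build_request(base_url, href):
    """Resolve href against base_url and attach a browser-like User-Agent."""
    absolute = urljoin(base_url, href)  # absolute hrefs pass through unchanged
    return urllib.request.Request(absolute, headers={"User-Agent": USER_AGENT})


def fetch(base_url, href, timeout=10):
    """Download one page; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    request = build_request(base_url, href)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request(BASE_URL, "/apa/123.html")
print(req.full_url)  # https://www.mywebsite.com/apa/123.html
```

In the second loop of the question, `fetch(BASE_URL, href)` would stand in for `urllib.request.urlopen(link).read()`, with the result passed to BeautifulSoup as before. If the 404 persists with a browser-like header, the cause is probably something else.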

0 Answers