Suppose I am scraping this URL:

http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha

and it has a number of pages containing the data I want to scrape. How can I scrape the data from all of the following pages? I am using Python 3.5.1 and BeautifulSoup. Note: I can't use Scrapy or lxml, as they give me installation errors.

alecxe
Aman Kumar

1 Answer

Determine the last page by extracting the page argument from the link in the "Go to the last page" element, then loop over every page while maintaining a web-scraping session via requests.Session():

import re

import requests
from bs4 import BeautifulSoup


with requests.Session() as session:
    # extract the last page
    response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha")    
    soup = BeautifulSoup(response.content, "html.parser")
    last_page = int(re.search(r"page=(\d+)", soup.select_one("li.pager-last").a["href"]).group(1))

    # loop over every page; page indices appear to start at 0, so include last_page itself
    for page in range(last_page + 1):
        response = session.get("http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=%d" % page)
        soup = BeautifulSoup(response.content, "html.parser")

        # print the title of every search result
        for result in soup.select("li.search-result"):
            title = result.find("div", class_="title").get_text(strip=True)
            print(title)

Prints:

A C S College of Engineering, Bangalore
A1 Global Institute of Engineering and Technology, Prakasam
AAA College of Engineering and Technology, Thiruthangal
...
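
As a possible alternative to the regular expression, the page argument can be read from the link's query string using only the standard library. This is a minimal sketch, not part of the tested answer: the helper names `last_page_index` and `page_urls` and the sample href are hypothetical, and it assumes page indices start at 0, so the last index is included.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Listing URL from the question, without its query string.
BASE_URL = "http://www.engineering.careers360.com/colleges/list-of-engineering-colleges-in-India"


def last_page_index(href):
    """Return the integer value of the `page` query parameter in a pager href."""
    query = parse_qs(urlsplit(href).query)
    return int(query["page"][0])


def page_urls(last_page):
    """Yield one listing URL per page index, from 0 through last_page inclusive."""
    for page in range(last_page + 1):
        yield BASE_URL + "?" + urlencode({"sort_filter": "alpha", "page": page})


# Hypothetical href, shaped like the "Go to the last page" link.
example_href = "/colleges/list-of-engineering-colleges-in-India?sort_filter=alpha&page=2"
urls = list(page_urls(last_page_index(example_href)))
for url in urls:
    print(url)
```

Each generated URL could then be passed to session.get() in place of the string formatting in the loop.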
alecxe
  • Thanks, I have learned a lot from you. – Aman Kumar Mar 15 '16 at 16:12
  • Hi alecxe, many thanks for this great idea and example. I ran this in Atom on MX Linux and got back errors: ` Traceback (most recent call last): File "/tmp/atom_script_tempfiles/bb9dd230-6d13-11ea-905d-13b9ee9fe090", line 9, in engineering.careers NameError: name 'engineering' is not defined [Finished in 1.333s]` Any idea what is going on here? – zero Mar 23 '20 at 14:38