
I want to crawl all the table entries (the table listing S/No., Document No., etc.) from the following website and write them to Excel. So far, I am able to crawl the data from the first page (10 entries) only. Can anyone please help me with the Python code to crawl the data from the first to the last page of this website?

Website: https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId

My Python code:

from bs4 import BeautifulSoup
import mechanize
import pprint
import re
import csv

browser = mechanize.Browser()
browser.set_handle_robots(False)
url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'
response = browser.open(url)
html_doc = response.read()

rows_list = []
table_dict = {}

soup = BeautifulSoup(html_doc, 'html.parser')

table = soup.find("table", attrs={"width": "100%", "border": "0", "cellspacing": "2", "cellpadding": "3", "bgcolor": "#FFFFFF"})
tr_elements = table.find_all("tr", class_=re.compile(r'row_even|row_odd|header_subone'))

table_rows = []

for i in range(0, len(tr_elements)):
    tr_element = tr_elements[i]
    td_elements_in_tr_element = tr_element.find_all("td")
    rows_list.append([])

    for j in range(0, len(td_elements_in_tr_element)):
        td_element = td_elements_in_tr_element[j]
        table_elements_in_td_element = td_element.find_all("table")

        # Skip cells that only wrap nested layout tables; keep leaf cells.
        if len(table_elements_in_td_element) > 0:
            continue
        rows_list[i].append(td_element.text)

pprint.pprint(rows_list)

rows_list = [row for row in rows_list if row]  # drop all empty rows, not just the first

for row in rows_list:
    table_dict[row[0]] = {
        # 'S/No.': row[1],
        'Document No.': row[1] + row[2],
        'Tenders and Quotations': row[3] + row[4],
        'Publication Date': row[5],
        'Closing Date': row[6],
        'Status': row[7]
    }

pprint.pprint(table_dict)

with open('gebiz.csv', 'wb') as csvfile:
    csvwriter = csv.writer(csvfile, dialect='excel')

    for key in sorted(table_dict.iterkeys()):
        csvwriter.writerow([table_dict[key]['Document No.'], table_dict[key]['Tenders and Quotations'], table_dict[key]['Publication Date'], table_dict[key]['Closing Date'], table_dict[key]['Status']])

Every help from your side will be highly appreciated.

1 Answer


As I can see on this page, you need to interact with JavaScript that is triggered by the Go button or the Next Page button. For the Go button you would need to fill in the page-number textbox each time. You can use different approaches to work around this:

1) Selenium - Web Browser Automation (see the sketch after this list)

2) spynner - a programmatic web browsing module with AJAX support for Python; also take a look here

3) If you are familiar with C#, it also provides a WebBrowser component that lets you click on HTML elements (e.g. here). You can save the HTML content of each page and later crawl the saved pages offline.
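For option 1, here is a minimal sketch of the pagination loop with Selenium, reusing the row-extraction logic from the question. It assumes Firefox and a next-page control whose link text is 'Next Page'; that locator and the fixed sleep are guesses, so inspect the live page and adjust them before relying on this:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
import re
import time

url = 'https://www.gebiz.gov.sg/scripts/main.do?sourceLocation=openarea&select=tenderId'

driver = webdriver.Firefox()
driver.get(url)

rows_list = []
while True:
    # Re-parse the rendered HTML of the current page with BeautifulSoup.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    table = soup.find("table", attrs={"width": "100%", "border": "0",
                                      "cellspacing": "2", "cellpadding": "3",
                                      "bgcolor": "#FFFFFF"})
    for tr in table.find_all("tr", class_=re.compile(r'row_even|row_odd|header_subone')):
        # Keep only leaf cells, skipping cells that wrap nested layout tables.
        cells = [td.text for td in tr.find_all("td") if not td.find_all("table")]
        if cells:
            rows_list.append(cells)

    # Advance to the next page; stop when no such control exists (last page).
    # 'Next Page' is an assumed link text, not confirmed against the site.
    next_links = driver.find_elements(By.LINK_TEXT, 'Next Page')
    if not next_links:
        break
    next_links[0].click()
    time.sleep(2)  # crude wait; a WebDriverWait on the table would be more robust

driver.quit()

Once rows_list is filled across all pages, the dictionary-building and CSV-writing code from the question can run on it unchanged.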

Nima Soroush
  • Thank you very much Nima Soroush for providing all the references. Unfortunately, I have never had hands-on experience with JavaScript or C#. I am trying it out with spynner; in case of any queries, I will post a comment here again. Thank you! – user3538508 Nov 14 '14 at 03:59
  • You're welcome. Here you can find a useful example (http://jimmyromanticdevil.wordpress.com/2011/04/03/python-spynner-programmatic-web-browser-module-part-1/) – Nima Soroush Nov 14 '14 at 08:47