
I am trying to download a file from the website www.nsf.gov. In the browser, I first have to make a search request. Then I have to click the export option to download the file.

If I do it manually, I first have to paste the URL of the search request into the browser, and then paste the export URL. If I skip the first step, I get the following message:

Server Error

This server has encountered an internal error which prevents it from fulfilling your request. The most likely cause is a misconfiguration. Please ask the administrator to look for messages in the server's error log.

So I do the following programmatically using WebKit, but I still get this error:

urllib2.HTTPError: HTTP Error 500: Server Error.

Please help. I have been struggling with this for a week now. Here is my code:

#!/usr/bin/env python

import sys
import signal

from optparse import OptionParser
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import QWebPage

from bs4 import BeautifulSoup
import urllib2
import shutil
import urlparse
import os

class Crawler( QWebPage ):
    def __init__(self,url_name,file_name):
        QWebPage.__init__( self )
        self._url = url_name
        self._file = file_name

    def crawl( self ):
        signal.signal( signal.SIGINT, signal.SIG_DFL )
        self.connect( self, SIGNAL( 'loadFinished(bool)' ), self._finished_loading )
        self.mainFrame().load( QUrl( self._url ) )

    def _finished_loading( self, result ):
        file = open( self._file, 'w' )
        file.write( self.mainFrame().toHtml() )
        file.close()
        self.process( self.mainFrame().toHtml())
        file_download('http://www.nsf.gov/awardsearch/ExportResultServlet?exportType=txt','result.txt')
        sys.exit( 0 )

    def process(self,content):
        html_doc=content
        soup = BeautifulSoup(html_doc)
        soup=soup.prettify()


def main():
    url_name='http://www.nsf.gov/awardsearch/advancedSearchResult?PIId=&PIFirstName=&PILastName=&PIOrganization=&PIState=&PIZip=&PICountry=&ProgOrganization=&ProgEleCode=&BooleanElement=All&ProgRefCode=&BooleanRef=All&Program=&ProgOfficer=&Keyword=&AwardNumberOperator=Range&AwardNumberFrom=1&AwardNumberTo=20000&AwardAmount=&AwardInstrument=&ActiveAwards=true&OriginalAwardDateOperator=&StartDateOperator=&ExpDateOperator='
    file_name='NSF Award Search: Advanced Search Results1.html'
    app = QApplication( sys.argv )
    crawler = Crawler(url_name,file_name)
    crawler.crawl()
    sys.exit( app.exec_() )

def file_download(url, fileName):

    r = urllib2.urlopen(urllib2.Request(url))
    try:
        fileName = fileName
        with open(fileName, 'wb') as f:
            shutil.copyfileobj(r,f)
    finally:
        r.close()

if __name__ == '__main__':
    main()
Martijn Pieters
user2284140
  • 500 is a problem with the server, not your client. – Cairnarvon Apr 15 '13 at 21:31
  • well, your post says you can't download the file manually - and if you can't do it manually, you can't automate it either. maybe 'copypasting the export url' isn't a proper way to download from the web site. – thkang Apr 15 '13 at 21:37
  • I think the server is blocking me, because if I manually enter url [1] in the browser followed by url [2], it works. But if I clear the browser history and enter url [2] without first entering url [1], it does not work. How can I resolve this programmatically? – user2284140 Apr 15 '13 at 21:40
  • url[1]:http://www.nsf.gov/awardsearch/advancedSearchResult?PIId=&PIFirstName=&PILastName=&PIOrganization=&PIState=&PIZip=&PICountry=&ProgOrganization=&ProgEleCode=&BooleanElement=All&ProgRefCode=&BooleanRef=All&Program=&ProgOfficer=&Keyword=&AwardNumberOperator=Range&AwardNumberFrom=1&AwardNumberTo=20000&AwardAmount=&AwardInstrument=&ActiveAwards=true&OriginalAwardDateOperator=&StartDateOperator=&ExpDateOperator= url[2]:http://www.nsf.gov/awardsearch/ExportResultServlet?exportType=txt – user2284140 Apr 15 '13 at 21:42
  • Guessing that you are dealing with a stateful servlet, so you can only export search results from within a session that currently has results to export. That first URL doesn't actually work either, though, since none of the arguments are populated. – Silas Ray Apr 15 '13 at 21:47
  • What do you mean by 'none of the arguments are populated'? It is giving me one result to export. – user2284140 Apr 15 '13 at 21:52
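
If the session guess in the comments is right, one thing to note is that urllib2.urlopen in the question is a separate HTTP client from the QWebPage, so any cookie set while WebKit loads the search page is never sent with the export request. A minimal sketch of doing both requests through one cookie-aware opener follows. It assumes the server tracks the search through a session cookie, which is not confirmed for this site; make_session_opener and search_then_export are hypothetical names. It is written with the Python 3 module names; in the Python 2 of the question the same classes live in urllib2 and cookielib.

```python
import shutil
import urllib.request
from http.cookiejar import CookieJar


def make_session_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def search_then_export(opener, search_url, export_url, dest_path):
    """Request the search page first, then save the export to dest_path."""
    # Step 1: run the search so the server can attach results to the session.
    # Assumption: the session is identified by a cookie set in this response.
    with opener.open(search_url) as resp:
        resp.read()
    # Step 2: request the export through the same opener, so the cookies
    # collected in step 1 are sent along automatically.
    with opener.open(export_url) as resp, open(dest_path, 'wb') as out:
        shutil.copyfileobj(resp, out)
```

Usage would be `opener, jar = make_session_opener()` followed by `search_then_export(opener, url_1, url_2, 'result.txt')`, with url_1 and url_2 being the two URLs from the comment above. If the server keys the session on something other than cookies, this will still return a 500.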

1 Answer


You can do that, but you first need to set the 'queryText' variable; the search needs this string to find content.
You can test with this URL, where 'body' is the query string:

http://www.nsf.gov/awardsearch/ExportResultServlet?exportType=txt&queryText=body&ActiveAwards=true

Then pass this argument in your code to fetch the data; let's call it query:

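def _finished_loading( self, result ):
    # loadFinished(bool) delivers only `result`, so the query is read from an
    # attribute instead; self._query is an assumption and must be set in __init__.
    query = self._query
    URL = 'http://www.nsf.gov/awardsearch/ExportResultServlet?exportType=txt&queryText=' + query + '&ActiveAwards=true'
    file = open( self._file, 'w' )
    file.write( self.mainFrame().toHtml() )
    file.close()
    self.process( self.mainFrame().toHtml())
    file_download(URL,'result.txt')
    sys.exit( 0 )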
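
The URL above is assembled by string concatenation, so a query containing spaces or '&' would produce a malformed URL. A minimal sketch of building it with urlencode instead; build_export_url is a hypothetical helper, and the queryText parameter is taken from this answer rather than verified against the site. This uses the Python 3 import; in Python 2 urlencode lives in urllib.

```python
from urllib.parse import urlencode

EXPORT_BASE = 'http://www.nsf.gov/awardsearch/ExportResultServlet'


def build_export_url(query):
    """Return the export URL with the query text percent-encoded."""
    # A list of pairs keeps the parameters in the same order as the answer.
    params = [
        ('exportType', 'txt'),
        ('queryText', query),   # assumption: parameter name from this answer
        ('ActiveAwards', 'true'),
    ]
    return EXPORT_BASE + '?' + urlencode(params)


print(build_export_url('body'))
```

The result can be passed to file_download in place of the hand-built URL.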
chespinoza
  • Following your suggestion, I used this URL: http://www.nsf.gov/awardsearch/ExportResultServlet?exportType=txt&queryText=PIId=&PIFirstName=&PILastName=&PIOrganization=&PIState=&PIZip=&PICountry=&ProgOrganization=&ProgEleCode=&BooleanElement=All&ProgRefCode=&BooleanRef=All&Program=&ProgOfficer=&Keyword=&AwardNumberOperator=Range&AwardNumberFrom=1&AwardNumberTo=20000&AwardAmount=&AwardInstrument=&ActiveAwards=true&OriginalAwardDateOperator=&StartDateOperator=&ExpDateOperator=&ActiveAwards=true But I still get the same error... – user2284140 Apr 15 '13 at 22:33