I'm trying to scrape data from the http://portal.uspto.gov/EmployeeSearch/ website. I open the site in a browser, click the Search button in the Search by Organisation section, and inspect the request that is sent to the server.

When I post the same request from my program using the Python requests library, I don't get the result page I expect. Instead I get the same Search page back, with no employee data on it. I've tried several variants and nothing seems to work.

My question is: what URL should I use in my request, and do I need to specify headers or anything else? I have already tried copying the headers shown in the Firefox developer tools for that request.

Below is the code that sends the request:

import requests
from bs4 import BeautifulSoup

def scrape_employees():
    URL = 'http://portal.uspto.gov/EmployeeSearch/searchEm.do;jsessionid=98BC24BA630AA0AEB87F8109E2F95638.prod_portaljboss4_jvm1?action=displayResultPageByOrgShortNm&currentPage=1'

    response = requests.post(URL)

    site_data = response.content
    soup = BeautifulSoup(site_data, "html.parser")
    print(soup.prettify())


if __name__ == '__main__':
    scrape_employees()
narog
  • You should use an API instead of screen scraping if at all possible. The USPTO APIs are documented [here](https://developer.uspto.gov/api-catalog). – ThisSuitIsBlackNot Mar 05 '17 at 21:47
  • Thanks for the suggestion @ThisSuitIsBlackNot. Unfortunately I can't get all the data I need (names of employees) from USPTO's APIs... – narog Mar 06 '17 at 08:38
  • That's too bad. Anyway, it looks like you forgot to put `orgShortNm=foo` in the request body. – ThisSuitIsBlackNot Mar 07 '17 at 00:54
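To illustrate the last comment, here is a minimal sketch that builds the POST with a form body but does not send it, so the payload can be inspected offline. The URL is the one from the question with the session-specific `;jsessionid=...` part dropped (an assumption; the server may still require a valid session), and `foo` is only the placeholder from the comment, not a real organisation code.

```python
import requests

# URL from the question, minus the session-specific ;jsessionid=... segment
# (assumption: the server may still insist on a valid session).
URL = ('http://portal.uspto.gov/EmployeeSearch/searchEm.do'
       '?action=displayResultPageByOrgShortNm&currentPage=1')

# 'foo' is the placeholder from the comment, not a real organisation code.
form_data = {'orgShortNm': 'foo'}

# Prepare the request without sending it, to see what would go over the wire.
prepared = requests.Request('POST', URL, data=form_data).prepare()

print(prepared.body)                     # orgShortNm=foo
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```

Passing `data=` is what puts the field into the request body; the code in the question calls `requests.post(URL)` with no body at all.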

1 Answer

All the data you need is in the page's `form` tag.

`action` is the URL you send the POST request to.

Each `input` is a piece of data you need to post to the server, as a `{name: value}` pair.
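As a rough illustration of those two points, the sketch below parses a small, made-up form and pulls out the POST URL and the name/value pairs. The markup is hypothetical and only shaped like a search form; the field names `orgShortNm` and `currentPage` are borrowed from the comment and the URL in the question, not read from the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, only shaped like a search form; not copied from the site.
HTML = """
<form action="searchEm.do?action=displayResultPageByOrgShortNm" method="post">
  <input type="text" name="orgShortNm" value="">
  <input type="hidden" name="currentPage" value="1">
</form>
"""

soup = BeautifulSoup(HTML, 'html.parser')
form = soup.find('form')

# The action attribute, resolved against the page URL, is where the POST goes.
post_url = urljoin('http://portal.uspto.gov/EmployeeSearch/', form['action'])

# Each named input contributes one name/value pair to the POST body.
fields = {
    inp['name']: inp.get('value', '')
    for inp in form.find_all('input')
    if inp.get('name')
}

print(post_url)
print(fields)
```

The empty `orgShortNm` value is the field a user would normally type into, so that is the one to fill in before posting.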

import re
import urllib.parse

import bs4
import requests

def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup

def get_form(soup):
    form = soup.find(name='form', action=re.compile(r'OrgShortNm'))
    return form

def get_action(form, base_url):
    action = form['action']
    # action is a relative URL; convert it to an absolute URL
    abs_action = urllib.parse.urljoin(base_url, action)
    return abs_action

def get_form_data(form, org_code):
    data = {}
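    for inp in form('input'):
        name = inp.get('name')
        if not name:
            # inputs without a name are not submitted with the form
            continue
        # if the value is missing or empty, put the org_code in this field
        data[name] = inp.get('value') or org_code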

    return data

if __name__ == '__main__':
    url = 'http://portal.uspto.gov/EmployeeSearch/'
    soup = make_soup(url)
    form = get_form(soup)
    action = get_action(form, url)
    data = get_form_data(form, '1634')

    # POST the form data to the action URL
    r = requests.post(action, data=data)
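The URL in the question carries a `jsessionid`, which suggests the server tracks a session. If that session is also kept in cookies (an assumption, not something verified against the site), it may help to make the GET and the POST through one `requests.Session`. A minimal sketch, where `post_in_session` is a hypothetical helper name:

```python
import requests


def post_in_session(page_url, action_url, data, session=None):
    """GET the search page, then POST the form with the same session.

    Hypothetical helper: reusing one session means any cookies set by
    the first response are sent along with the POST.
    """
    if session is None:
        session = requests.Session()
    # First request: lets the server set its session cookies, if any.
    session.get(page_url, timeout=30)
    # Second request: same session, so those cookies go with the form data.
    return session.post(action_url, data=data, timeout=30)
```

With the names from the code above, it could be called as `r = post_in_session(url, action, data)`, and `r.text` would then hold whatever page the server returns.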
宏杰李