
I have a static .aspx URL that I am trying to scrape. All of my attempts return the raw HTML of the regular page instead of the data I am querying.

My understanding is the headers I am using (which I found from another post) are correct and generalizable:

import urllib.request
from bs4 import BeautifulSoup

headers = {
    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko)  Chrome/24.0.1312.57 Safari/537.17',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip,deflate,sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'
}

class MyOpener(urllib.request.FancyURLopener):
    version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'

myopener = MyOpener()
url = 'https://www.mytaxcollector.com/trSearch.aspx'
# first HTTP request without form data
f = myopener.open(url)
soup_dummy = BeautifulSoup(f,"html5lib")
# parse and retrieve two vital form values
viewstate = soup_dummy.select("#__VIEWSTATE")[0]['value']
viewstategen = soup_dummy.select("#__VIEWSTATEGENERATOR")[0]['value']

Trying to enter the form data causes nothing to happen:

formData = (
    ('__VIEWSTATE', viewstate),
    ('__VIEWSTATEGENERATOR', viewstategen),
    ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'),
    ('__EVENTTARGET', 'ct100$MainContent$calculate')
)

encodedFields =  urllib.parse.urlencode(formData)
# second HTTP request with form data
f = myopener.open(url, encodedFields)


soup = BeautifulSoup(f,"html5lib")
trans_emissions = soup.find("span", id="ctl00_MainContent_transEmissions")
print(trans_emissions.text)

This gives raw HTML almost exactly the same as the "soup_dummy" variable. What I want to see is the result of submitting the field ('ctl00_contentHolder_trSearchCharactersAPN', '631091430000'), which is the "parcel number" box.

I would really appreciate the help. If anything, linking me to a good post about HTML requests (one that not only explains but actually walks through scraping aspx) would be great.

Elliot Huebler
  • The parcel number `631091430000` that you have mentioned in your post can't produce any result in this [page](https://www.mytaxcollector.com/trSearch.aspx). Is there anything to do along with putting that parcel number to populate the result? – SIM Jul 10 '20 at 06:52
  • The parcel number `631091430000` that you mentioned is only 12 digits but should be 13 - according to the website. Can you provide a working number? – Gregor Jul 10 '20 at 12:18
  • @Gregor Here is an example, 0108301010000 thanks for the help – Elliot Huebler Jul 11 '20 at 01:47

1 Answer


To get the result for a parcel number, the form parameters need to differ from the ones you tried. You also have to send the POST request to https://www.mytaxcollector.com/trSearchProcess.aspx rather than to the search page itself.

Working code:

from urllib.request import Request, urlopen
from urllib.parse import urlencode
from bs4 import BeautifulSoup

url = 'https://www.mytaxcollector.com/trSearchProcess.aspx'

payload = {
    'hidRedirect': '',
    'hidGotoEstimate': '',
    'txtStreetNumber': '',
    'txtStreetName': '',
    'cboStreetTag': '(Any Street Tag)',
    'cboCommunity': '(Any City)',
    'txtParcelNumber': '0108301010000',  #your search term
    'txtPropertyID': '',
    'ctl00$contentHolder$cmdSearch': 'Search'
}

# urlopen() sends a POST when data is supplied; the body must be bytes
data = urlencode(payload).encode('ascii')
req = Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36')
res = urlopen(req)
soup = BeautifulSoup(res.read(), 'html.parser')
# each row of the result table becomes a list of its cell texts
for row in soup.select("table.propInfoTable tr"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    print(cells)
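
One way to find the field names and target URL that a form like this submits is to list each form's `action` and the names of its inputs from the page HTML. Below is a minimal sketch using only the standard library. The `FormFieldLister` class and the sample markup are hypothetical illustrations, not taken from the real page:

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect each form's action URL and the names of its fields."""

    def __init__(self):
        super().__init__()
        self.forms = []  # list of {"action": str, "fields": [str, ...]}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action", ""), "fields": []})
        elif tag in ("input", "select", "textarea") and self.forms:
            name = attrs.get("name")
            if name:
                # Attribute the field to the most recently opened form.
                self.forms[-1]["fields"].append(name)


# Hypothetical sample markup; in practice, feed the HTML of the search page.
sample_html = """
<form method="post" action="trSearchProcess.aspx">
  <input type="text" name="txtParcelNumber">
  <select name="cboCommunity"><option>(Any City)</option></select>
  <input type="submit" name="ctl00$contentHolder$cmdSearch" value="Search">
</form>
"""

lister = FormFieldLister()
lister.feed(sample_html)
for form in lister.forms:
    print(form["action"], form["fields"])
```

The printed names are the keys the payload should use. The browser's network tab shows the same information for a real submission.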
SIM
  • Can you elaborate on the "table.propInfoTable tr" part ? – Elliot Huebler Jul 11 '20 at 06:51
  • Check out this [documentation](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors) to get an idea how css selectors work. Thanks. – SIM Jul 11 '20 at 06:53
  • Thank you so much @SIM just one more follow up: I want to "click" on an item on that page to extract data from there. My thought was to add it as another item in the payload. The problem is that it has an href instead of a value Class = ctl00_menuHolder_trLeftNav_LeftNavMenuControl_1 href = https://www.mytaxcollector.com/trPropInfo_CurrentTaxes.aspx?enc=Wz1KLEMQssuno6MIxEhuMOXC6hNsgbw5yKt3JzUWaXqc0vvfzwYJ2QEU9STb5hMk – Elliot Huebler Jul 11 '20 at 23:53
  • Hi Elliot, create another post describing your current issue and drop here a link. I'll take a look. Thanks. – SIM Jul 12 '20 at 03:23
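
On the `"table.propInfoTable tr"` question raised in the comments: it is a CSS selector that matches every `<tr>` nested anywhere inside a `<table>` whose class is `propInfoTable`. A small sketch on made-up HTML follows; the table contents are invented for illustration and are not real output from the site:

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html = """
<table class="propInfoTable">
  <tr><td>Parcel Number</td><td>0108301010000</td></tr>
  <tr><td>Status</td><td> Example </td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
# "table.propInfoTable" = <table> with that class; " tr" = any <tr> descendant.
rows = [
    [cell.get_text(strip=True) for cell in row.select("td")]
    for row in soup.select("table.propInfoTable tr")
]
print(rows)
```

Rows from the second table are skipped because its class does not match, and `get_text(strip=True)` trims the whitespace around each cell's text.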