-1

Novice web scraper here:

I am trying to scrape the name and address from this website https://propertyinfo.knoxcountytn.gov/Datalets/Datalet.aspx?sIndex=1&idx=1. I have attempted the following code which only returns 'None' or an empty array if I replace find() with find_all(). I would like it to return the html of this particular section so I can extract the text and later add it to a csv file. If the link doesn't work, or take to you where I'm working, simply go to the knox county tn website > property search > select a property.

Much appreciation in advance!

from splinter import Browser 
import pandas as pd
from bs4 import BeautifulSoup as soup 
import requests
from webdriver_manager.chrome import ChromeDriverManager

owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find('td', class_='DataletData')
owner_elem

OR

# this being the tag and class of the whole section where the info is located
owner_soup = soup(html, 'html.parser')
owner_elem = owner_soup.find_all('div', class_='datalet_div_2')
owner_elem

OR when I try:

browser.find_by_css('td.DataletData')[15]

it returns:

<splinter.driver.webdriver.WebDriverElement at 0x11a763160> 

and I can't pull the html contents from that element.

HedgeHog
  • 22,146
  • 4
  • 14
  • 36

1 Answers1

0

There's a few issues I see, but it could be that you didn't include your code as you actually have it.

Splinter works on its own to get page data by letting you control a browser. You don't need BeautifulSoup or requests if you're using splinter. You use requests if you want the raw response without running any of the things that browsers do for you automatically.

One of these automatic things is redirects. The link you provided does not provide the HTML that you are seeing. This link just has a response header that redirects you to https://propertyinfo.knoxcountytn.gov/, which redirects you again to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, which redirects again to https://propertyinfo.knoxcountytn.gov/Search/Disclaimer.aspx?FromUrl=../search/commonsearch.aspx?mode=realprop

On this page you have to hit the 'agree' button to get redirected to https://propertyinfo.knoxcountytn.gov/search/commonsearch.aspx?mode=realprop, this time with these cookies set:

Cookie: ASP.NET_SessionId=phom3bvodsgfz2etah1wwwjk; DISCLAIMER=1

I'm assuming the session id is autogenerated, and the Disclaimer value just needs to be '1' for the server to know you agreed to their terms.

So you really have to study a page and understand what's going on to know how to do it on your own using just the requests and beautifulsoup libraries. Besides the redirects I mentioned, you still have to figure out what network request gives you that session id to manually add it to the cookie header you send on all future requests. You can avoid doing some requests, and so this way is a lot faster, but you do need to be able to follow along in the developer tools 'network' tab.

Postman is a good tool to help you set up requests yourself and see their result. Then you can bring all the set up from there into your code.

Diego Cuadros
  • 196
  • 1
  • 12