
I'm trying to make a web scraper for Zillow, and I have found a way to obtain the raw info from a Zillow search page. However, I am fairly new to Python and very new to web scraping, and I have no idea how to extract the necessary info from this. How should I go about doing this?

Portion of raw info here (from query Houston, Texas) https://paste.ee/p/QyJJG

Current code (it just scrapes info from the desired area and prints it to the console):

from bs4 import BeautifulSoup as soup
import numpy as np
import pandas as pd
import requests
import random

headers = {
    'authority': 'www.zillow.com',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36' + str(random.randint(1, 1000)), #randomint bypasses captcha
}


with requests.Session() as i:
    #user input
    city = input("City To Search In: ") + "/"

    #initializers
    page = 1
    end_page = 10
    url = ""
    url_list = []
    request = ""
    request_list = []

    while page <= end_page:
        url = "https://www.zillow.com/homes/for_sale/" + city + str(page) + "_p"
        url_list.append(url)
        page += 1

    for j in url_list: #can change j to "url" if you want, but I like "j" for simplicity
        request = i.get(j, headers=headers)
        request_list.append(request)

#another two initializers (separating this felt more organized to me, sorry if it looks messy)
rawInfo = ""
rawInfoList = []

for request in request_list:
    rawInfo = soup(request.content, "html.parser")
    rawInfoList.append(rawInfo)

print(rawInfoList)

I tried using find() and find_all(), but I was pretty confused because they always returned no results. I need the listing address, rent Zestimate, link, and Zestimate/price of the various listings in an area. All of the info I need is in the raw-info paste; I just need a way to take it all out and make it more readable. I'm trying to make something that takes all of these different pieces of information and adds them to a separate list for each type of info (price, rentZestimate, etc.).

Oliver
  • Shouldn't there be a forward slash between `city` and `str(page)`? Also, using an f-string would look a lot cleaner (IMO). For example, `url=f"https://www.zillow.com/homes/for_sale/{city}/{str(page)}_p"`. – Übermensch Mar 17 '23 at 02:46
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Blue Robin Mar 17 '23 at 04:12

1 Answer

I think the information you want is located at a different URL. For example, when I go to https://www.zillow.com/austin-tx/3_p/, it doesn't show the rent Zestimate or the Zestimate/price. For that information you have to click on the property card, which redirects you to another URL (e.g. https://www.zillow.com/homedetails/9005-Ipswich-Bay-Dr-Austin-TX-78747/80104660_zpid/). This is likely the reason the BS methods are not returning the expected output.

An arguably better way to get the info you want is to reverse engineer the website's requests. Using my browser's Dev Tools, under the Network tab, I found that the website sends a GET request to the URL used in the code below; the response is a JSON file that seems to contain all of the data you want (i.e. listing address, rent Zestimate, link, and Zestimate/price).
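For reference, the long searchQueryState parameter in that URL is just URL-encoded JSON, so you don't have to hand-edit the encoded string to change pages or map bounds. Here's a minimal sketch of building the same URL programmatically; the field values are copied from the hard-coded URL in the code further down, and the schema itself is Zillow's undocumented internal format, so treat the key names as assumptions that may change:

import json
from urllib.parse import urlencode

# Assumption: this dict mirrors the searchQueryState decoded from the
# hard-coded URL below; Zillow's schema is undocumented and may change.
search_query_state = {
    "pagination": {"currentPage": 2},
    "mapBounds": {"north": 30.728145092457208, "south": 29.85779787256072,
                  "east": -97.26683659374999, "west": -98.36546940624999},
    "regionSelection": [{"regionId": 10221, "regionType": 6}],
    "isMapVisible": True,
    "filterState": {"isAllHomes": {"value": True},
                    "sortSelection": {"value": "globalrelevanceex"}},
    "isListVisible": True,
}
wants = {"cat1": ["mapResults"]}

# urlencode() percent-encodes the JSON strings for us.
params = urlencode({
    "searchQueryState": json.dumps(search_query_state),
    "wants": json.dumps(wants),
    "requestId": 2,
})
url = "https://www.zillow.com/search/GetSearchPageState.htm?" + params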

Use the json module, instead of BS, to parse the JSON file for the data you want.

Here's the code:

import requests
import random

import json

def save_json_output(out_path, json_output):
    # save json file for easy access
    with open(out_path, mode='w') as file:
        file.write(json_output)

def get_json(url, request_headers):
    response = requests.get(url, headers=request_headers)

    # create json object
    json_obj = response.json()

    # create JSON string with 4-space indentation
    json_string = json.dumps(json_obj, indent=4)

    return json_string

if __name__ == "__main__":
    
    headers = {
        'authority': 'www.zillow.com',
        'accept': '*/*',
        'accept-language': 'en-US,en;q=0.9',
        'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36' + str(random.randint(1, 1000)), #randomint bypasses captcha
    }

    url = "https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%22currentPage%22%3A2%7D%2C%22mapBounds%22%3A%7B%22north%22%3A30.728145092457208%2C%22south%22%3A29.85779787256072%2C%22east%22%3A-97.26683659374999%2C%22west%22%3A-98.36546940624999%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A10221%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%2C%22sortSelection%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=2"

    path = "C:/Users/uber/RandomScripts/data/zillow_data.json"

    listings_data = get_json(url, headers)
    save_json_output(path, listings_data)
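
Once you have the JSON saved, pulling the fields out into the separate lists you described is straightforward with plain dict access. A minimal sketch is below; note that the key names (cat1, searchResults, mapResults, hdpData, homeInfo, rentZestimate, detailUrl) are my assumptions about the response schema, so open zillow_data.json and adjust them to match what you actually see:

import json

def extract_listings(json_path):
    # Load the file written by save_json_output() above.
    with open(json_path) as file:
        data = json.load(file)

    # Assumption: results live under cat1 -> searchResults -> mapResults;
    # verify against the saved JSON and adjust if the schema differs.
    results = data.get("cat1", {}).get("searchResults", {}).get("mapResults", [])

    addresses, prices, rent_zestimates, links = [], [], [], []
    for home in results:
        info = home.get("hdpData", {}).get("homeInfo", {})  # assumed nesting
        addresses.append(info.get("streetAddress"))
        prices.append(home.get("price"))
        rent_zestimates.append(info.get("rentZestimate"))
        # detailUrl appears to be site-relative, so prepend the domain.
        links.append("https://www.zillow.com" + home.get("detailUrl", ""))

    return addresses, prices, rent_zestimates, links

addresses, prices, rent_zestimates, links = extract_listings("C:/Users/uber/RandomScripts/data/zillow_data.json")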
Übermensch