
I have been unsuccessful in trying to gather data from Zillow.

Example:

url = https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy

I want to pull information like addresses, prices, zestimates, locations from all homes in LA.

I have tried HTML scraping using packages like BeautifulSoup, and I have also tried working with the JSON data. I'm almost positive that Zillow's API will not be helpful; my understanding is that the API is best suited to gathering information on a specific property.

I have been able to scrape information from other sites, but it seems that Zillow uses dynamic ids (they change on every refresh), making it more difficult to access that information.

UPDATE: I tried the code below but still get no results:

import requests
from bs4 import BeautifulSoup

url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'

page = requests.get(url)
data = page.content

soup = BeautifulSoup(data, 'html.parser')

for li in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
    try:
        # There are sponsored links in the list; you might need to
        # filter those out.
        # Better to check for None values, which we are not doing here.
        print(li.find('span', {'class': 'zsg-photo-card-price'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-info'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-address'}).text)
        print(li.find('span', {'class': 'zsg-photo-card-broker-name'}).text)
    except AttributeError:
        print('An error occurred')
Chris Unice

3 Answers


It's probably because you're not passing headers.

If you take a look at Chrome's network tab in developer tools, these are the headers that are passed by the browser:

:authority:www.zillow.com
:method:GET
:path:/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy
:scheme:https
accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
accept-encoding:gzip, deflate, br
accept-language:en-US,en;q=0.8
upgrade-insecure-requests:1
user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36

However, if you try sending all of them, it'll fail, because requests doesn't let you send headers beginning with a colon ':'.

I skipped those four colon-prefixed headers and used the other five in this script, and it worked. So try this:

from bs4 import BeautifulSoup
import requests

req_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    url = 'https://www.zillow.com/homes/for_sale/Los-Angeles-CA_rb/?fromHomePage=true&shouldFireSellPageImplicitClaimGA=false&fromHomePageTab=buy'
    r = s.get(url, headers=req_headers)

After that, you can use BeautifulSoup to extract the information you need:

soup = BeautifulSoup(r.content, 'lxml')
price = soup.find('span', {'class': 'zsg-photo-card-price'}).text
info = soup.find('span', {'class': 'zsg-photo-card-info'}).text
address = soup.find('span', {'itemprop': 'address'}).text

Here is a sample of data extracted from that page:

+--------------+-----------------------------------------------------------+
| $615,000     |  121 S Hope St APT 435 Los Angeles CA 90012               |
| $330,000     |  4859 Coldwater Canyon Ave APT 14A Sherman Oaks CA 91423  |
| $3,495,000   |  13446 Valley Vista Blvd Sherman Oaks CA 91423            |
| $1,199,000   |  6241 Crescent Park W UNIT 410 Los Angeles CA 90094       |
| $771,472+    |  Chase St. And Woodley Ave # HGS0YX North Hills CA 91343  |
| $369,000     |  8650 Gulana Ave UNIT L2179 Playa Del Rey CA 90293        |
| $595,000     |  6427 Klump Ave North Hollywood CA 91606                  |
+--------------+-----------------------------------------------------------+
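To get a table like that rather than just the first match, you can loop over every result card and skip any that are missing a field. A minimal sketch, assuming the same `zsg-photo-card-*` class names used in this thread (they reflect Zillow's markup at the time and may have changed since):

```python
from bs4 import BeautifulSoup

def parse_listings(html):
    """Extract (price, address) pairs from Zillow search-result cards.

    Cards missing a price or address (e.g. sponsored entries) are skipped.
    """
    soup = BeautifulSoup(html, 'html.parser')
    listings = []
    for card in soup.find_all('div', {'class': 'zsg-photo-card-caption'}):
        price = card.find('span', {'class': 'zsg-photo-card-price'})
        address = card.find('span', {'class': 'zsg-photo-card-address'})
        if price is None or address is None:
            continue  # sponsored or malformed card
        listings.append((price.text.strip(), address.text.strip()))
    return listings

# Quick check against a stub of the expected markup:
sample = '''
<div class="zsg-photo-card-caption">
  <span class="zsg-photo-card-price">$615,000</span>
  <span class="zsg-photo-card-address">121 S Hope St APT 435</span>
</div>
'''
print(parse_listings(sample))
```

Checking for `None` before calling `.text` avoids the blanket `except` from the question, which hides real errors.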

Scraping Zillow is actually not too difficult. The first thing to note is that it's a Next.js website, meaning we can parse JavaScript objects instead of HTML to scrape structured data.

I write all of this up in extensive detail in my blog post How to Scrape Zillow.com, but let's summarize the most important parts:

Scraping Properties

First, let's take a look at the property page itself and how to scrape its data. If we look at the property page's source, we can see that a __NEXT_DATA__ variable is present:

(Screenshot: the page source showing the __NEXT_DATA__ script tag.)

So we can extract this data with a simple CSS selector: script#__NEXT_DATA__
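As a sketch of that extraction with the selector above (the keys inside __NEXT_DATA__ are Zillow-internal and not guaranteed stable, so the stub below only illustrates the shape of a Next.js page):

```python
import json
from bs4 import BeautifulSoup

def extract_next_data(html):
    """Pull the embedded __NEXT_DATA__ JSON out of a Next.js page."""
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.select_one('script#__NEXT_DATA__')
    if script is None:
        return None  # not a Next.js page, or markup changed
    return json.loads(script.string)

# Stub demonstrating the shape of a Next.js document:
sample = ('<html><body><script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"pageProps": {}}}</script></body></html>')
data = extract_next_data(sample)
print(data['props'])
```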

Scraping Search

Now, to find the properties themselves, we can use a very similar technique:

First, we need to build our search URL, which looks something like https://www.zillow.com/homes/<QUERY>_rb/ where <QUERY> is a location like a zip code or city name, e.g. https://www.zillow.com/homes/New-Haven,-CT_rb/. Then, if we scrape the page, we can find the backend API parameters in the page body the same way we found __NEXT_DATA__ previously, this time by using a regex:

re.findall(r'"queryState":(\{.+}),\s*"filter', html_response.text)

Full scraper code is a bit out of scope for a Stack Overflow question, but by combining these two techniques we can scrape Zillow with very little actual code!
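Put together, the search-page step looks roughly like this. A sketch using the regex from the answer (a non-greedy variant, so it stops at the object boundary); the contents of queryState are an assumption about Zillow's embedded state and may change:

```python
import json
import re

def extract_query_state(page_text):
    """Recover the queryState JSON object embedded in the search page body."""
    match = re.search(r'"queryState":(\{.+?}),\s*"filter', page_text)
    if match is None:
        return None
    return json.loads(match.group(1))

# Stub of the embedded state described above (keys are illustrative only):
sample = ('{"searchPageState": {"queryState":'
          '{"pagination": {}, "mapBounds": {}}, "filterState": {}}}')
print(extract_query_state(sample))
```

Parsing the captured group with `json.loads` gives you the same parameters the site itself sends to its backend search API.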

Granitosaurus

You can try some paid tools like https://www.scraping-bot.io/how-to-scrape-real-estate-listings-on-zillow/

  1. find what you need via sitemap https://www.zillow.com/sitemap/catalog/sitemap.xml

  2. scrape data from urls in sitemap
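Step 1 amounts to parsing standard sitemap XML, which the stdlib handles. A sketch (the sitemap URL above is real; the document below is a stub in the standard sitemap-index format, with a made-up child URL):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def sitemap_urls(xml_text):
    """Return all <loc> URLs from a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.iter(SITEMAP_NS + 'loc')]

# Stub of a sitemap index (the child URL here is hypothetical):
sample = '''<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.zillow.com/sitemap/catalog/sitemap-01.xml</loc></sitemap>
</sitemapindex>'''
print(sitemap_urls(sample))
```

For step 2, you would fetch each URL from the list and apply one of the scraping approaches from the other answers.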

Ryabchenko Alexander