
Trying to scrape this website for 1) bedrooms, 2) price, and 3) size: https://www.rentfaster.ca/ab/calgary/rentals

I'm using Scrapy, but for some reason, when I try to pull even just the posting titles, the spider finishes without returning anything.

See code below:

import scrapy


class Spiderman(scrapy.Spider):
    name = 'Mr Spider'

    start_urls = ['https://www.rentfaster.ca/ab/calgary/rentals/?keywords=&cur_page=0&proximity_type=location-city&novacancy=0&city_id=1']

    def parse(self, response):
        # Select all listing titles on the page
        listings = response.xpath('//h4[@class=" listing-title"]')
        print(listings)
RageAgainstheMachine
  • First `print()` all the HTML from the response to see whether you get the expected element. The server may send different data if it thinks you are a bot. The page may also use JavaScript to put data on the page, and Scrapy doesn't run scripts, so you would have to use `Selenium` to control a web browser that runs the JavaScript. – furas Nov 25 '17 at 21:47
  • BTW: this page uses JavaScript to get the data as JSON and put it on the page; see https://www.rentfaster.ca/api/search.json?proximity_type=location-city&novacancy=0&city_id=1 – furas Nov 25 '17 at 21:50
  • @furas thanks for the JSON link. How would you go about scraping the data from this? I've never used Selenium; is this the path of least resistance? – RageAgainstheMachine Nov 25 '17 at 21:54
  • Python has a standard module `json` which can convert it into a dictionary, so you can easily get the data. Using my link to the JSON you don't need Selenium; you don't even have to load the HTML pages. – furas Nov 25 '17 at 22:01
  • I'm not sure, but maybe `Scrapy` can get this and automatically convert it into a dictionary. – furas Nov 25 '17 at 22:03
  • [Scraping a JSON response with Scrapy](https://stackoverflow.com/questions/18171835/scraping-a-json-response-with-scrapy) – furas Nov 25 '17 at 22:04

1 Answer


This page uses JavaScript to fetch its data as JSON from

https://www.rentfaster.ca/api/search.json?proximity_type=location-city&novacancy=0&city_id=1

so you can get all the data much more easily.

Here is a simple working example using `urllib.request` (instead of Scrapy):

import urllib.request
import json

city_id = 1

url = 'https://www.rentfaster.ca/api/search.json?proximity_type=location-city&novacancy=0&city_id=' + str(city_id)

# Download the JSON and parse it into a dictionary
r = urllib.request.urlopen(url)
data = json.loads(r.read())

# Each element of data['listings'] is one rental posting
print('title:',    data['listings'][0]['title'])
print('bedrooms:', data['listings'][0]['bedrooms'])
print('price:',    data['listings'][0]['price'])
print('size:',     data['listings'][0]['sq_feet'])

To see the first 10 listings:

for x in range(10):
    print('title:',    data['listings'][x]['title'])
    print('bedrooms:', data['listings'][x]['bedrooms'])
    print('price:',    data['listings'][x]['price'])
    print('size:',     data['listings'][x]['sq_feet'])

or to see all of them:

for item in data['listings']:
    print('title:',    item['title'])
    print('bedrooms:', item['bedrooms'])
    print('price:',    item['price'])
    print('size:',     item['sq_feet'])
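The endpoint returns only one page of results at a time; as noted in the comments below, the site's own URLs carry a `cur_page` parameter that can be appended to the JSON URL as well. A sketch of paginated fetching under that assumption (the stop condition of "empty listings array" is also an assumption, not documented behavior):

```python
import json
import urllib.request

BASE = ('https://www.rentfaster.ca/api/search.json'
        '?proximity_type=location-city&novacancy=0'
        '&city_id={city_id}&cur_page={page}')

def page_url(city_id, page):
    # Build the paginated endpoint URL; cur_page mirrors the
    # parameter visible in the site's own page URLs.
    return BASE.format(city_id=city_id, page=page)

def fetch_all(city_id, max_pages=50):
    # Fetch page after page until the server returns no listings
    # (or the safety cap max_pages is reached).
    all_listings = []
    for page in range(max_pages):
        with urllib.request.urlopen(page_url(city_id, page)) as r:
            data = json.loads(r.read())
        listings = data.get('listings') or []
        if not listings:
            break
        all_listings.extend(listings)
    return all_listings
```

Calling `fetch_all(1)` would then collect every page for Calgary; `data['total']` can be used to sanity-check the count.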

To see the available keys/fields:

print(data.keys())

print(data['listings'][0].keys())

Output:

dict_keys(['listings', 'query', 'total', 'total2'])

dict_keys(['ref_id', 'userId', 'id', 'title', 'price', 'type', 'sq_feet', 'availability', 'avdate', 'location', 'rented', 'thumb', 'thumb2', 'slide', 'link', 'latitude', 'longitude', 'marker', 'address', 'address_hidden', 'city', 'province', 'intro', 'community', 'quadrant', 'phone', 'email', 'status', 'bedrooms', 'baths', 'cats', 'dogs', 'utilities_included'])
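Since the question ultimately wants bedrooms, price, and size, the listings can be dumped straight to CSV. A minimal sketch (the field names match the keys shown above; `listings_to_csv` is just an illustrative helper, not part of any API):

```python
import csv

def listings_to_csv(listings, path):
    # Write one row per listing with the three fields the
    # question asks for, plus the title for reference.
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['title', 'bedrooms', 'price', 'sq_feet'])
        for item in listings:
            writer.writerow([item.get('title'),
                             item.get('bedrooms'),
                             item.get('price'),
                             item.get('sq_feet')])
```

Then `listings_to_csv(data['listings'], 'rentals.csv')` produces a file ready for a spreadsheet or pandas.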
furas
  • Thanks for that! But for some reason I only get around 1400 results instead of the 6500 that are listed on the site. Any idea why? Perhaps not all listings are available on the JSON page? There's no official API. Would this be possible to do by scraping the actual webpage: https://www.rentfaster.ca/ab/calgary/rentals/?beds=&baths=&type=&price_range_adv%5Bfrom%5D=null&price_range_adv%5Bto%5D=null – RageAgainstheMachine Nov 26 '17 at 13:37
  • On the site you have to go to the next page to get more. With the JSON it is the same: it gives data for the current page, not all the data. Check the URL after you go to the next page; there is a new parameter `cur_page=1`. Use it in the URL for the JSON data and you should get more data. BTW: print `data['total']`; it shows how many results were found. – furas Nov 26 '17 at 14:03
  • See the JSON data at https://www.rentfaster.ca/api/search.json?beds=&type=&price_range_adv[from]=null&price_range_adv[to]=null&proximity_type=location-city&novacancy=0&city_id=1 Compare the parameters in this URL with your URL to the webpage; they may use the same parameters, so it is easy to generate other URLs. I used `DevTools` in Chrome/Firefox to find this URL. DevTools is a built-in tool which displays all requests/URLs sent to the server. – furas Nov 26 '17 at 14:38
  • It works, thank you! I know this may belong in another question, but I'm trying to remove all rows that have letter characters in the "sq_feet" column. For some reason I get `File "pandas\_libs\parsers.pyx", line 1524, in pandas._libs.parsers._string_box_utf8 (pandas\_libs\parsers.c:23041) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 6: invalid start byte`. My code is: `df4 = df[~df['sq_feet'].astype(str).str.contains(['a','e','i','o','u','$','-','+','f'])]` – RageAgainstheMachine Nov 26 '17 at 14:45
  • It seems it uses a different encoding than `utf-8` and it can't recognize the char `0xb2`. You can check the encoding used by the webpage or search Google for which char has code `0xb2`. – furas Nov 26 '17 at 14:52
  • `0xb2` can be `²` in Latin-1 (see http://www.idautomation.com/product-support/ascii-chart-char-set.html), so the data may need to be decoded from Latin-1 into Unicode or UTF-8. – furas Nov 26 '17 at 14:58
  • Hi @furas, I'm trying to do the same thing, but I presume the rentfaster `search.json` file does not exist anymore, because I cannot find it in the Network tab of `DevTools`. Instead there is a `map.json` (https://www.rentfaster.ca/api/map.json?cities=calgary) that doesn't change after adding `cur_page`. Do you have any idea how I can extract more results? By default only 500 records are extracted. – mitra mirshafiee Apr 05 '22 at 10:18
  • @mitramirshafiee this answer is 5 years old and the page could have changed a few times, but when I click the link in my answer I still get some JSON data. Maybe your link gives all addresses in some region and you have to set a different region to get more, or maybe it uses different parameters to get the next data, i.e. `offset`, `start`, etc. – furas Apr 05 '22 at 12:52
  • @mitramirshafiee I displayed the page and scrolled the list, and it seems it shows only 500 results, so it seems there is no method to get the next 500. But if you scroll the map or change the zoom, you get results for a smaller area, and that area may have fewer than 500 results, so you can get all of them at once. So it would take requests for several smaller areas to get all the results for a bigger area. – furas Apr 05 '22 at 13:23
  • Thank you @furas! Your answer gave me a hint and I started querying on different types of residence (e.g., Apartment, House, Condo, etc.). Each gave me at most 500 unique results, and concatenating them produced what I had in mind. – mitra mirshafiee Apr 06 '22 at 05:35
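The encoding and filtering problem from the comments above can be handled in one pass: read the file with the encoding the data actually uses (`latin-1`, given the `0xb2` byte), then coerce `sq_feet` to numbers so every non-numeric row becomes NaN and can be dropped. A sketch with sample bytes standing in for the real CSV:

```python
import io
import pandas as pd

# Sample data standing in for the scraped CSV; byte 0xb2 is '²'
# in Latin-1, which is what tripped the utf-8 decoder above.
raw = 'sq_feet\n900\n1100 ft\u00b2\n-\n750\n'.encode('latin-1')

# Read with the correct encoding, then coerce to numbers;
# anything non-numeric becomes NaN and is dropped in one step.
df = pd.read_csv(io.BytesIO(raw), encoding='latin-1')
df['sq_feet_num'] = pd.to_numeric(df['sq_feet'], errors='coerce')
clean = df.dropna(subset=['sq_feet_num'])
print(clean['sq_feet_num'].tolist())  # only the purely numeric rows survive
```

This avoids the per-character `str.contains` filtering attempted in the comment (which also fails because `str.contains` takes a single pattern, not a list).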