
I am trying to scrape some real estate data from https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1. Calling fetch('https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1') returns the following error:

[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1> (failed 1 times): 429 Unknown Status

Does anyone know how to bypass this? I have tried playing with the settings in settings.py, but to no avail.

CountDOOKU

2 Answers


In such cases, the website is most probably blocking your requests because it is unable to identify you.

Using request headers and cookies: this is one of the anti-ban methods for scraping a website, sometimes called reverse engineering the request. Open your browser's developer tools, copy the request headers and cookies, and send them with your own request to the website.
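
Browser developer tools usually show the cookies as one long Cookie header string rather than as a Python dict. A minimal sketch of converting that string into the kind of dict used in the spider could look like this. The helper name parse_cookie_header is hypothetical, and the sample string only reuses three cookie names from this answer.

```python
def parse_cookie_header(raw):
    """Turn a raw 'Cookie:' header value copied from the browser into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        # Split on the first '=' only, because cookie values may contain '='.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Shortened sample; a real header copied from the browser is much longer.
raw_header = 'Country=US; split_audience=e; _sp_ses.2fe7=*'
cookies = parse_cookie_header(raw_header)
print(cookies)
```

The resulting dict can be passed as the cookies argument of scrapy.Request or requests.get.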

The code below worked for me (see the screenshot after the code). Let me know if you have any other doubts.

Happy Scraping :)

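import scrapy


class RealEstateSpider(scrapy.Spider):
    name = 'real_estate'
    allowed_domains = ['www.realestate.com.au']

    # Cookies copied from a real browser session via the developer tools.
    # They may expire, so capture fresh values before running the spider.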
    cookies = {
    'reauid': '547662688c4100005bfcc662d802000027a00900',
    'Country': 'US',
    'split_audience': 'e',
    'fullstory_audience_split': 'B',
    'pageview_counter.srs': '3',
    'AMCV_341225BE55BBF7E17F000101%40AdobeOrg': '-330454231%7CMCIDTS%7C19181%7CMCMID%7C40562913130153436483026945851729928270%7CMCAID%7CNONE%7CMCOPTOUT-1657305978s%7CNONE%7CMCAAMLH-1657903578%7C9%7CMCAAMB-1657903578%7Cj8Odv6LonN4r3an7LhD3WZrU1bUpAkFkkiY1ncBR96t2PTI%7CMCSYNCSOP%7C411-19188%7CvVersion%7C3.1.2',
    '_sp_id.2fe7': 'f04532c6-7eff-4510-9f46-a1637b4f9d9e.1657207905.2.1657298780.1657207922.ca64a257-4df2-4c86-836b-ee84f7623444',
    'mid': '16329776161705650988',
    '_gcl_au': '1.1.1838423196.1657207906',
    '_ga_F962Q8PWJ0': 'GS1.1.1657298779.3.1.1657298779.0',
    '_ga': 'GA1.3.122498511.1657207906',
    'DM_SitId1464': 'true',
    'DM_SitId1464SecId12708': 'true',
    's_ecid': 'MCMID%7C40562913130153436483026945851729928270',
    'AMCVS_341225BE55BBF7E17F000101%40AdobeOrg': '1',
    'External': '%2FAPPNEXUS%3D0%2FCASALE%3D0%2FOPENX%3D1c46c6ca-49f7-4e85-8e53-df51c5836f63%2FPUBMATIC%3D164D9B19-CF60-4BBC-967B-99F765370BA2%2FRUBICON%3DL5B6SW92-N-17X3%2FTRIPLELIFT%3D106805729842281714002%2F_EXP%3D1688834837%2F_exp%3D1688834838',
    'VT_LANG': 'language%3Den-US',
    'QSI_HistorySession': 'https%3A%2F%2Fwww.realestate.com.au%2Fsold%2Fin-brisbane%2B-%2Bgreater%2Bregion%2C%2Bqld%2Flist-1~1657207910104',
    'nol_fpid': '4ajjr7ztynccd67dremidougkq4761657207910|1657207910320|1657207926812|1657207927050',
    'cto_bundle': 'nLopul8lMkZndkdvTnFkOGEzSWdDbWZiTGdiYUtVJTJCb1lHUWM1RjdIJTJCTG9nWExBQzRPYzdJWjVTVnBoRmx0eU5zTzIlMkZjVXFIdXVaYUpjM3lONXFBU0Ezd3hiRWo3N0FaZWlZQ2lNS0NDbkpuNUNLcktNSHhVZnJBRkd3ZXluQ1ZiZWNXc0wyaGZnRlp3dGlhZWwlMkZMSGc4bUVxRjJnJTNEJTNE',
    '_fbp': 'fb.2.1657207911199.715859527',
    'QSI_SI_eUTxcS7Ex4BwMYt_intercept': 'true',
    'KP2_UIDz-ssn': '07zhEGcjTRPiwRZzYptXMg0Ec0xmc8b0h4liOLtc9xwkA86wmGHH0Gn9ee3rCatQ4nQ5wxZDfMr42anMhx6OiNSR2KkpJLTdDmofxTCS4KpuD9vdVZ3piWCctOREJyzGQSWDBMeXK8LWFtMbVEFRdpNcZOcgUzT98rHKUMAUXWro4x3bWjm7LdLMU4l',
    'KP2_UIDz': '07zhEGcjTRPiwRZzYptXMg0Ec0xmc8b0h4liOLtc9xwkA86wmGHH0Gn9ee3rCatQ4nQ5wxZDfMr42anMhx6OiNSR2KkpJLTdDmofxTCS4KpuD9vdVZ3piWCctOREJyzGQSWDBMeXK8LWFtMbVEFRdpNcZOcgUzT98rHKUMAUXWro4x3bWjm7LdLMU4l',
    '_sp_ses.2fe7': '*',
    'DM_SitIdT1464': 'true',
    'DM_SitId1464SecIdT12708': 'true',
    '_gid': 'GA1.3.437462869.1657298782',
    }

    headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'If-None-Match': 'W/"101189-/8ueETLeJ0u2liUpilB9lkVxr6w"',
    'Cache-Control': 'max-age=0',
    }

    def start_requests(self):
        # Send the browser-captured headers and cookies with the first request.
        yield scrapy.Request(
            'https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1',
            headers=self.headers,
            cookies=self.cookies,
        )

    def parse(self, response):
        # Print the page title to confirm that the real page was returned.
        page_title = response.css('title::text').get()
        print(page_title)





[Screenshot; image not available]

  • Also, you can take a look at this thread to understand more about the 429 status code: https://stackoverflow.com/questions/43630434/how-to-handle-a-429-too-many-requests-response-in-scrapy?noredirect=1&lq=1 – Neha Setia Nagpal Jul 08 '22 at 10:10
  • Hi Neha, thanks for the answer, but unfortunately it is still not working. Neither are the solutions in the link you provided. – CountDOOKU Jul 08 '22 at 16:40
  • Hey @CountDOOKU, read my current answer. I hope this works; if you have any other doubts, please feel free to ask :) – Neha Setia Nagpal Jul 09 '22 at 06:50
  • Hi Neha. Unfortunately, for some reason, it no longer works. I am getting ```2022-07-10 15:54:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 https://www.realestate.com.au/sold/property-house-townhouse-villa-in-fortitude+ valley,+qld+4006/list-1?activeSort=solddate&source=refinement>: HTTP status code is not handled or not allowed 2022-07-10 15:54:28 [scrapy.core.engine] INFO: Closing spider (finished)``` – CountDOOKU Jul 10 '22 at 05:57
  • As I mentioned earlier, websites are protected with anti-bot measures, so there are some best practices one must follow to bypass bans. Read this blog to understand more about [antibans](https://www.zyte.com/blog/how-to-scrape-the-web-without-getting-blocked/) – Neha Setia Nagpal Jul 11 '22 at 08:19
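
One common best practice of that kind is to slow the crawl down and let Scrapy back off on its own. The following is only a sketch of throttling settings: the setting names are standard Scrapy settings, but the dict name POLITE_SETTINGS and every value in it are assumptions to tune, and none of this is guaranteed to get past a site that blocks on something other than request rate.

```python
# Illustrative throttling settings; every value below is an assumption to tune.
POLITE_SETTINGS = {
    # Fetch one page at a time from the target domain.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    # Wait a few seconds between requests, with some random jitter.
    'DOWNLOAD_DELAY': 5,
    'RANDOMIZE_DOWNLOAD_DELAY': True,
    # Let AutoThrottle adapt the delay to how the server responds.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,
    'AUTOTHROTTLE_MAX_DELAY': 60,
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,
    # Retry a limited number of times on 429 and common server errors.
    'RETRY_TIMES': 3,
    'RETRY_HTTP_CODES': [429, 500, 502, 503, 504],
}

for name, value in POLITE_SETTINGS.items():
    print(f'{name} = {value!r}')
```

The dict can be assigned to the spider's custom_settings class attribute, or the same names can be set at module level in settings.py.
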

I don't believe the 429 error being returned has anything to do with actually sending too many requests, but it is certainly an anti-scraping measure. That said, I can get the data with requests:

import requests


cookies = {
    'reauid': '547662688c4100005bfcc662d802000027a00900',
    'Country': 'US',
    'split_audience': 'e',
    'fullstory_audience_split': 'B',
    'pageview_counter.srs': '3',
    'AMCV_341225BE55BBF7E17F000101%40AdobeOrg': '-330454231%7CMCIDTS%7C19181%7CMCMID%7C40562913130153436483026945851729928270%7CMCAID%7CNONE%7CMCOPTOUT-1657305978s%7CNONE%7CMCAAMLH-1657903578%7C9%7CMCAAMB-1657903578%7Cj8Odv6LonN4r3an7LhD3WZrU1bUpAkFkkiY1ncBR96t2PTI%7CMCSYNCSOP%7C411-19188%7CvVersion%7C3.1.2',
    '_sp_id.2fe7': 'f04532c6-7eff-4510-9f46-a1637b4f9d9e.1657207905.2.1657298780.1657207922.ca64a257-4df2-4c86-836b-ee84f7623444',
    'mid': '16329776161705650988',
    '_gcl_au': '1.1.1838423196.1657207906',
    '_ga_F962Q8PWJ0': 'GS1.1.1657298779.3.1.1657298779.0',
    '_ga': 'GA1.3.122498511.1657207906',
    'DM_SitId1464': 'true',
    'DM_SitId1464SecId12708': 'true',
    's_ecid': 'MCMID%7C40562913130153436483026945851729928270',
    'AMCVS_341225BE55BBF7E17F000101%40AdobeOrg': '1',
    'External': '%2FAPPNEXUS%3D0%2FCASALE%3D0%2FOPENX%3D1c46c6ca-49f7-4e85-8e53-df51c5836f63%2FPUBMATIC%3D164D9B19-CF60-4BBC-967B-99F765370BA2%2FRUBICON%3DL5B6SW92-N-17X3%2FTRIPLELIFT%3D106805729842281714002%2F_EXP%3D1688834837%2F_exp%3D1688834838',
    'VT_LANG': 'language%3Den-US',
    'QSI_HistorySession': 'https%3A%2F%2Fwww.realestate.com.au%2Fsold%2Fin-brisbane%2B-%2Bgreater%2Bregion%2C%2Bqld%2Flist-1~1657207910104',
    'nol_fpid': '4ajjr7ztynccd67dremidougkq4761657207910|1657207910320|1657207926812|1657207927050',
    'cto_bundle': 'nLopul8lMkZndkdvTnFkOGEzSWdDbWZiTGdiYUtVJTJCb1lHUWM1RjdIJTJCTG9nWExBQzRPYzdJWjVTVnBoRmx0eU5zTzIlMkZjVXFIdXVaYUpjM3lONXFBU0Ezd3hiRWo3N0FaZWlZQ2lNS0NDbkpuNUNLcktNSHhVZnJBRkd3ZXluQ1ZiZWNXc0wyaGZnRlp3dGlhZWwlMkZMSGc4bUVxRjJnJTNEJTNE',
    '_fbp': 'fb.2.1657207911199.715859527',
    'QSI_SI_eUTxcS7Ex4BwMYt_intercept': 'true',
    'KP2_UIDz-ssn': '07zhEGcjTRPiwRZzYptXMg0Ec0xmc8b0h4liOLtc9xwkA86wmGHH0Gn9ee3rCatQ4nQ5wxZDfMr42anMhx6OiNSR2KkpJLTdDmofxTCS4KpuD9vdVZ3piWCctOREJyzGQSWDBMeXK8LWFtMbVEFRdpNcZOcgUzT98rHKUMAUXWro4x3bWjm7LdLMU4l',
    'KP2_UIDz': '07zhEGcjTRPiwRZzYptXMg0Ec0xmc8b0h4liOLtc9xwkA86wmGHH0Gn9ee3rCatQ4nQ5wxZDfMr42anMhx6OiNSR2KkpJLTdDmofxTCS4KpuD9vdVZ3piWCctOREJyzGQSWDBMeXK8LWFtMbVEFRdpNcZOcgUzT98rHKUMAUXWro4x3bWjm7LdLMU4l',
    '_sp_ses.2fe7': '*',
    'DM_SitIdT1464': 'true',
    'DM_SitId1464SecIdT12708': 'true',
    '_gid': 'GA1.3.437462869.1657298782',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'If-None-Match': 'W/"101189-/8ueETLeJ0u2liUpilB9lkVxr6w"',
    'Cache-Control': 'max-age=0',
}

response = requests.get(
    'https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1',
    cookies=cookies,
    headers=headers,
)

output: '<!doctype html>\n<html lang="en-AU">\n<head>\n <meta charset="utf-8"/>\n <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n <meta name="viewport" content="width=device-width,initial-scale=1,minimum-scale=1">\n <meta name="format-detection" content="telephone=no">\n <title data-react-helmet="true">Sold Property Prices &amp; Auction Results in Brisbane - Greater Region, QLD - realestate.com.au</title> <link data-react-helmet="true" rel="canonical" href="https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1"/><link data-react-helmet="true" href="https://m.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-1" rel="alternate" media="only screen and (max-width: 640px)"/><link data-react-helmet="true" rel="next" href="https://www.realestate.com.au/sold/in-brisbane+-+greater+region,+qld/list-2"/> <meta data-react-helmet="true" name="description" content="282214 sold properties in Brisbane - Greater Region, QLD. View the latest property sold prices and auction results in Brisbane - Greater Region with realestate.com.au."/> <script data-react-helmet="true" type="application/ld+json">[{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Kangaroo Point","addressRegion":"QLD","postalCode":"4169","streetAddress":"14/10 Park Avenue"},"name":"14/10 Park Avenue"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Nundah","addressRegion":"QLD","postalCode":"4012","streetAddress":"3/38 Franklin Street"},"name":"3/38 Franklin Street"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Bracken Ridge","addressRegion":"QLD","postalCode":"4017","streetAddress":"22 Rinnicrew Street"},"name":"22 Rinnicrew 
Street"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Stafford","addressRegion":"QLD","postalCode":"4053","streetAddress":"8/66 Gamelin Crescent"},"name":"8/66 Gamelin Crescent"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Karana Downs","addressRegion":"QLD","postalCode":"4306","streetAddress":"6 Illawong Way"},"name":"6 Illawong Way"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Durack","addressRegion":"QLD","postalCode":"4077","streetAddress":"9/80 Cintra Street"},"name":"9/80 Cintra Street"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"South Brisbane","addressRegion":"QLD","postalCode":"4101","streetAddress":"10809/22 Merivale Street"},"name":"10809/22 Merivale Street"},{"@context":"http://schema.org","@type":"Residence","address":{"@type":"PostalAddress","addressLocality":"Wavell Heights","addressRegion":"QLD","postalCode":"4012","streetAddress":"4/7 Rode Road"},"name":"4/7 Rode Road"},{"@context":"http://schema.org","@type":"Residence","address":

That said, you may not need all of the header or cookie entries. My suggestion would be to work out which ones are actually required and send only those with your request.
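
One way to work that out is to drop one entry at a time and keep only the entries whose removal breaks the request. The sketch below shows the idea. The function name minimal_entries and the still_works callback are hypothetical, and the fake checker only pretends that two cookies are required; which cookies the site really needs is not known here.

```python
def minimal_entries(entries, still_works):
    """Return the subset of entries that still_works() cannot do without.

    still_works is a caller-supplied function that takes a dict and returns
    True if a request made with only those entries succeeds.
    """
    needed = dict(entries)
    for key in list(entries):
        # Try the request without this one entry.
        trial = {k: v for k, v in needed.items() if k != key}
        if still_works(trial):
            needed = trial  # The entry was not required, so drop it for good.
    return needed


# Demo with a fake checker (assumption: pretend only these two are required).
sample = {'reauid': 'a', 'Country': 'US', 'KP2_UIDz': 'b', '_ga': 'c'}
fake_check = lambda d: 'reauid' in d and 'KP2_UIDz' in d
print(minimal_entries(sample, fake_check))
```

With requests, still_works could wrap a call such as requests.get(url, cookies=trial, headers=headers).status_code == 200, ideally with a pause between calls so the probing itself does not trigger more 429 responses.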

1extralime
  • Thanks, what command did you use for the output? Also, this only gets the data, right? I still need to use BeautifulSoup to parse it? – CountDOOKU Jul 09 '22 at 05:27
  • I'm not sure I understand your question about the output. But yes, this is just the text of the HTTP response; you would need to parse it. – 1extralime Jul 09 '22 at 19:02
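
As a rough illustration of that parsing step: the response text shown in the answer contains a script tag of type application/ld+json holding the listing addresses. The sketch below pulls them out with the standard library's html.parser instead of BeautifulSoup. The names JsonLdExtractor, extract_addresses and SAMPLE_HTML are hypothetical, the sample is a shortened copy of the output above, and the live page's structure may differ.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script' and dict(attrs).get('type') == 'application/ld+json':
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'script' and self._in_json_ld:
            self.blocks.append(json.loads(''.join(self._chunks)))
            self._in_json_ld = False


def extract_addresses(html):
    """Return one dict per listing found in the page's JSON-LD data."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    addresses = []
    for block in parser.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            address = item.get('address') if isinstance(item, dict) else None
            if isinstance(address, dict):
                addresses.append({
                    'street': address.get('streetAddress'),
                    'suburb': address.get('addressLocality'),
                    'state': address.get('addressRegion'),
                    'postcode': address.get('postalCode'),
                })
    return addresses


# Shortened sample based on the output shown in the answer.
SAMPLE_HTML = (
    '<!doctype html><html lang="en-AU"><head>'
    '<script data-react-helmet="true" type="application/ld+json">'
    '[{"@context":"http://schema.org","@type":"Residence","address":'
    '{"@type":"PostalAddress","addressLocality":"Kangaroo Point",'
    '"addressRegion":"QLD","postalCode":"4169",'
    '"streetAddress":"14/10 Park Avenue"},"name":"14/10 Park Avenue"},'
    '{"@context":"http://schema.org","@type":"Residence","address":'
    '{"@type":"PostalAddress","addressLocality":"Nundah",'
    '"addressRegion":"QLD","postalCode":"4012",'
    '"streetAddress":"3/38 Franklin Street"},"name":"3/38 Franklin Street"}]'
    '</script></head><body></body></html>'
)

for listing in extract_addresses(SAMPLE_HTML):
    print(listing)
```

With the requests answer above, extract_addresses(response.text) would be the equivalent call.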