
So, I'm trying to scrape a website with infinite scrolling.

I'm following this tutorial on scraping infinite scrolling web pages: https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016

But the example given there is pretty simple: it's an orderly JSON object with the data you want.

I want to scrape this https://www.bahiablancapropiedades.com/buscar#/terrenos/venta/bahia-blanca/todos-los-barrios/rango-min=50.000,rango-max=350.000

The XHR response for each page is weird; it looks like corrupted HTML code in the Network tab.

I'm not sure how to navigate the items inside "view". I want the spider to enter each item and crawl some information from every one.

In the past I've successfully done this with normal pagination and rules guided by XPaths.

user3303019

2 Answers


https://www.bahiablancapropiedades.com/buscar/resultados/0

This is the XHR URL. While scrolling, the page loads 8 records per request. So get the total number of records with an XPath, divide that count by 8, and you get the number of XHR requests to make. I ran into the same issue, applied the logic below, and it resolved it.

pagination_count = response.xpath('...').get()  # XPath of the total result count shown on the page
pages = int(pagination_count) // 8              # 8 records per XHR request

for page in range(pages + 1):
    url = 'https://www.bahiablancapropiedades.com/buscar/resultados/' + str(page)

Pass each URL to your Scrapy callback function.
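
Putting it together, a minimal spider sketch under this answer's assumptions (the count XPath, the spider name and the callback names are hypothetical, not taken from the site):

import scrapy

class PropiedadesSpider(scrapy.Spider):
    name = 'propiedades'  # hypothetical name
    start_urls = ['https://www.bahiablancapropiedades.com/buscar']

    def parse(self, response):
        # Hypothetical XPath: point it at wherever the page renders the total result count
        total = int(response.xpath('//span[@class="results-count"]/text()').get())
        pages = total // 8  # the XHR endpoint serves 8 records per request

        for page in range(pages + 1):
            yield scrapy.Request(
                'https://www.bahiablancapropiedades.com/buscar/resultados/%d' % page,
                callback=self.parse_results,
            )

    def parse_results(self, response):
        # each response here is one batch of (up to) 8 records
        pass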

  • Thanks for the reply, but I don't get what you mean by 8 records per request. What are records? I want the spider to get into each item and crawl some information from each one. I know how to navigate by XPaths, but the information I want appears like corrupted HTML in the response, in the "view" part of the preview tab. – user3303019 Feb 24 '19 at 15:17

It is not corrupted HTML; it is escaped to prevent it from breaking the JSON. Some websites will return simple JSON data and others, like this one, will return the actual HTML to be added.
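
For illustration, here is a hypothetical, trimmed-down version of what such a response body looks like (the field name follows the site's "view" key; the payload itself is made up):

import json

# The HTML lives inside a JSON string, so its quotes are escaped with backslashes.
raw = '{"view": "<div class=\\"property\\"><a href=\\"/propiedad/123\\">Lote</a></div>"}'

data = json.loads(raw)  # json.loads() un-escapes the string
print(data['view'])
# <div class="property"><a href="/propiedad/123">Lote</a></div>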

To get the elements you need to get the HTML out of the JSON response and create your own parsel Selector (this is the same as when you use response.css(...)).

You can try the following in scrapy shell to get all the links in one of the "next" pages:

scrapy shell https://www.bahiablancapropiedades.com/buscar/resultados/3

import json
import parsel

json_data = json.loads(response.text)
sel = parsel.Selector(json_data['view'])  # 'view' contains the HTML
sel.css('a::attr(href)').getall()         # returns a list of every href attribute value
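
Inside a spider, the same idea could look like this (a sketch; the spider name, the page range and the item-page selectors are assumptions):

import json

import scrapy
from parsel import Selector

class PropiedadesSpider(scrapy.Spider):
    name = 'propiedades'  # hypothetical name
    start_urls = [
        'https://www.bahiablancapropiedades.com/buscar/resultados/%d' % page
        for page in range(4)  # adjust to the number of pages you need
    ]

    def parse(self, response):
        sel = Selector(json.loads(response.text)['view'])  # 'view' holds the HTML
        for href in sel.css('a::attr(href)').getall():
            # follow each listing and scrape its details
            yield response.follow(href, callback=self.parse_item)

    def parse_item(self, response):
        # Hypothetical selectors: adapt them to the actual item page markup
        yield {
            'url': response.url,
            'title': response.css('h1::text').get(),
        }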
malberts
  • Thanks for the reply. I'm a beginner with Python, and if it's okay with you I'd like to ask some further questions. First, would you explain the code you suggested? The first line looks like it gets all of the web content stored in 'view'. The second line is a bit confusing: I'm not sure what the getall() function does or where it stores the data I want. I will use the HTML to set rules for the spider to get into each item and, once there, scrape the information. How can I set up those rules? – user3303019 Apr 21 '19 at 14:50