I am trying to scrape the links on a webpage with infinite scrolling, but I am only able to fetch the links on the first pane. How do I proceed so as to build a complete list of all the links? Here is what I have so far:


from bs4 import BeautifulSoup
import requests

url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")

# Each listing card on the page
table = soup.find_all("div", {"class": "card-detail-block__data"})

# Collect the detail-page links; skip cards without one
y = []
for i in table:
    try:
        y.append(i.find("a", {"id": "linkToDetails"}).get('href'))
    except AttributeError:
        pass

# Prefix the relative hrefs with the domain
z = ['carwale.com' + item for item in y]
print(z)

2 Answers


You do not need BeautifulSoup to ninja the HTML DOM at all: the website populates its HTML from JSON responses, so Requests alone can do the job. If you monitor the "Network" tab in the Chrome or Firefox developer tools, you will see that on each scroll load the browser sends a GET request to an API. Using that API we can get clean JSON data out directly.
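
As a quick sanity check (a minimal sketch using the endpoint and params captured from the Network tab, as shown in the full script further down), you can hit the API directly and confirm it returns JSON:

import requests

API = 'https://www.carwale.com/webapi/classified/stockfilters/'
r = requests.get(API,
                 params={'sc': -1, 'so': -1, 'car': 7, 'pn': 1},
                 headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()
print(list(r.json()))  # top-level keys; 'ResultData' holds the listings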

Disclaimer: I have not checked whether this site allows web scraping. Do double-check their terms of use; I am assuming that you did.

I used Pandas to help deal with the tabular data and to export it to CSV (or whatever format you prefer): pip install pandas

import pandas as pd
from requests import Session

# Use a Session so the headers persist across requests
req = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36',
           'Content-Type': 'application/json;charset=UTF-8'}
# Attach the headers to every request made through the session
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Watching the Network tab, the params change with each load:
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# To fetch further pages, wrap the request in a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return  pd.DataFrame(data['ResultData'])


# Just the next 5 pages :)
frames = [df]
for _ in range(5):
    params['pn'] += 1
    params['lcr'] += 24  # lcr grows by 24 per page (see the captured params above)

    frames.append(scrap_carwale(params))

# combine all pages into one DataFrame
df = pd.concat(frames, ignore_index=True)

# print a sample of the data
print(df.sample(10))

# Save data to csv or whatever format
df.to_csv('my_data.csv') #see df.to_?
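
If the goal is specifically the detail-page links from the question, they should now be sitting in a column of the DataFrame. A hedged sketch: the column name 'url' below is an assumption, so inspect df.columns for the real field name:

# Hypothetical: pull the listing links out of the DataFrame.
# 'url' is an assumed column name; check df.columns for the actual one.
links = ('https://www.carwale.com' + df['url'].astype(str)).tolist()
print(links[:5])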

This is the Network tab: [screenshot]

Response: [screenshot]

Sample of the results: [screenshot]

  • In the filters section, we can filter cars by brand. The specific models of a particular brand are loaded at runtime, i.e. after I have selected that brand. I am able to extract the carID for a particular brand, but how can the same be done for a particular model? I need the carfilterID under the li class "us-sprite rootLi", e.g. Beat (296). – Anant Gupta Feb 17 '20 at 06:50
  • In the same way as I did above: just select a particular brand and watch the Network GET requests. You will see a change in the params; for BMW, for example, the only param that adds the filter is `car=1+7`. So any filtering you do on the front end is just added to the params (see the sketch after these comments). – Prayson W. Daniel Feb 17 '20 at 07:24
  • With your approach, the rootID column captures the model. I want the list of all distinct models of a particular brand, but the Carwale website stops serving new cars after 9 page loads, so a particular model might never be displayed even after 9 loads. – Anant Gupta Feb 17 '20 at 08:39
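
A minimal sketch of the brand-filter tip from the comments above (assuming the car=1+7 value observed for BMW; the query string is appended verbatim because requests would otherwise percent-encode the '+'):

from requests import Session

req = Session()
req.headers.update({'User-Agent': 'Mozilla/5.0'})
BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# 'car=1+7' is the brand filter observed for BMW in the Network tab
r = req.get(BASE_URL + '?sc=-1&so=-1&car=1+7&pn=1')
if r.ok:
    print(len(r.json()['ResultData']))  # listings for the filtered brand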