
I want to scrape a site with Python and BeautifulSoup, but I can't find the page number and can only scrape the first page. I think this site uses AJAX, because when I change the page the URL doesn't change.

It's the link of the site:

https://ihome.ir/sell-residential-apartment/th-tehran

And this is my code. I want to scrape 20 pages of this site and collect the houses with their details, like prices, foundation, etc.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://ihome.ir/sell-residential-apartment/th-tehran")


soup = BeautifulSoup(response.text, "html.parser")
prices = soup.select('.sell-value')
titles = soup.select('.title')

homes_prices = []
for home in prices:
    homes_prices.append(int(''.join(filter(str.isdigit, home.getText()))))


homes_titles = []
for title in titles:
    homes_titles.append(title.getText())

res = dict(zip(homes_titles, homes_prices))

for key, value in res.items():
    p = str(res[key])
    if len(str(res[key])) <= 2:
        p += '000000000'
    if len(str(res[key])) > 2:
        p += '000000'

    print(key.strip(), int(p))
Ariaban
  • here is it [check](https://scorpion.ihome.ir/v1/flatted-properties?is_sale=1&source=website&paginate=24&page=2&locations[]=iran.th.tehran&property_type[]=residential-apartment) – αԋɱҽԃ αмєяιcαη Apr 12 '20 at 08:48
  • @αԋɱҽԃαмєяιcαη thanks, how can I use `requests` with this link? – Ariaban Apr 12 '20 at 08:59
  • use this link with `requests` like any other link: `r = requests.get(link)`. It seems it doesn't need any special headers for this page. The only difference is that you can get the result with `r.json()` instead of `r.text`, and you don't have to use `BeautifulSoup`. – furas Apr 12 '20 at 09:18
  • @furas thanks for your answer, but I want to scrape the site with BeautifulSoup; I edited my question. How can I do that? – Ariaban Apr 12 '20 at 09:35
  • if you can get it directly as JSON then don't waste time for BeautifulSoup. – furas Apr 12 '20 at 18:37
  • @furas thank you, what is the API for this link? https://ihome.ir/sell-residential-apartment/th-tehran/district1-zafaraniyeh How can I find that? – Ariaban Apr 13 '20 at 07:31
  • use [DevTools](https://developers.google.com/web/tools/chrome-devtools) built into `Chrome/Firefox` (the `Network` tab) to see all requests sent from the browser to the server. If you also use the `XHR` filter (which means `AJAX`), you should see all requests sent by JavaScript. If you select one of these requests you can see all its headers and its response, and you can check (manually) whether it contains the data you need. – furas Apr 13 '20 at 12:58
  • @furas thank you my friend. I got it, – Ariaban Apr 13 '20 at 13:13
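
For reference, the XHR endpoint found in the comments can also be paged explicitly by incrementing its `page` parameter. This sketch only builds the 20 per-page URLs (no network call), assuming the query parameters seen in the DevTools request:

```python
from urllib.parse import urlencode

BASE = "https://scorpion.ihome.ir/v1/flatted-properties"


def page_params(page: int) -> dict:
    # Query parameters as they appear in the DevTools XHR request
    return {
        'is_sale': '1',
        'source': 'website',
        'paginate': '24',          # items per page
        'page': str(page),
        'locations[]': 'iran.th.tehran',
        'property_type[]': 'residential-apartment',
    }


# Build the URLs for pages 1..20; each can then be fetched
# with requests.get(url).json()
urls = [f"{BASE}?{urlencode(page_params(p))}" for p in range(1, 21)]
print(len(urls))
```

Fetching each URL in a loop (with a small delay between requests) is the polite alternative to requesting everything at once.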

1 Answer


There's no need to use BeautifulSoup, as the data you are looking for is already present in the JSON response!

Here's the back-end API the data is fetched from: https://scorpion.ihome.ir/v1/flatted-properties

You are looking to scrape 20 pages, and each page contains 24 items.

So 24 * 20 = 480. I've therefore adjusted the results per page to 480 and called the API once, which is better than looping over the pages multiple times.

Now you have a JSON dict from which you can access and extract whatever you want!

import requests


params = {
    'is_sale': '1',
    'source': 'website',
    'paginate': '480',  # 24 items per page * 20 pages
    'page': '1',
    'locations[]': 'iran.th.tehran',
    'property_type[]': 'residential-apartment'
}


def main(url):
    r = requests.get(url, params=params).json()
    for item in r['data']:
        print(item.keys())


main("https://scorpion.ihome.ir/v1/flatted-properties")
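
The printed keys tell you which fields the API actually exposes. From there, the price-padding rule from the question (one- or two-digit values are in billions, longer ones in millions) can be applied to each item. A minimal offline sketch; `title` and `sell_value` are hypothetical field names, so verify them against the real `item.keys()` output first:

```python
def normalize_price(raw: str) -> int:
    # Padding rule from the question: keep digits only, then pad
    # 1-2 digit values with nine zeros, longer values with six.
    digits = ''.join(filter(str.isdigit, raw))
    suffix = '000000000' if len(digits) <= 2 else '000000'
    return int(digits + suffix)


def extract(data):
    # 'title' and 'sell_value' are assumed field names -- check them
    # against the real API response before relying on this.
    return {item['title'].strip(): normalize_price(str(item['sell_value']))
            for item in data}


# Hypothetical sample mimicking one entry of r['data']
sample = [{'title': ' Apartment in Zafaraniyeh ', 'sell_value': '12'}]
print(extract(sample))  # {'Apartment in Zafaraniyeh': 12000000000}
```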