How to speed up scraping in Python?

Question

I want to display the 10 hotels that are best from the user's perspective. Suppose the user enters 'pool': I then have to match the keyword 'pool' in the user reviews from TripAdvisor, count the matches, and display the top 10 hotel names by count. For this purpose I am currently scraping all the reviews of hotels (Dubai), and then I will match the keyword and display the top 10 hotel names. But scraping the hotel reviews is taking too much time. What can I do? Is there any method other than scraping? This is my code for scraping reviews:

import requests
from bs4 import BeautifulSoup

offset = 0
url = 'https://www.tripadvisor.com/Hotels-g295424-oa' + str(offset) + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

urls = []
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# the "last" pagination link carries the number of the final page
for link in soup.find_all('a', {'last'}):
    page_number = link.get('data-page-number')
    last_offset = int(page_number) * 30
    print('last offset:', last_offset)

# the listing is paged in steps of 30 hotels
for offset in range(0, last_offset, 30):
    print('--- page offset:', offset, '---')

    url = 'https://www.tripadvisor.com/Hotels-g295424-oa' + str(offset) + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

    for link in soup.find_all('a', {'property_title'}):
        iurl = 'https://www.tripadvisor.com/' + link.get('href')

        r = requests.get(iurl)
        soup = BeautifulSoup(r.content, "lxml")
        # look for the partial entry of the review
        resultsoup = soup.find_all("p", {"class": "partial_entry"})

        for review in resultsoup:
            review_list = review.get_text()
            print(review_list)

score 1 · Answer 1 · answered Jan 07 '17 at 12:16

You should store the scraped data in a database so that you can reuse it instead of doing the same work again.
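As a minimal sketch of the database idea, Python's built-in sqlite3 module can cache reviews per hotel so that a hotel page is only fetched once and keyword counts come from a query. The table layout and the helper names (open_cache, is_cached, save_reviews, top_hotels) are assumptions made for illustration; they are not part of the original code.

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a small SQLite cache of scraped reviews.

    Pass a file name such as "reviews.db" to keep the data between runs.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "hotel_url TEXT NOT NULL, "
        "review TEXT NOT NULL, "
        "UNIQUE(hotel_url, review))"
    )
    return conn


def is_cached(conn, hotel_url):
    """Return True if at least one review is already stored for this hotel."""
    row = conn.execute(
        "SELECT 1 FROM reviews WHERE hotel_url = ? LIMIT 1", (hotel_url,)
    ).fetchone()
    return row is not None


def save_reviews(conn, hotel_url, reviews):
    """Store review texts for a hotel, skipping exact duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO reviews (hotel_url, review) VALUES (?, ?)",
            [(hotel_url, text) for text in reviews],
        )


def top_hotels(conn, keyword, limit=10):
    """Rank hotels by the number of stored reviews that mention the keyword."""
    pattern = "%" + keyword + "%"  # LIKE is case-insensitive for ASCII in SQLite
    return conn.execute(
        "SELECT hotel_url, COUNT(*) AS hits FROM reviews "
        "WHERE review LIKE ? "
        "GROUP BY hotel_url "
        "ORDER BY hits DESC, hotel_url "
        "LIMIT ?",
        (pattern, limit),
    ).fetchall()
```

In the scraping loop you would call is_cached(conn, iurl) before requesting a hotel page and save_reviews(conn, iurl, texts) after parsing it. A keyword search then becomes top_hotels(conn, 'pool'), with no network access at all.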

There is also a slight improvement you can make to your code: use requests.Session() to reuse the connection to the server.

From the Requests documentation:

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).

with requests.Session() as session:
    for offset in range(0, last_offset, 30):
        print('--- page offset:', offset, '---')

        url = 'https://www.tripadvisor.com/Hotels-g295424-oa' + str(offset) + '-Dubai_Emirate_of_Dubai-Hotels.html#EATERY_LIST_CONTENTS'

        # every request goes through the same session, so the connection is reused
        r = session.get(url)
        soup = BeautifulSoup(r.text, "html.parser")

        for link in soup.find_all('a', {'property_title'}):
            iurl = 'https://www.tripadvisor.com/' + link.get('href')

            r = session.get(iurl)

Is it OK if I match the keyword on the webpage and not scrape the reviews? — Techgeeks1, Jan 07 '17 at 12:25
@Hifza ahmad Yes, you can. Do not use bs4; just get `response.text`, then use a regex to find all occurrences of the keyword. This will be much faster. bs4 is the slowest way to parse the HTML code. — 宏杰李, Jan 07 '17 at 12:50
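
The regex approach from the last comment could look roughly like the sketch below. It counts whole-word, case-insensitive matches of the keyword directly in the page text and ranks hotels by that count. The function names (count_keyword, rank_hotels) and the pages dictionary are assumptions for illustration. Note that matching on raw HTML also counts occurrences inside markup, scripts, and attributes, so the numbers are only approximate.

```python
import re
from collections import Counter


def count_keyword(html, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in raw page text."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return len(pattern.findall(html))


def rank_hotels(pages, keyword, limit=10):
    """Rank hotels by keyword count.

    pages maps a hotel name to the raw text of its page,
    for example the value of response.text.
    """
    counts = Counter(
        {name: count_keyword(html, keyword) for name, html in pages.items()}
    )
    return counts.most_common(limit)
```

With pages filled from session.get(iurl).text for each hotel, rank_hotels(pages, 'pool') returns a list of up to 10 (name, count) pairs, highest count first.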
