
I am trying to scrape a site with a table of real-estate listings, and I am accessing the link in each listing/row to get more info about it (surface, region). The table has 25 rows per page, and it takes around 10-12 s per page (so 25 links accessed per page), which I find really slow (there are around 850 pages). I tried using requests.Session() instead of requests.get(), but I can't tell if it's the same or slightly worse.

My question is this: does using a Session() to access each link only once, but all on the same site, actually slow the script down compared to just using requests.get(), or is requests.Session() smart enough to keep the cookies/connection for every link inside the site? I.e., if I use:

import requests

s = requests.Session()
response = s.get("http://www.tunisie-annonce.com/AnnoncesImmobilier.asp")

and then I collect the links inside the response above and access them through the session I opened from the main URL, like so:

import time

from bs4 import BeautifulSoup

def get_surfaces_and_region(session, links):
    for link in links:
        start = time.time()
        html = session.get(link)  # instead of requests.get(link)
        new_page = BeautifulSoup(html.text, 'lxml')
        ### do stuff here ###

What happens when I access another link through the previously opened session? Is the response added to some list in case I access the same link again? Would that theoretically slow down the get requests, given that every link here is unique? If so, is a Session only useful for accessing the same URL multiple times over (a use case I am struggling to imagine)?

Thank you

Dimbo123

1 Answer


Please refer to this post and this article about HTTP persistent connections. Using requests.Session() can result in a significant performance increase when used in the right way.

If you’re making several requests to the same host, the underlying TCP connection will be reused when using requests.Session(), which can result in a significant performance increase.
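
To see the effect yourself, here is a minimal timing sketch (the URL is the one from your question; N is an arbitrary illustrative count) that compares fresh requests.get() calls against a shared Session hitting the same host:

import time
import requests

URL = "http://www.tunisie-annonce.com/AnnoncesImmobilier.asp"
N = 10  # illustrative request count

# Fresh TCP connection (handshake included) for every call:
start = time.time()
for _ in range(N):
    requests.get(URL)
print(f"requests.get(): {time.time() - start:.2f}s")

# A single Session reuses the underlying connection via its pool:
start = time.time()
with requests.Session() as s:
    for _ in range(N):
        s.get(URL)
print(f"Session.get():  {time.time() - start:.2f}s")

On a typical connection the Session loop should come out faster, since the TCP handshake happens only once per pooled connection rather than once per request.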

However, it would be helpful to see your full code. It seems like you're using BeautifulSoup in your scraping pipeline; this could be the real performance bottleneck of your script, because even with the lxml backend, BeautifulSoup builds its parse tree in Python, which makes it comparatively slow.
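
If parsing does turn out to dominate, one common mitigation is bs4's SoupStrainer, which tells BeautifulSoup to build the tree only for the tags you actually need. Since your full code isn't shown, the 'table' tag below is just a guess at where the listing details live:

from bs4 import BeautifulSoup, SoupStrainer

# Only build the parse tree for <table> elements; swap 'table' for
# whatever element actually holds the surface/region details.
only_tables = SoupStrainer('table')

def get_surfaces_and_region(session, links):
    for link in links:
        html = session.get(link)
        new_page = BeautifulSoup(html.text, 'lxml', parse_only=only_tables)
        ### do stuff here ###

Note that SoupStrainer works with the lxml and html.parser backends, but not with html5lib.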

Gordian