Python - save requests or BeautifulSoup object locally

Question

I have some code that is quite long, so it takes a long time to run. I want to simply save either the requests object (in this case "name") or the BeautifulSoup object (in this case "soup") locally so that next time I can save time. Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'SOMEURL'
name = requests.get(url)
soup = BeautifulSoup(name.content)

You might find the [`pickle`](https://docs.python.org/3/library/pickle.html) module useful ... — Zero Piraeus, May 29 '14 at 22:08
What about just saving `html` source code into the `html` files? — alecxe, May 29 '14 at 22:09

merlin2011 · Accepted Answer · 2014-05-30T01:33:15.397

9

Since name.content is just the page's HTML (as raw bytes), you can dump it to a file and read it back later.

Usually the bottleneck is not the parsing, but instead the network latency of making requests.

from bs4 import BeautifulSoup
import requests

url = 'https://google.com'
name = requests.get(url)

# name.content is bytes, so the file must be opened in binary mode
with open("/tmp/A.html", "wb") as f:
  f.write(name.content)


# read it back in
with open("/tmp/A.html", "rb") as f:
  soup = BeautifulSoup(f, "html.parser")
  # do something with soup
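
Building on the dump-and-reload idea, a minimal sketch of a cache-or-fetch helper might look like the following. The names `load_soup`, `fetch_bytes`, and the `fetch` parameter are hypothetical and not part of requests or BeautifulSoup; the `html.parser` choice and the 30-second timeout are assumptions too.

```python
from pathlib import Path

import requests
from bs4 import BeautifulSoup


def fetch_bytes(url):
    """Download a page and return its body as bytes (hypothetical helper)."""
    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()
    return response.content


def load_soup(url, cache_path, fetch=fetch_bytes):
    """Return a soup for url, hitting the network only if no cached copy exists."""
    cache_path = Path(cache_path)
    if not cache_path.exists():
        # First run: download once and keep the raw bytes on disk.
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        cache_path.write_bytes(fetch(url))
    # Every run: parse from the local copy.
    return BeautifulSoup(cache_path.read_bytes(), "html.parser")
```

Calling `load_soup(url, "/tmp/A.html")` twice should only make one request; deleting the file forces a fresh download. Caching the bytes rather than the parsed object keeps the file readable by any tool.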

Here is some anecdotal evidence that the bottleneck is the network rather than the parsing.

from bs4 import BeautifulSoup
import requests
import time

url = 'https://google.com'

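# time.clock() was removed in Python 3.8; perf_counter() measures wall-clock time
t1 = time.perf_counter()
name = requests.get(url)
t2 = time.perf_counter()
soup = BeautifulSoup(name.content)
t3 = time.perf_counter()

# request time, then parse time, in seconds
print(t2 - t1, t3 - t2)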

Output from a run on a ThinkPad X1 Carbon on a fast campus network (request time, then parse time, in seconds):

0.11 0.02

edited May 30 '14 at 01:33

answered May 29 '14 at 22:15

merlin2011


FYI, you can replace `BeautifulSoup(f.read())` with just `BeautifulSoup(f)`. – alecxe May 29 '14 at 22:25
I have no idea why this response is either accepted or has 8 upvotes. I have just tried it and it is not working. Namely, when trying to f.write(name.content) it returns "TypeError: write() argument must be str, not bytes". Of course, because the type of the Response.content object is byte not html. – G.T. May 10 '22 at 18:37
Probably this was working in earlier version of requests, but in version 2.26.0 this method of writing requests object in a file is not working and the reason is that name.content is byte type not text. – G.T. May 10 '22 at 18:42
Perhaps change `f.write(name.content)` to `f.write(str(name.content))`. The former is byte but the latter is a string. – Borhan Kazimipour May 12 '22 at 02:59
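
To illustrate the bytes-versus-text point raised in these comments, here is a small sketch that needs no network. The `content` variable is a stand-in for `name.content`, and UTF-8 is an assumed encoding; a real response may use a different one. Note that `str()` on bytes does not decode them, it produces their printable representation.

```python
import os
import tempfile

# Stand-in for name.content: requests returns the body as bytes.
content = "<p>caf\u00e9</p>".encode("utf-8")

# str() on bytes gives the repr (text starting with b'), not the decoded HTML.
wrapped = str(content)

# Decoding with the correct encoding gives the real text (UTF-8 assumed here).
decoded = content.decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "A.html")

    # Option 1: write the raw bytes in binary mode.
    with open(path, "wb") as f:
        f.write(content)
    with open(path, "rb") as f:
        round_trip_bytes = f.read()

    # Option 2: write decoded text in text mode with an explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(decoded)
    with open(path, encoding="utf-8") as f:
        round_trip_text = f.read()

print(wrapped.startswith("b'"), round_trip_bytes == content, round_trip_text == decoded)
```

With requests, `name.content` corresponds to option 1 and `name.text` to option 2.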

score 1 · Answer 2 · answered Dec 05 '19 at 01:19

Storing requests locally and restoring them as Beautiful Soup objects later on

If you are iterating through the pages of a website, you can save each page to disk as you request it, as shown here. First create a folder named soupCategory in the same folder as your script.

Use any recent user agent string for the headers.

import time

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0 Safari/605.1.15'}

def getCategorySoup():
    # session that retries failed connections with exponential backoff
    session = requests.Session()
    retry = Retry(connect=7, backoff_factor=0.5)

    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    basic_url = "https://www.somescrappingdomain.com/apartments?adsWithImages=1&page="
    t0 = time.time()
    totalPages = 1525 # put your number of pages here
    # +1 so that the last page is fetched too
    for i in range(1, totalPages + 1):
        url = basic_url + str(i)
        # go through the session so the retry adapter is actually used
        r = session.get(url, headers=headers)
        pageName = "./soupCategory/" + str(i) + ".html"
        with open(pageName, mode='w', encoding='UTF-8', errors='strict', buffering=1) as f:
            f.write(r.text)
            print(pageName, end=" ")
    t1 = time.time()
    total = t1 - t0
    print("Total time for getting", totalPages, "category pages is", round(total), "seconds")

Later on you can create a Beautiful Soup object, as @merlin2011 mentioned, with:

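# same relative folder and encoding that the pages were saved with
with open("./soupCategory/1.html", encoding="UTF-8") as f:
  soup = BeautifulSoup(f, "html.parser")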
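
To restore every saved page rather than just the first, a sketch along the following lines may help. The function name `iter_saved_soups` is hypothetical, and it assumes the files are named `<page number>.html` and were written as UTF-8, as in the download loop above; the `html.parser` choice is also an assumption.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def iter_saved_soups(folder="./soupCategory"):
    """Yield (page_number, soup) for each <number>.html file, in numeric order."""
    # Keep only files whose name is a plain page number, e.g. 12.html.
    pages = [p for p in Path(folder).glob("*.html") if p.stem.isdigit()]
    # Sort numerically so that 10.html comes after 2.html, not before it.
    for path in sorted(pages, key=lambda p: int(p.stem)):
        with open(path, encoding="UTF-8") as f:
            yield int(path.stem), BeautifulSoup(f, "html.parser")
```

Because it is a generator, only one parsed page is held in memory at a time, which matters when there are over a thousand files: `for number, soup in iter_saved_soups(): ...`.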
