I'm using both XPath (via lxml) and BeautifulSoup to scrape a web page. XPath needs an lxml tree as input and BeautifulSoup needs a soup object. Here's the code to get the tree and the soup:
# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup
Both of these functions call requests.get(url). That response is what I want to save to disk ahead of time, so the page is downloaded only once. Here's my first attempt in Python:
import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)
This raises the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response
Here's a second attempt that writes r.content to a file and then tries to parse the file, which also fails:
import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'
How can I save the response to a file and then build both the tree and the soup from it?
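For reference, here is a minimal sketch of the workflow being asked about. It assumes that both tracebacks come from passing the wrong type: f.write() needs the response body (r.content) rather than the Response object, and html.fromstring() needs the file's contents rather than the open file object. The helper names save_page, load_bytes, get_tree_from_file and get_soup_from_file are hypothetical, and "html.parser" is just one parser choice for BeautifulSoup.

```python
import requests
from bs4 import BeautifulSoup
from lxml import html


def save_page(url, path):
    """Download url once and store the raw response body (bytes) at path."""
    r = requests.get(url)
    r.raise_for_status()
    with open(path, "wb") as f:
        f.write(r.content)  # write the bytes, not the Response object


def load_bytes(path):
    """Read the saved page back as bytes."""
    with open(path, "rb") as f:
        return f.read()


def get_tree_from_file(path):
    # html.fromstring expects a string or bytes, not an open file object
    return html.fromstring(load_bytes(path))


def get_soup_from_file(path):
    # naming the parser explicitly is an assumption; any installed parser works
    return BeautifulSoup(load_bytes(path), "html.parser")
```

With these, save_page(url, "html") would be called once, and get_tree_from_file("html") and get_soup_from_file("html") could then be used in place of get_tree(url) and get_soup(url) without hitting the network again.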