Store HTML in Python

Question

I'm using both XPath and BeautifulSoup to scrape a web page. XPath needs a tree as input and BeautifulSoup needs a soup as input. Here is the code to get the tree and the soup:

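import requests
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4
from lxml import html

# get tree
def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    # name the parser explicitly to avoid a "no parser specified" warning
    soup = BeautifulSoup(data, "html.parser")
    return soup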

Both of these functions call requests.get(url). That response is what I want to store ahead of time. Here's my attempt in Python:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

And then I got an error like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

Here's the code that stores the content and reads it back, along with the error I got:

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How could I resolve this?

Md. Mohsin · Accepted Answer · 2014-11-14T02:40:25.507

3

infile = open("html", "rb")  # this is a file object, not a string

You need to read its contents with read() first, not just open it :-)

infile = open("html", "rb")
content = infile.read()  # read the file's contents into a string
infile.close()
tree = html.fromstring(content)
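
For reference, the whole save-and-reload round trip might look like the sketch below. It assumes Python 3 and lxml; `sample_content` is a made-up stand-in for `r.content`, and `save_html`, `load_tree` and the `page.html` file name are hypothetical helpers chosen for illustration, not part of the original post.

```python
import os
import tempfile

from lxml import html

# Stand-in for r.content from requests.get(url); made up so the sketch runs offline.
sample_content = (
    b"<html><head><title>Sample page</title></head>"
    b"<body><p>Hello</p></body></html>"
)


def save_html(content, path):
    """Write the raw response bytes to disk."""
    with open(path, "wb") as outfile:
        outfile.write(content)


def load_tree(path):
    """Read the bytes back and hand them (not the file object) to fromstring()."""
    with open(path, "rb") as infile:
        return html.fromstring(infile.read())


# Hypothetical location for the stored page.
path = os.path.join(tempfile.mkdtemp(), "page.html")
save_html(sample_content, path)

tree = load_tree(path)
title = tree.findtext(".//title")
print(title)
```

The `with` blocks close the files automatically. The same bytes read from disk could also be passed to BeautifulSoup to build the soup, so the page only has to be downloaded once.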

edited Nov 14 '14 at 02:40

answered Nov 13 '14 at 22:30

score 0 · Answer 2 · answered Nov 07 '14 at 22:04

requests.get returns a Response object.

write() wants the data itself, not the Response object. What you want is the response's content, which is a byte string and so suits a file opened in "wb" mode.

r = requests.get(url).content
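
To illustrate the difference between the two forms of the body: in Requests, `r.content` is the raw bytes and `r.text` is the decoded string. The sketch below assumes Python 3 and uses made-up stand-in values instead of a live request; the file names are hypothetical.

```python
import os
import tempfile

# Stand-ins for the two views of a response body (made up for illustration):
raw_bytes = "<p>caf\u00e9</p>".encode("utf-8")  # plays the role of r.content
decoded_text = raw_bytes.decode("utf-8")        # plays the role of r.text

tmp_dir = tempfile.mkdtemp()
bytes_path = os.path.join(tmp_dir, "page_bytes.html")
text_path = os.path.join(tmp_dir, "page_text.html")

# Bytes go into a file opened in binary mode.
with open(bytes_path, "wb") as f:
    f.write(raw_bytes)

# Text goes into a file opened in text mode, with an explicit encoding.
with open(text_path, "w", encoding="utf-8") as f:
    f.write(decoded_text)

# Read both files back as bytes to compare what ended up on disk.
with open(bytes_path, "rb") as f:
    from_bytes = f.read()
with open(text_path, "rb") as f:
    from_text = f.read()

print(from_bytes == from_text)
```

Writing the bytes unchanged avoids having to guess the encoding, which is one reason to prefer the binary route when the page is only being stored for later parsing.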

JL Peyret

Thanks. I still get the error when loading the content. See the updated post. – f4fc2791e4473eb2ba41b5ddb445b2 Nov 07 '14 at 22:41

score 0 · Answer 3 · answered Nov 07 '14 at 23:51

fromstring() expects a string as input. Since you have a file object, you need to use parse():

>>> tree = html.parse(infile)
>>> tree.findtext('//title')
When Divorce Is a Family Affair - Room for Debate - NYTimes.com
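
A small offline sketch of the same idea, assuming Python 3 and lxml: `io.BytesIO` stands in for the opened file, and the sample markup is made up. Note that parse() returns an ElementTree, so getroot() is used here to reach the root element, whereas fromstring() returns the element directly.

```python
import io

from lxml import html

# Made-up markup standing in for the stored page.
sample = (
    b"<html><head><title>Sample page</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# A file-like object, playing the role of open("html", "rb").
infile = io.BytesIO(sample)

# parse() accepts a file name or a file-like object and returns an ElementTree.
doc = html.parse(infile)
root = doc.getroot()
titles = root.xpath("//title/text()")

# fromstring() takes the content itself and returns the root element.
from_string = html.fromstring(sample)
same_title = from_string.findtext(".//title")

print(titles, same_title)
```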

alecxe
