3

I'm using both XPath and BeautifulSoup to scrape a webpage. XPath needs a tree as input and BeautifulSoup needs a soup as input. Here's the code to get the tree and the soup:

def get_tree(url):
    r = requests.get(url)
    tree = html.fromstring(r.content)
    return tree

# get soup
def get_soup(url):
    r = requests.get(url)
    data = r.text
    soup = BeautifulSoup(data)
    return soup

Both of these functions call requests.get(url). That response is what I want to save ahead of time. Here's the code in Python:

import requests
url = "http://www.nytimes.com/roomfordebate/2013/10/28/should-you-bribe-your-kids"
r = requests.get(url)
f = open('html','wb')
f.write(r)

I then got this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be convertible to a buffer, not Response

Here's the code that stores the content and reads it back, which also raises an error:

import requests
from lxml import html
url = "http://www.nytimes.com/roomfordebate/2013/02/13/when-divorce-is-a-family-affair"
r = requests.get(url)
c = r.content
outfile = open("html", "wb")
outfile.write(c)
outfile.close()
infile = open("html", "rb")
tree = html.fromstring(infile)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/lxml/html/__init__.py", line 662, in fromstring
    start = html[:10].lstrip().lower()
TypeError: 'file' object has no attribute '__getitem__'

How can I resolve this?

nazmus saif

3 Answers

3
infile = open("html", "rb")  # this is a file object, not a string

You need to read its contents with read() first, not just open it :-)

with open("html", "rb") as infile:
    content = infile.read()  # the saved page as a byte string
tree = html.fromstring(content)
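
To see why the open file fails, here is a minimal standard-library sketch. It assumes, based on the traceback in the question, that fromstring() slices its argument; the temporary file and its markup are made up for illustration.

```python
import os
import tempfile

# Write a tiny stand-in page to a temporary file (hypothetical content).
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<html><head><title>Demo</title></head></html>")
    path = tmp.name

with open(path, "rb") as infile:
    try:
        infile[:10]  # the kind of slicing shown in the traceback
    except TypeError as exc:
        slicing_error = exc  # file objects do not support slicing
    data = infile.read()  # bytes, which do support slicing

# The same expression from the traceback works on the file's contents.
start = data[:10].lstrip().lower()
os.remove(path)
```

The point is only that read() turns the file object into something that can be sliced, which is what the parser needs.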
Md. Mohsin
0

requests.get returns a Response object.

write() on a file opened in "wb" mode wants bytes (a plain str on Python 2), not the Response itself. What you want is the response's content, which is exactly that:

r = requests.get(url).content
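
A rough sketch of the save-and-reload step, using only the standard library so it runs offline. The helper names save_page and load_page are hypothetical, and the byte string stands in for what r.content would hold.

```python
import os
import tempfile


def save_page(content, path):
    """Write the raw body (e.g. r.content, not r itself) to disk."""
    if not isinstance(content, bytes):
        raise TypeError("expected bytes, got " + type(content).__name__)
    with open(path, "wb") as outfile:
        outfile.write(content)


def load_page(path):
    """Read the saved body back as bytes."""
    with open(path, "rb") as infile:
        return infile.read()


# Offline demonstration with made-up markup standing in for r.content.
body = b"<html><body><p>cached</p></body></html>"
path = os.path.join(tempfile.mkdtemp(), "page.html")
save_page(body, path)
restored = load_page(path)
```

With requests, the call would presumably look like save_page(requests.get(url).content, "html"), and the bytes returned by load_page can then be handed to whichever parser you use.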
JL Peyret
0

fromstring() expects a string as an input. Since you have a file, you need to use parse():

>>> tree = html.parse(infile)
>>> tree.findtext('//title')
When Divorce Is a Family Affair - Room for Debate - NYTimes.com
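
For comparison, a small sketch of both entry points side by side, assuming lxml is installed. The markup is made up, and io.BytesIO stands in for the opened file so the example needs no saved page.

```python
import io

from lxml import html

# Hypothetical page content standing in for the saved file.
page = b"<html><head><title>Saved page</title></head><body><p>Hi</p></body></html>"

# html.parse() takes a filename or file-like object and returns an ElementTree.
tree = html.parse(io.BytesIO(page))
title_from_parse = tree.findtext(".//title")

# html.fromstring() takes the content itself and returns the root element.
root = html.fromstring(page)
title_from_string = root.findtext(".//title")
```

Either route should give the same title; the difference is only whether you hand lxml the file or its contents.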
alecxe