
I would like to read some of the information from this website: http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm

I have the following code, and it properly reads the HTML source

import urllib2

def Connect2Web():
    # fetch the page and print its raw HTML source
    aResp = urllib2.urlopen("http://www.federalreserve.gov/monetarypolicy/"
                            "beigebook/beigebook201301.htm")

    web_pg = aResp.read()

    print web_pg

I am lost on how to parse this information, however, because most HTML parsers seem to want a file or a URL, whereas I already have the HTML I need in a string.

— weskpga

4 Answers


We started with BeautifulSoup some time ago but eventually moved to lxml:

from lxml import html

# parse the HTML string you already have; no file or URL needed
my_tree = html.fromstring(web_pg)
elements = [item for item in my_tree.iter()]   # every element in the document

So now you have to decide which elements you want, and you need to make sure that the elements you keep are not children of other elements you have already decided to keep. For instance:

<div> some stuff
  <table>
    <tr>
      <td> banana </td>
    </tr>
  </table>
  some more stuff
</div>

In the HTML above, the table is a child of the div, so everything in the table is also contained in the div; you have to use some logic to keep only those elements whose parents have not already been kept.
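A rough sketch of that filtering logic (it assumes my_tree from the snippet above, and the tags it keeps are only an example, not a recommendation for the Beige Book page):

kept = []
for element in my_tree.iter():
    if element.tag not in ('div', 'table'):   # example: only these tags interest us
        continue
    # skip this element if one of its ancestors was already kept
    if any(ancestor in kept for ancestor in element.iterancestors()):
        continue
    kept.append(element)

With the example HTML above, this keeps the div but skips the table, because the table's ancestor is already in kept.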

— PyNEwbie
from bs4 import BeautifulSoup
# BeautifulSoup accepts the HTML string directly, no file needed
soup = BeautifulSoup(web_pg)
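From there you can query the soup; a minimal sketch (the <p> tag is just an illustration, not something specific to the Beige Book page):

# print the text of every paragraph in the parsed document
for p in soup.find_all("p"):
    print p.get_text()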
— Joran Beasley

If you like jQuery, use pyquery.

Start with:

from pyquery import PyQuery as pq

d = pq(web_pg)

or even

from pyquery import PyQuery as pq

d = pq(url="http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm")

Now d is like the $ in jQuery:

p = d("#hello") # get element with id="hello"
print p.html() # print as html

p = d('#content p:first') # get first <p> from element with id="content"
print p.text() # print as text
— furas
  • strong upvote – pyquery is the easiest solution for painless HTML munging. If the direct `pq(url=...)` call fails (e.g. lxml complaining about unsupported "Unicode strings with encoding declaration"), fetch the page first via `urllib.urlopen(url).read()` and then feed that string to pyquery. – eMPee584 Jul 09 '13 at 18:30
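A minimal sketch of that workaround, using urllib2 as in the question rather than the urllib call the comment mentions:

import urllib2
from pyquery import PyQuery as pq

url = "http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm"
raw = urllib2.urlopen(url).read()   # fetch the raw HTML first
d = pq(raw)                         # then hand the string to pyquery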

You can also use the re module (regular expressions) to parse this information; in fact it can parse any text. It is faster than BeautifulSoup and the other parsers, but at the same time regular expressions are harder to learn.

See the re module documentation for the full syntax.

example:

import re
# non-greedy group that captures whatever sits between <p> and </p>
p = re.compile(r'<p>(.*?)</p>')
content = r'<p> something </p>'
data = re.findall(p, content)
print data

It prints:

[' something ']

This example extracts the content between <p> and </p>.

This is just a very simple example of a regular expression.

Regular expressions are worth learning because they can do more than the other tools.
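For instance, applied to the page from the question (a sketch only: it assumes web_pg from the question's code, and the <td> pattern is just a guess at what you might want to pull out):

import re
# grab the contents of every table cell in the fetched page
cells = re.findall(r'<td[^>]*>(.*?)</td>', web_pg, re.DOTALL)
print cells[:5]   # show the first few matches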

So, just learn it!

— sdvcrx