
I would like to read some of the information from this website: http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm

I have the following code, and it properly reads the HTML source

import urllib2

def Connect2Web():
    # fetch the page and print its raw HTML source
    aResp = urllib2.urlopen("http://www.federalreserve.gov/monetarypolicy/"
                            "beigebook/beigebook201301.htm")

    web_pg = aResp.read()

    print web_pg

I am lost on how to parse this information, however, because most HTML parsers seem to want a file or a URL, whereas I already have the HTML I need in a string.

— weskpga

4 Answers


We started with BeautifulSoup some time ago but eventually moved to lxml:

from lxml import html

# parse the HTML string you already have; no file or URL needed
my_tree = html.fromstring(web_pg)
elements = [item for item in my_tree.iter()]   # every element in the document

So now you have to decide which elements you want, and you need to make sure that the elements you keep are not children of other elements you have already decided to keep. For instance:

<div> some stuff
  <table>
    <tr>
      <td> banana </td>
    </tr>
  </table>
  some more stuff
</div>

In the HTML above, the table is a child of the div, so everything in the table is also contained in the div; you have to use some logic to keep only those elements whose parents have not already been kept.
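A rough sketch of that filtering logic (it assumes my_tree from the snippet above, and the tags it keeps are only an example, not a recommendation for the Beige Book page):

kept = []
for element in my_tree.iter():
    if element.tag not in ('div', 'table'):   # example: only these tags interest us
        continue
    # skip this element if one of its ancestors was already kept
    if any(ancestor in kept for ancestor in element.iterancestors()):
        continue
    kept.append(element)

With the example HTML above, this keeps the div but skips the table, because the table's ancestor is already in kept.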

— PyNEwbie
from bs4 import BeautifulSoup
# BeautifulSoup accepts the HTML string directly, no file needed
soup = BeautifulSoup(web_pg)
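From there you can query the soup; a minimal sketch (the <p> tag is just an illustration, not something specific to the Beige Book page):

# print the text of every paragraph in the parsed document
for p in soup.find_all("p"):
    print p.get_text()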
— Joran Beasley

If you like jQuery, use pyquery.

Start with:

from pyquery import PyQuery as pq

d = pq(web_pg)

or even

from pyquery import PyQuery as pq

d = pq(url="http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm")

Now d is like the $ in jQuery:

p = d("#hello") # get element with id="hello"
print p.html() # print as html

p = d('#content p:first') # get first <p> from element with id="content"
print p.text() # print as text
— furas
  • strong upvote – pyquery is the easiest solution for painless HTML munging. If the direct `pq(url=...)` call fails (e.g. lxml complaining about unsupported "Unicode strings with encoding declaration"), fetch the page first via `urllib.urlopen(url).read()` and then feed that string to pyquery. – eMPee584 Jul 09 '13 at 18:30
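A minimal sketch of that workaround, using urllib2 as in the question rather than the urllib call the comment mentions:

import urllib2
from pyquery import PyQuery as pq

url = "http://www.federalreserve.gov/monetarypolicy/beigebook/beigebook201301.htm"
raw = urllib2.urlopen(url).read()   # fetch the raw HTML first
d = pq(raw)                         # then hand the string to pyquery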

You can also use the re module (regular expressions) to parse this information; in fact it can parse any text. It is faster than BeautifulSoup and the other parsers, but at the same time regular expressions are harder to learn.

See the re module documentation for the full syntax.

example:

import re
# non-greedy group that captures whatever sits between <p> and </p>
p = re.compile(r'<p>(.*?)</p>')
content = r'<p> something </p>'
data = re.findall(p, content)
print data

It prints:

[' something ']

This example extracts the content between <p> and </p>.

This is just a very simple example of a regular expression.

Regular expressions are worth learning because they can do more than the other tools.
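For instance, applied to the page from the question (a sketch only: it assumes web_pg from the question's code, and the <td> pattern is just a guess at what you might want to pull out):

import re
# grab the contents of every table cell in the fetched page
cells = re.findall(r'<td[^>]*>(.*?)</td>', web_pg, re.DOTALL)
print cells[:5]   # show the first few matches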

So, just learn it!

— sdvcrx