
I am parsing content with Python and Beautiful Soup, then writing it to a CSV file, and have run into a stubborn problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and then other unneeded data is stripped out.
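For reference, the tidy step is just a pass through pytidylib, roughly like this (a minimal sketch; the file name and option values are placeholders, not my exact configuration):

from tidylib import tidy_document

# 'page.html' is a placeholder file name; the options shown are illustrative
raw = open('page.html').read()
html, errors = tidy_document(raw, options={'numeric-entities': 1})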

The issue is that I need to retrieve all data between a set of <h3> tags.

Sample Data:

<h3><a href="Vol-1-pages-001.pdf">Pages 1-18</a></h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
 September 14 1880. Discussion of curricular matters. Students are
 debarred from taking algebra until they have completed both mental
 and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
 <ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
 President's room of the University building; 11 October 1880. All
 members present; 18 October 1880. Regular meeting 2. Moved that the
 President wait on the property holders on 12th street and request
 them to abate the nuisance on their property; 25 October 1880.
 Moved that the senior and junior classes for rhetoricals be...</li></ul>
 <h3><a href="Vol-1-pages-019.pdf">Pages 19-33</a></h3>`

I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags, but that doesn't work because there is not a one-to-one relationship between <h3> tags and <ul> tags.

The output I am looking to achieve is:

Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags.

The first two parts have not been a problem, but grabbing the content between a set of tags is difficult for me.

My current code is as follows:

import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque

html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'

html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg"    alt="" />':''}

for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile

    html = open(infile).read()

    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)

    #parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)

    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                      quoting=csv.QUOTE_NONE, escapechar=' ')  
    #retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            volume.append(title.string)
            fileName.append(title['href'])  # href of the current anchor, not the next one
            i += 1

    #retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #        summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)  # only finds the first text node, and the result is discarded
        summary.append(body)             # this appends the <h3> itself, not the content after it

    #print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        if summary:
            notes = summary.popleft()
        else:
            notes = "help"
        html_to_csv.writerow([pages, path, notes])
theMusician
  • Try using this XPath expression `html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href` – RanRag Jan 04 '12 at 18:31
  • Not quite, it isn't returning any results but I did not know that you could use Xpath inside of findAll. I'll play around with this. Thank you. – theMusician Jan 04 '12 at 18:48
  • And why don't you give `lxml` a try? BSoup is unmaintained, slow, and has an ugly API. – RanRag Jan 04 '12 at 18:54
  • [even its own maintainer recommends moving on to other libraries](http://www.crummy.com/software/BeautifulSoup/3.1-problems.html) – RanRag Jan 04 '12 at 18:55
  • @RanRag: the maintainer says: *tl;dr: Use the 4.0 series instead.* and *This page was originally written in March 2009. Since then, the 3.2 series has been released, replacing the 3.1 series, and development of the 4.x series has gotten underway. This page will remain up for _historical purposes_.* – jfs Jan 04 '12 at 23:02

2 Answers


Extract content between </h3> and <h3> tags:

from itertools import takewhile

h3s = soup('h3') # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))

The code assumes that all <h3> elements are siblings. If that is not the case, you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
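For the non-sibling case, a minimal sketch (assuming `soup` is the parsed document as above; note that nextGenerator() starts inside the <h3>, so its own link text is included, and filtering for NavigableString keeps only text nodes so nested tags are not counted twice):

from itertools import takewhile
from BeautifulSoup import NavigableString

h3s = soup('h3') # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # walk every following node (descendants included) until the next <h3>
    between_it = takewhile(lambda el: el is not h3next, h3.nextGenerator())
    # join only the text nodes
    print(''.join(el for el in between_it if isinstance(el, NavigableString)))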

jfs
  • Being new to Python I was not aware of the itertools library. That is a slick set of functions. The code does extract the content between the `</h3>` and `<h3>` tags really well. Using soup.find only ever grabs the first h3 tag in a document; is there a way to increment the soup find function so that I can retrieve all of the content on the page between `<h3>` tags? A for loop with a counter gives the same results as not using a loop because of the find function. I tried findAll() with no success because it conflicts with findNext(). – theMusician Jan 05 '12 at 18:27
  • @theMusician: write a loop: `h3s = soup('h3'); \n for h3, h3next in zip(h3s, h3s[1:]): \n between_it = # the same as above ...` – jfs Jan 05 '12 at 19:36
  • As I understand the proposed code, zip creates a set of tuples based on closing and opening h3 tags. It seems that zip is raising a type error when it encounters empty elements, an issue that was supposedly removed in version 2.4. http://docs.python.org/library/functions.html#zip `h3 = soup.find('h3') # find the first <h3>; h3next = h3.findNext('h3') # find next <h3>; h3s = soup('h3'); for h3, h3next in zip(h3s, h3s[1:]): between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())` – theMusician Jan 05 '12 at 21:54
  • @theMusician: The docs refer to `zip()` (no arguments) case. It doesn't apply here. The code should not produce `TypeError`. I've updated the answer. – jfs Jan 05 '12 at 22:28
  • @theMusician: Try to understand what each element of the code does, experiment in interactive Python shell e.g., try [`L = range(10); print zip(L, L[1:])`](http://ideone.com/haHdO). Read [Python tutorial](http://docs.python.org/tut). If it is too difficult try http://learnpythonthehardway.org/ (type the code and see the results). – jfs Jan 05 '12 at 22:37
  • Thank you for the assistance. It still produces a TypeError, but that may be because of the input data. By far the most succinct answer. – theMusician Jan 06 '12 at 18:46
  • Create a [minimal example (input data, code, the exact traceback that you get)](http://sscce.org/) that shows `TypeError` and [update your question](http://stackoverflow.com/posts/8731848/edit) e.g., [`''.join([1])`](http://ideone.com/fqy9J) – jfs Jan 06 '12 at 19:42
  • The minimal example is here: http://pastebin.com/RCgzV22v as the error only occurs once I reach a certain amount of data. The exact error is: Traceback (most recent call last): File "//psf/Host/Applications/MAMP/htdocs/uoassembly/python/archive_test_parse.py", line 529, in print(''.join(getattr(el, 'text', el) for el in between_it)) TypeError: sequence item 1: expected string or Unicode, NoneType found. It is unicode that is being input; Beautiful Soup should output only unicode. – theMusician Jan 07 '12 at 00:53
  • @theMusician: 1. the amount doesn't matter, the structure of the html document does (the code assumes that all `<h3>` elements are siblings, use `h3.nextGenerator()` otherwise). 2. Don't use `html_cleanup`, use `BeautifulSoup` capabilities to change the html. – jfs Jan 07 '12 at 11:18

If you are trying to extract data between <ul><li></li></ul> tags, lxml provides great functionality via CSS selectors:

import lxml.html
import urllib

data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read() # contains your html snippet
doc = lxml.html.fromstring(data)
elements = doc.cssselect('ul li') # CSS path [using firebug extension]
for element in elements:
    print element.text_content()

After executing the above code you will get all of the text between the ul/li tags. It is much cleaner than Beautiful Soup.

If you by any chance plan to use lxml, then you can evaluate XPath expressions in the following way:

import urllib
from lxml import etree

content = etree.HTML(urllib.urlopen("file:///C:/Users/ranveer/st.html").read())
content_text = content.xpath("html/body/h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href")
print content_text

You can change XPath according to your need.
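For the grouping problem in the question (everything between one <h3> and the next), a sketch along these lines is also possible with lxml's sibling iteration instead of a hand-written XPath. This is a minimal illustration that assumes the <h3> and <ul> elements are siblings; 'st.html' is again a placeholder file name:

import lxml.html

doc = lxml.html.parse('st.html').getroot() # 'st.html' is a placeholder
for h3 in doc.xpath('//h3'):
    link = h3.find('a') # the <a> inside the heading
    parts = []
    for sib in h3.itersiblings(): # walk the following siblings...
        if sib.tag == 'h3':       # ...stopping at the next <h3>
            break
        parts.append(sib.text_content())
    print '|'.join([link.text_content(), link.get('href'), ' '.join(parts)])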

RanRag
  • I tried the lxml library, and though it may be syntactically succinct, it seems to simply grab each ul element, rather than the ul elements between the heading tags. I'm going to pursue the xpath input. The sample I provided is just one piece of the document, I apologize if that was unclear, but when I iterate over the doc I can't get tripped up on multiple uls between the heading tags. – theMusician Jan 04 '12 at 21:16
  • Thanks RanRag for the assistance. The xpath using lxml does retrieve specific elements however, iteration does not seem to work well. The xpath doesn't accept variables, or at least I have yet to find it in the lxml documentation. It was a good road to travel though. – theMusician Jan 04 '12 at 23:07