SGML Parser in Python

Question

I am completely new to Python. I have the following code:

import sgmllib
import string

class FoundTitle(Exception):
  # Assumed definition; the original post does not show it.
  pass

class ExtractTitle(sgmllib.SGMLParser):

  def __init__(self, verbose=0):
    sgmllib.SGMLParser.__init__(self, verbose)
    self.title = self.data = None

  def handle_data(self, data):
    # Collect text only while inside a <title> element.
    if self.data is not None:
      self.data.append(data)

  def start_title(self, attrs):
    self.data = []

  def end_title(self):
    self.title = string.join(self.data, "")
    raise FoundTitle # abort parsing!

which extracts the title element from SGML. However, it only works for a single title. I know I have to override unknown_starttag and unknown_endtag in order to get all titles, but I keep getting it wrong. Help me please!

I have a large text file with SGML where I have tags of the format `<post id='100'> <title> new title </title> <text> <p> new text </p> </text> </post>`. I want my code to be able to give me this result in another file: new text — afg102, Jan 08 '11 at 09:14

Chris Morgan · Answer 1 · 2011-01-08T10:15:55.930

Beautiful Soup is one way to parse it nicely, and it's what I'd always use myself unless there was an extremely good reason not to. It's a lot simpler and more readable than using SGMLParser.

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''<post id='100'> <title> new title </title> <text> <p> new text </p> </text> </post>''')
>>> soup('post')  # soup.findAll('post') is equivalent
[<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>]
>>> for post in soup('post'):
...     print post.findChild('text')
...
<text> <p> new text </p> </text>

Once you've got it at this stage, you can do various things with it, depending on how you want it.

>>> post = soup.find('post')
>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post_text = post.findChild('text')
>>> post_text
<text> <p> new text </p> </text>

You might want to strip out the HTML.

>>> post_text.text
u'new text'

Or perhaps look at the contents...

>>> post_text.renderContents()
' <p> new text </p> '
>>> post_text.contents
[u' ', <p> new text </p>, u' ']

There are all sorts of things you could want to do. It helps if you're more specific, in particular by providing real data.

When it comes to manipulating the tree, you can do that too.

>>> post
<post id="100"> <title> new title </title> <text> <p> new text </p> </text> </post>
>>> post.title  # Just as good as post.findChild('title')
<title> new title </title>
>>> post.title.extract()  # Throws it out of the tree and returns it but we have no need for it
<title> new title </title>
>>> post  # title is gone!
<post id="100">  <text> <p> new text </p> </text> </post>
>>> post.findChild('text').replaceWithChildren()  # Throws away the <text> wrapping, keeping its children
>>> post
<post id="100">   <p> new text </p>  </post>

And so, finally, you'd have something like this:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('''
... <post id='100'> <title> new title 100 </title> <text> <p> new text 100 </p> </text> </post>
... <post id='101'> <title> new title 101 </title> <text> <p> new text 101 </p> </text> </post>
... <post id='102'> <title> new title 102 </title> <text> <p> new text 102 </p> </text> </post>
... ''')
>>> for post in soup('post'):
...     post.title.extract()
...     post.findChild('text').replaceWithChildren()
... 
<title> new title 100 </title>
<title> new title 101 </title>
<title> new title 102 </title>
>>> soup

<post id="100">   <p> new text 100 </p>  </post>
<post id="101">   <p> new text 101 </p>  </post>
<post id="102">   <p> new text 102 </p>  </post>

@virhilo: "slow"? Perhaps in processing it is, but in development time it tends to be brilliantly fast. And that's generally what matters now. And "dead"? It's got practically all that's desired, there's not anything much extra to *do* for it. The fact that it doesn't have any active development going on (which I will grant you) doesn't bother me much at all. — Chris Morgan, Jan 08 '11 at 10:17
Thanks guys, it's working now :) Any ideas on how to write the results to an external file, please? — afg102, Jan 08 '11 at 10:38
@afg102: with what I've got, you can then write it to a file with `outfile = open('filename', 'w')`, `outfile.write(soup.renderContents())` (`unicode(soup)` would work just as well, too) — Chris Morgan, Jan 08 '11 at 11:26
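The file-writing step from the comment above can be sketched as follows. This is a minimal Python 3 sketch using only the standard library; `write_text`, the `rendered` placeholder string and the `posts.sgml` file name are hypothetical and stand in for whatever the parser produced (for example the string returned by `soup.renderContents()`).

```python
import os
import tempfile


def write_text(path, text):
    """Write text to path, replacing any existing file (hypothetical helper)."""
    # The "with" block closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(text)


# Placeholder for the rendered result; not real data from the thread.
rendered = '<post id="100"> <p> new text 100 </p> </post>\n'

# A temporary directory keeps the example from touching real files.
out_path = os.path.join(tempfile.mkdtemp(), "posts.sgml")
write_text(out_path, rendered)
```

Using `with` rather than a bare `open()` means there is no need to remember `outfile.close()`.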

Andrew Dalke · Answer 2 · 2011-01-08T16:06:20.593

Your code resets the "title" attribute every time end_title() is called. The title you end up with is therefore the last title in the document.

What you need to do is store a list of all the titles you find. In the following, I also reset data to None (so you don't collect text data outside of title elements), and I use "".join instead of string.join because the latter is considered old-fashioned.

import sgmllib

class ExtractTitle(sgmllib.SGMLParser):
  def __init__(self, verbose=0):
    sgmllib.SGMLParser.__init__(self, verbose)
    self.titles = []   # every title found so far
    self.data = None   # None means "not inside a <title>"

  def handle_data(self, data):
    # Collect text only while inside a <title> element.
    if self.data is not None:
      self.data.append(data)

  def start_title(self, attrs):
    self.data = []

  def end_title(self):
    self.titles.append("".join(self.data))
    self.data = None   # stop collecting until the next <title>

and here it is in use:

>>> parser = ExtractTitle()
>>> parser.feed("<doc><rec><title>Spam and Eggs</title></rec>" +
...             "<rec><title>Return of Spam and Eggs</title></rec></doc>")
>>> parser.close()
>>> parser.titles
['Spam and Eggs', 'Return of Spam and Eggs']
>>>
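sgmllib was removed in Python 3, so the class above only runs on Python 2. A rough Python 3 sketch of the same idea, swapping in the standard library's html.parser.HTMLParser (my substitution, not something used in this thread), could look like the following. The class name `ExtractTitles` is made up for the example, and HTMLParser has no per-tag start_title/end_title hooks, so the tag name is checked by hand.

```python
from html.parser import HTMLParser


class ExtractTitles(HTMLParser):
    """Collect the text of every <title> element (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.titles = []   # every title found so far
        self.data = None   # None means "not inside a <title>"

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.data = []

    def handle_endtag(self, tag):
        if tag == "title" and self.data is not None:
            self.titles.append("".join(self.data))
            self.data = None   # stop collecting until the next <title>

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)


parser = ExtractTitles()
parser.feed("<doc><rec><title>Spam and Eggs</title></rec>"
            "<rec><title>Return of Spam and Eggs</title></rec></doc>")
parser.close()
print(parser.titles)  # ['Spam and Eggs', 'Return of Spam and Eggs']
```

The state handling is the same as in the sgmllib version: a list of titles plus a buffer that is only active between the start and end tag.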

How did it not work? What's your test case and how did it fail? I added an example to show that it does work for me. — Andrew Dalke, Jan 08 '11 at 16:07
ok great :D i had a slight error in my code. Thanks a lot! Do you have any idea about another question I posted pls? http://stackoverflow.com/questions/4634787/freqdist-with-nltk — afg102, Jan 08 '11 at 16:40

score 1 · Answer 3 · answered Jan 08 '11 at 09:35

Use lxml instead of SGMLParser:

>>> posts = """
... <post id='100'> <title> xxxx </title> <text> <p> yyyyy </p> </text> </post>
... <post id='101'> <title> new title1 </title> <text> <p> new text1 </p> </text> </post>
... <post id='102'> <title> new title2 </title> <text> <p> new text2 </p> </text> </post>
... """
>>> from lxml import html
>>> parsed = html.fromstring(posts)
>>> new_file = html.Element('div')
>>> for post in parsed:
...     post_id = post.attrib['id']
...     post_text = post.find('text').text_content()
...     new_post = html.Element('post', id=post_id)
...     new_post.text = post_text
...     new_file.append(new_post)
... 
>>> html.tostring(new_file)
'<div><post id="100"> yyyyy  </post><post id="101"> new text1  </post><post id="102"> new text2  </post></div>'
>>>
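If lxml is not available, roughly the same transformation can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch under a strong assumption: the input must be well-formed XML once wrapped in a single root element, which real SGML often is not. The `strip_posts` helper and the `root` wrapper element are invented for the example, and unlike the lxml session above the text is stripped of surrounding whitespace.

```python
import xml.etree.ElementTree as ET

posts = """
<post id='100'> <title> xxxx </title> <text> <p> yyyyy </p> </text> </post>
<post id='101'> <title> new title1 </title> <text> <p> new text1 </p> </text> </post>
"""


def strip_posts(markup):
    """Return a <div> holding one <post> per input post, text only."""
    # XML allows a single top-level element, so wrap the fragment.
    root = ET.fromstring("<root>" + markup + "</root>")
    out = ET.Element("div")
    for post in root.iter("post"):
        text_el = post.find("text")
        new_post = ET.SubElement(out, "post", id=post.get("id"))
        # itertext() yields the text of the element and all its descendants.
        joined = "".join(text_el.itertext()) if text_el is not None else ""
        new_post.text = joined.strip()
    return out


result = strip_posts(posts)
serialized = ET.tostring(result, encoding="unicode")
```

For input that is not well-formed, a forgiving parser such as lxml.html or Beautiful Soup remains the safer choice.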

virhilo


Thanks for your reply. I am trying to extract from a file, so I did `filexy = open(fileurl)` and `posts = filexy.read()` and then ran your code. However, for some reason it only shows the same text (i.e. it is not looping through all the tags). Do you have any idea? Thanks – afg102 Jan 08 '11 at 10:17
could you paste some example document? – virhilo Jan 08 '11 at 10:20
I was wondering if maybe you guys were into NLTK. I'm using the function FreqDist to get the frequency of the words in my text obtained from the file I generated. I tried this: `filey = open(fileurl)`, `p = filey.read()`, `fdist = FreqDist(p)`, `vocab = fdist.keys()`, `vocab[:30]`, but the result is a list of single letters, whereas in the example from the NLTK website it is a list of whole words. Any help please? – afg102 Jan 08 '11 at 14:47
That'd be a new question, no? – TryPyPy Jan 09 '11 at 06:20
