In Python SGMLParser, can't parse '<br/>' (no space before the slash) but can parse '<br />'

Question

With Python's SGMLParser, I can't parse '<br/>' (no space before the slash), but '<br />' works.

The code below parses my HTML successfully. But if I change the tag from '<br />' to '<br/>', that is, just delete the space, the HTML is no longer parsed correctly.

Is there any way to solve this other than replacing the tag?

The successful example:

# coding=utf-8  
from sgmllib import SGMLParser #get SGML

class ListName(SGMLParser):#parser class
    def reset(self):
        self.is_a = False #get <a></a>
        self.name=[] #get text
        SGMLParser.reset(self)
    def start_a(self,attrs):
        self.is_a = True

    def end_a(self):
        self.is_a = False

    def handle_data(self,data):
        if self.is_a:
            self.name.append(data)

if __name__ == '__main__':
    urls='''
    <tr>
<td height="207" colspan="2" align="left" valign="top" class="normal">
<p>Damien Rice - 《0》 </p>
<a href="http://galeki.xy568.net/music/Delicate.mp3">1. Delicate</a><br />
<a href="http://galeki.xy568.net/music/Volcano.mp3">2. Volcano</a><br />
<a href="http://galeki.xy568.net/music/The Blower's Daughter.mp3">3. The Blower's Daughter</a><br />
<a href="http://galeki.xy568.net/music/Cannonball.mp3">4. Cannonball </a><br />
<a href="http://galeki.xy568.net/music/Older Chests.mp3">5. Order Chests</a><br />
<a href="http://galeki.xy568.net/music/Amie.mp3">6. Amie</a><br />
<a href="http://galeki.xy568.net/music/Cheers Darlin'.mp3">7. Cheers Darling</a><br />
<a href="http://galeki.xy568.net/music/Cold Water.mp3">8. Cold water</a><br />
<a href="http://galeki.xy568.net/music/I Remember.mp3">9. I remember</a><br />
<a href="http://galeki.xy568.net/music/Eskimo.mp3">10. Eskimo</a></p>
</td>
</tr>
    '''
listname=ListName() #init parser
listname.feed(urls) #run parser
print listname.name
listname.close()

The result is:

['1. Delicate', '2. Volcano', "3. The Blower's Daughter", '4. Cannonball ', '5. Order Chests', '6. Amie', '7. Cheers Darling', '8. Cold water', '9. I remember', '10. Eskimo']

The failing example:

# coding=utf-8  
from sgmllib import SGMLParser #get SGML

class ListName(SGMLParser):#parser class
    def reset(self):
        self.is_a = False #get <a></a>
        self.name=[] #get text
        SGMLParser.reset(self)
    def start_a(self,attrs):
        self.is_a = True

    def end_a(self):
        self.is_a = False

    def handle_data(self,data):
        if self.is_a:
            self.name.append(data)

if __name__ == '__main__':
    urls='''
    <tr>
<td height="207" colspan="2" align="left" valign="top" class="normal">
<p>Damien Rice - 《0》 </p>
<a href="http://galeki.xy568.net/music/Delicate.mp3">1. Delicate</a><br/>
<a href="http://galeki.xy568.net/music/Volcano.mp3">2. Volcano</a><br/>
<a href="http://galeki.xy568.net/music/The Blower's Daughter.mp3">3. The Blower's Daughter</a><br/>
<a href="http://galeki.xy568.net/music/Cannonball.mp3">4. Cannonball </a><br/>
<a href="http://galeki.xy568.net/music/Older Chests.mp3">5. Order Chests</a><br/>
<a href="http://galeki.xy568.net/music/Amie.mp3">6. Amie</a><br/>
<a href="http://galeki.xy568.net/music/Cheers Darlin'.mp3">7. Cheers Darling</a><br/>
<a href="http://galeki.xy568.net/music/Cold Water.mp3">8. Cold water</a><br/>
<a href="http://galeki.xy568.net/music/I Remember.mp3">9. I remember</a><br/>
<a href="http://galeki.xy568.net/music/Eskimo.mp3">10. Eskimo</a></p>
</td>
</tr>
    '''
listname=ListName() #init parser
listname.feed(urls) #run parser
print listname.name
listname.close()

The result is:

['1. Delicate']

Wander Nauta · Accepted Answer · 2015-10-26T10:09:16.940

It may be that the <br/> bit is being parsed as a net-enabling start tag, so everything up to the next slash is treated as the element's content.

SGML has quite a few shortcuts ('minimizations') that allow you to omit parts of its syntax. NET, for 'null end tag', is one of those; it's a way to avoid writing an end tag altogether and use a slash instead. For example, in SGML, writing

<ISBN/0 201 17535 5/

is the same as writing

<ISBN>0 201 17535 5</ISBN>

Replacing every <br/> in your failing example with <br// gives the expected result.
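If the input can't be edited by hand, the same kind of substitution can be done as a pre-processing pass before calling feed(). A minimal sketch, assuming only attribute-less tags such as <br/> or <hr/> need fixing; it rewrites them to the spaced form that the question shows parsing correctly. The helper name add_space_before_slash is made up for illustration:

```python
import re

# Matches compact self-closing tags without attributes, e.g. <br/> or <hr/>.
# Tags that carry attributes are deliberately left alone in this sketch.
_COMPACT_SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/>")


def add_space_before_slash(html):
    """Rewrite <tag/> as <tag /> so the slash is not read as a NET start tag."""
    return _COMPACT_SELF_CLOSING.sub(r"<\1 />", html)


sample = '<a href="http://example.invalid/a.mp3">1. Delicate</a><br/>'
print(add_space_before_slash(sample))
```

The result would then be passed to listname.feed() in place of the raw string.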

That said, both of your inputs look like almost-valid HTML, not SGML. Have you considered using an HTML or XML parser instead? For example, with Beautiful Soup:

from bs4 import BeautifulSoup
urls = ...
soup = BeautifulSoup(urls, "html.parser")
print([a.get_text() for a in soup.find_all('a')])
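If installing a third-party package is not an option, the standard library's HTML parser should cope with both spellings as well. A rough Python 3 sketch using html.parser (the module is named HTMLParser in Python 2); the class name LinkTextParser and the shortened sample markup are placeholders, not taken from the question:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text found inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self.in_a = False   # True while we are between <a> and </a>
        self.names = []     # collected link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a:
            self.names.append(data)


# Mixes <br/> and <br /> on purpose; both are treated as empty elements.
html = '''
<a href="a.mp3">1. Delicate</a><br/>
<a href="b.mp3">2. Volcano</a><br />
<a href="c.mp3">3. The Blower's Daughter</a><br/>
'''

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.names)
# ['1. Delicate', '2. Volcano', "3. The Blower's Daughter"]
```

This mirrors the structure of the ListName class from the question, so porting the rest of the script should mostly be a matter of renaming the handler methods.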
