10

Here is the HTML:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><META http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>

    <div bgcolor="#48486c">

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" background="http://title.jpg" height="130">

            <tr height="129">

                <td width="719" height="129"></td>

                <td width="1" height="129"></td>

            </tr>

            <tr height="1">

                <td width="720" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

        <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" height="203">

            <tr height="20">

                <td width="719" height="20"></td>

                <td width="1" height="20"></td>

            </tr>

            <tr height="69">

                <td width="719" height="69" valign="top" align="left">

                    <table width="719" border="1" cellspacing="2" cellpadding="0">

                        <tr>

                            <td bgcolor="a5fdf8" width="390"><b>Stream Name</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Status</b></td>

                            <td bgcolor="a5fdf8" width="61"><b>Duration</b></td>

                            <td bgcolor="a5fdf8" width="185"><b>Start</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="390">c:\streams\ours\Sony_AVCHD_<WBR>Test_Discs_60Hz_00001.m2ts</td>

                            <td width="61"><font color="#D0D0D0">----</font></td>

                            <td width="61">00:00:02</td>

                            <td width="185">2010/06/15-15:06:17</td>

                        </tr>

                    </table>

                </td>

                <td width="1" height="69"></td>

            </tr>

            <tr height="113">

                <td width="720" height="113" colspan="2" valign="top" align="left">

                    <table width="721" border="1" cellspacing="2" cellpadding="0">

                        <tr bgcolor="a5fdf8">

                            <td width="299"><b>Test Category</b></td>

                            <td width="61"><b>Error</b></td>

                            <td width="62"><b>Warning</b></td>

                            <td width="275"><b>Details</b></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">All Tests (Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ETSI TR-101-290 Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  ISO/IEC Transport Stream Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">  System Data T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">  Prog(1)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    VES(0xe0)</font></td>

                            <td width="61"><font color="#ff0000">34787</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      H.264/AVC Conformance</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_Conf.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Sequence</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Picture</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Slice</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Macroblock</font></td>

                            <td width="61"><font color="#ff0000">34718</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        Block</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#1010F0">      HRD Tests</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275">

                                <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_HRD.txt</font></a><br>

                            </td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#101010">        HRD level</font></td>

                            <td width="61"><font color="#ff0000">69</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Video T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#099eac">    AES(0xfd)</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="61"><font color="#000000">0</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#808080">      Audio Level Tests</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="61"><font color="#808080">Disabled</font></td>

                            <td width="275"></td>

                        </tr>

                        <tr bgcolor="white">

                            <td width="299"><font color="#800000">      Audio T-STD Tests</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="61"><font color="#800000">No Lic</font></td>

                            <td width="275"></td>

                        </tr>

                    </table>

                </td>

            </tr>

            <tr height="1">

                <td width="719" height="1"></td>

                <td width="1" height="1"></td>

            </tr>

        </table>

    </div>



</body></html>

Is there a Python library that can do this?

Thanks

zjm1126

3 Answers

13

BeautifulSoup gets you almost all the way there:

>>> import BeautifulSoup
>>> f = open('a.html')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> f.close()
>>> g = open('a.xml', 'w')
>>> print >> g, soup.prettify()
>>> g.close()

This closes all tags properly. The only issue remaining is that the doctype remains HTML -- to change that into the doctype of your choice, you only need to change the first line, which is not hard, e.g., instead of printing the prettified text directly,

>>> lines = soup.prettify().splitlines()
>>> lines[0] = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
...             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
>>> print >> g, '\n'.join(lines)
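The same first-line swap can be sketched as a small standalone function that needs no third-party library. The names `swap_doctype` and `XHTML_TRANSITIONAL` are made up for this illustration, and the sketch assumes the doctype, if present, sits entirely on the first line of the markup (as it does in prettified output):

```python
# Public and system identifiers must be separated by whitespace,
# hence the trailing space inside the first literal.
XHTML_TRANSITIONAL = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def swap_doctype(markup, doctype=XHTML_TRANSITIONAL):
    """Replace a first-line doctype, or prepend one if none is found."""
    lines = markup.splitlines()
    if lines and lines[0].lstrip().upper().startswith('<!DOCTYPE'):
        lines[0] = doctype
    else:
        lines.insert(0, doctype)
    return '\n'.join(lines)
```

Passing `soup.prettify()` as `markup` and writing the return value to the output file would replace the two manual steps above.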
Alex Martelli
12

lxml works well:

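from lxml import html, etree

# Parse the (possibly malformed) HTML; the file is closed automatically.
with open('a.html', 'rb') as f:
    doc = html.fromstring(f.read())

# etree.tostring() serializes as XML and returns bytes, hence mode 'wb'.
with open('a.xhtml', 'wb') as out:
    out.write(etree.tostring(doc))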
Ian Bicking
  • lxml has many files, and I can't find the etree method in it – zjm1126 Jun 25 '10 at 06:16
  • 2
    lxml.etree is a module, lxml.etree.tostring is a function. Maybe you have an installation problem? – Ian Bicking Jun 25 '10 at 17:48
  • This solution works perfectly for autogenerated HTML pages (e.g. Javadoc for Android, when parsing methods and variables). BeautifulSoup's soup.prettify() adds unwanted whitespace; this solution doesn't add extra spaces where you don't want them. – Pradeep Singh Aug 25 '20 at 17:39
0

To piggyback off @Alex Martelli's answer: Python's standard library ships an xml package (xml.etree.ElementTree has been part of it since Python 2.5):

https://docs.python.org/3.6/library/xml.html

You could strip all HTML tags off, then format the content as XML with the built-in xml library instead of bringing in another dependency. This is only advisable if you trust the source of the XML, because the standard-library XML parsers are not secure against maliciously constructed data.
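One way to read this suggestion, as a rough sketch rather than a tested converter: let the standard library's `html.parser` tokenize the HTML and mirror its events into an `xml.etree.ElementTree` tree, which then serializes as well-formed XML. The class `TreeBuilder`, the function `html_to_xml`, and the `VOID` set are hypothetical names introduced here. The sketch assumes Python 3, does no validation of attribute names, and is far less forgiving of broken markup than BeautifulSoup or lxml.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
        'link', 'meta', 'param', 'source', 'track', 'wbr'}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: mirrors HTML parser events into an ElementTree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None
        self.stack = []  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; XML requires a value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        if self.stack:
            elem = ET.SubElement(self.stack[-1], tag, attrib)
        elif self.root is None:
            elem = self.root = ET.Element(tag, attrib)
        else:
            return  # ignore markup after the root element has closed
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; skip stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):  # text after a child element belongs in its tail
            last = parent[-1]
            last.tail = (last.tail or '') + data
        else:
            parent.text = (parent.text or '') + data


def html_to_xml(markup):
    """Return the HTML in `markup` re-serialized as an XML string."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return ET.tostring(builder.root, encoding='unicode')
```

For the report in the question, `html_to_xml(open('a.html').read())` should yield a string in which the unclosed `<META>`, `<WBR>` and `<br>` tags become self-closing elements. The doctype is dropped, since `HTMLParser` ignores declarations by default.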

wski