I have the following sample table from an HTML file.

<table>
    <tr>
        <th>Class</th>
        <th class="failed">Fail</th>
        <th class="failed">Error</th>
        <th>Skip</th>
        <th>Success</th>
        <th>Total</th>
    </tr>
        <tr>
            <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
            <td class="failed">1</td>
            <td class="failed">9</td>
            <td>0</td>
            <td>219</td>
            <td>229</td>
        </tr>
    <tr>
        <td><strong>Total</strong></td>
        <td class="failed">1</td>
        <td class="failed">9</td>
        <td>0</td>
        <td>219</td>
        <td>229</td>
    </tr>
</table>

I am trying to print the text from the <td> tags in the row whose first <td> contains: Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2

I do not want to include the text from the <td> tags in the row that starts with:

<td><strong>Total</strong></td>

My code is printing the text from every single <td> tag:

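from bs4 import BeautifulSoup


def extract_data_from_report():
    # Python 2 syntax (print statements)
    with open(r"E:\SeleniumTestReport.html", 'r') as report_file:
        html_report = report_file.read()
    soup = BeautifulSoup(html_report, "html.parser")
    th = soup.find_all('th')
    td = soup.find_all('td')  # matches every <td>, including the "Total" row

    # Trailing commas keep the printed items on one line
    for item in th:
        print item.text,
    print "\n"
    for item in td:
        print item.text,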

My desired output:

Class               Fail Error Skip Success Total 
Regression_TestCase 1    9     0    219     229 
Mr Lister
Riaz Ladhani

1 Answer

You can find all rows (tr elements) and skip the first one (the headers) and the last one (the "total" row). Here is a sample implementation that produces a list of dictionaries, one per remaining row:

from pprint import pprint

from bs4 import BeautifulSoup


data = """
<table>
    <tr>
        <th>Class</th>
        <th class="failed">Fail</th>
        <th class="failed">Error</th>
        <th>Skip</th>
        <th>Success</th>
        <th>Total</th>
    </tr>
        <tr>
            <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
            <td class="failed">1</td>
            <td class="failed">9</td>
            <td>0</td>
            <td>219</td>
            <td>229</td>
        </tr>
    <tr>
        <td><strong>Total</strong></td>
        <td class="failed">1</td>
        <td class="failed">9</td>
        <td>0</td>
        <td>219</td>
        <td>229</td>
    </tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

headers = [header.get_text(strip=True) for header in soup.find_all("th")]
rows = [dict(zip(headers, [td.get_text(strip=True) for td in row.find_all("td")]))
        for row in soup.find_all("tr")[1:-1]]

pprint(rows)

Prints:

[{u'Class': u'Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2',
  u'Error': u'9',
  u'Fail': u'1',
  u'Skip': u'0',
  u'Success': u'219',
  u'Total': u'229'}]
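
The [1:-1] slice relies on the "total" row always being last. If that is not guaranteed, a possible variant is to skip rows by content instead of by position. This is only a sketch: it assumes the summary row is the only one whose first cell contains a <strong> element, as in the sample table, and the shortened table in the html variable is a stand-in for the real report.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the report table (assumption: same structure)
html = """
<table>
    <tr><th>Class</th><th class="failed">Fail</th><th>Total</th></tr>
    <tr><td>Regression_TestCase.RegressionProject_TestCase2</td>
        <td class="failed">1</td><td>229</td></tr>
    <tr><td><strong>Total</strong></td>
        <td class="failed">1</td><td>229</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if not cells:
        continue  # the header row has only <th> cells
    if cells[0].find("strong") is not None:
        continue  # summary row: its first cell wraps the label in <strong>
    rows.append(dict(zip(headers, [c.get_text(strip=True) for c in cells])))

print(rows)
```

The header row drops out on its own because it has no <td> cells, so no slicing is needed at either end.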
alecxe
  • That's great, thank you. How can I put all the header text on one line and the numbers on the next line, with each number below its header? E.g. 9 should be underneath Error and 1 should be underneath Fail. – Riaz Ladhani May 14 '16 at 22:06
  • @RiazLadhani okay, could you provide an example desired output? Thanks. – alecxe May 14 '16 at 22:11
  • Please see my question. I have amended the desired output at the end of my question. Thanks. – Riaz Ladhani May 14 '16 at 22:17
  • @RiazLadhani ah, got it. I think you should look into pretty-printing in tabular form: http://stackoverflow.com/questions/9535954/python-printing-lists-as-tabular-data, or use a pandas DataFrame, which can also be printed as a table. Hope that helps. – alecxe May 14 '16 at 22:19
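
For the aligned layout asked about in the comments, one option that needs no extra library is to pad every cell to the width of its column with str.ljust. The sketch below starts from the headers and rows values produced in the answer (written out literally here so it runs on its own). format_table is a hypothetical helper name, and cutting the class name at the first dot is an assumption taken from the desired output in the question.

```python
headers = ["Class", "Fail", "Error", "Skip", "Success", "Total"]
rows = [{
    "Class": "Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2",
    "Fail": "1", "Error": "9", "Skip": "0", "Success": "219", "Total": "229",
}]


def format_table(headers, rows):
    """Return the header line and data lines with left-aligned columns."""
    table = [headers] + [[row[h] for h in headers] for row in rows]
    # Each column is as wide as its longest cell
    widths = [max(len(line[i]) for line in table) for i in range(len(headers))]
    return "\n".join(
        " ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in table
    )


# Keep only the part of the class name before the first dot (assumption)
short_rows = [dict(row, Class=row["Class"].split(".")[0]) for row in rows]

output = format_table(headers, short_rows)
print(output)
```

Because the widths are computed from the data, each number lands directly under its header regardless of how long the class name is.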