I am trying to do a simple parse of an HTML file that contains unit test results in its body:

import re
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('file:/randomstuff/results.txt').read()  # raw page contents
soup = BeautifulSoup(html, 'lxml')
save = soup.body.findAll(text=re.compile("failed"))

The best I can get out of this is 1 instance of the text (when there are closer to 50), using lxml and html5lib. The other parsers find none. Is there any way I can work around the broken HTML?

An example of the body text:

********* Finished testing of LogLevelTypeTest *********
********* Start testing of AppLoggerConfigTest *********
Config: Using QTest library 4.8.1, Qt 4.8.1
PASS : initTestCase
PASS : testSetFromEnvironment
PASS : cleanupTestCase
Totals: 3 passed, 0 failed, 0 skipped

The HTML looks like this:

<html>
   <head></head>
   <body>
   <pre style="word-wrap: break-word; white-space: pre-wrap;">
      "Common Unit Test Results"
      ...
      ...
   </pre>
 </body>

sf8193
  • We would need to take a look at the problematic HTML in order to help you. Please post the file so we can analyze it. – Haroldo_OK Sep 13 '17 at 18:35
  • What do you intend to do with the text, BTW? – Haroldo_OK Sep 13 '17 at 18:39
  • @Haroldo_OK I intend to add them up and see how many cases were passed and how many failed. It is inside of the html body – sf8193 Sep 13 '17 at 18:41
  • Definitely a case for regex; maybe something like `r' (\d+) failed,'`; take a look at https://docs.python.org/2/library/re.html. Perhaps you could still use BeautifulSoup to extract the contents of the `<pre>` tag before applying the regex. – Haroldo_OK Sep 13 '17 at 18:44
  • so when I do soup.prettify() I see the text fine, it is just parsing that text that doesn't seem to work – sf8193 Sep 13 '17 at 18:49
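
Following the regex suggestion in the comments, here is a minimal sketch of how the totals might be added up. It assumes every suite ends with a summary line in the same `Totals: N passed, N failed, N skipped` shape shown in the question. The likely reason the search finds only one hit is that the whole `<pre>` body is a single text node, so a text search returns that one node however many times "failed" occurs inside it. The names `TOTALS_RE` and `sum_totals` are made up for illustration, and the second `Totals:` line in the sample is hypothetical.

```python
import re

# Matches QTest summary lines such as "Totals: 3 passed, 0 failed, 0 skipped".
TOTALS_RE = re.compile(r"Totals:\s*(\d+) passed,\s*(\d+) failed,\s*(\d+) skipped")


def sum_totals(text):
    """Add up the passed/failed/skipped counts from every Totals line in text."""
    passed = failed = skipped = 0
    for match in TOTALS_RE.finditer(text):
        p, f, s = (int(group) for group in match.groups())
        passed += p
        failed += f
        skipped += s
    return {"passed": passed, "failed": failed, "skipped": skipped}


# First block is the sample from the question; the last line is a
# hypothetical second summary, added only to show the summing.
sample = """\
********* Start testing of AppLoggerConfigTest *********
Config: Using QTest library 4.8.1, Qt 4.8.1
PASS : initTestCase
PASS : testSetFromEnvironment
PASS : cleanupTestCase
Totals: 3 passed, 0 failed, 0 skipped
Totals: 4 passed, 2 failed, 1 skipped
"""

print(sum_totals(sample))
```

The text passed in could be the raw file contents or whatever `soup.body.get_text()` returns; since the regex only looks at the summary lines, the broken markup around them should not matter.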
