None of the parsers find all matches with Beautiful Soup (Python)

Asked Sep 13 '17 at 18:33

Active Sep 13 '17 at 18:40

Viewed 55 times

I am trying to do some simple parsing of an HTML file that contains unit test results in its body:

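import re
import urllib2  # Python 2 module; the Python 3 equivalent is urllib.request
from bs4 import BeautifulSoup
html = urllib2.urlopen('file:/randomstuff/results.txt').read()  # raw file contents
soup = BeautifulSoup(html, 'lxml')
save = soup.body.findAll(text=re.compile("failed"))  # text nodes containing "failed"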

The best I can get out of this is one instance of the text (when there are closer to 50) with lxml and html5lib. The other parsers find none. Is there any way I can work around the broken HTML?

An example of the body is this:

********* Finished testing of LogLevelTypeTest *********
********* Start testing of AppLoggerConfigTest *********
Config: Using QTest library 4.8.1, Qt 4.8.1
PASS : initTestCase
PASS : testSetFromEnvironment
PASS : cleanupTestCase
Totals: 3 passed, 0 failed, 0 skipped

The HTML looks like this:

<html>
   <head></head>
   <body>
   <pre style="word-wrap: break-word; white-space: pre-wrap;">
      "Common Unit Test Results"
      ...
      ...
   </pre>
 </body>

edited Sep 13 '17 at 18:40

sf8193

We would need to take a look at the problematic HTML in order to help you. Please, post the file, so we can analyze it. – Haroldo_OK Sep 13 '17 at 18:35
What do you intend to do with the text, BTW? – Haroldo_OK Sep 13 '17 at 18:39
@Haroldo_OK I intend to add them up and see how many cases were passed and how many failed. It is inside of the html body – sf8193 Sep 13 '17 at 18:41
Definitely a case for regex; maybe something like `r' (\d+) failed,'`; take a look at https://docs.python.org/2/library/re.html; perhaps you could still use BeautifulSoup for extracting the contents of the `<pre>` tag before using the regex. – Haroldo_OK Sep 13 '17 at 18:44
So when I do soup.prettify() I see the text fine; it is just parsing that text that doesn't seem to work. – sf8193 Sep 13 '17 at 18:49
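
The approach suggested in the comments could look roughly like the sketch below. It assumes that every summary line has the form "Totals: N passed, N failed, N skipped", as in the sample shown in the question, and that all results sit inside a single <pre> element. If so, that would probably also explain the single hit: the whole <pre> body is one text node, and findAll(text=...) returns each matching node only once, however many times "failed" occurs inside it. The names tally_totals and pre_text are hypothetical helpers, and the second test suite in the sample data is made up for illustration.

```python
import re

# Matches summary lines such as "Totals: 3 passed, 0 failed, 0 skipped".
TOTALS_RE = re.compile(r"Totals:\s*(\d+) passed,\s*(\d+) failed,\s*(\d+) skipped")


def tally_totals(text):
    """Sum the passed/failed/skipped counts over every 'Totals:' line in text."""
    passed = failed = skipped = 0
    for p, f, s in TOTALS_RE.findall(text):
        passed += int(p)
        failed += int(f)
        skipped += int(s)
    return {"passed": passed, "failed": failed, "skipped": skipped}


def pre_text(html, parser="html.parser"):
    """Return the text of the first <pre> element, or '' if there is none."""
    # bs4 is third-party; it is imported here so tally_totals works without it.
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, parser)
    return soup.pre.get_text() if soup.pre else ""


# Illustrative data: the first suite is from the question, the second is invented.
sample = """\
********* Start testing of AppLoggerConfigTest *********
PASS : initTestCase
Totals: 3 passed, 0 failed, 0 skipped
********* Start testing of OtherTest *********
Totals: 2 passed, 1 failed, 0 skipped
"""

print(tally_totals(sample))  # {'passed': 5, 'failed': 1, 'skipped': 0}
```

With the real file, something like tally_totals(pre_text(html)) would combine the two steps. Since the file appears to be mostly plain text, running tally_totals on the raw contents directly may work just as well.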

0 Answers