How to check which line from HTML triggers error?

Question

I have the following code that removes duplicate paragraphs from an HTML file.

from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

elms = []
for elem in soup.find_all('font'):
    if elem not in elms:
        elms.append(elem)
    else:
        target = elem.findParent().findParent()
        target.decompose()
print(soup.html)

It almost works, but for some elements I get this error:

AttributeError: 'NoneType' object has no attribute 'findParent'

Is there a way to print the line number in the HTML file where the error happens, so I can check how that element is formatted?

The structure of the elements that the code handles without issues looks like this:

<!DOCTYPE html>
<html>
  <body>
      <p align="left">
        <b><font face="Times New Roman" size="5" color="red">Some text</font></b> 
      </p>
  </body>
</html>

But since the file is fairly large, I haven't identified the structure of the elements where the code gets stuck.

I would recommend using a context manager to handle file objects. — AMC, Mar 04 '20 at 20:58
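
To illustrate the comment above, here is a minimal sketch of opening the file with a context manager. The temporary file is a hypothetical stand-in for Input.html so the snippet is self-contained, and html.parser is used here only so it runs without html5lib installed; neither detail comes from the question.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for Input.html so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "Input.html")
with open(path, "w", encoding="utf-8") as out:
    out.write("<html><body><p><b><font>Some text</font></b></p></body></html>")

# The with-block closes the file once parsing is done,
# even if BeautifulSoup raises while reading it.
with open(path, "rb") as fp:
    soup = BeautifulSoup(fp, "html.parser")

print(fp.closed)
print(soup.find("font").string)
```

The parsed soup stays usable after the block ends, because BeautifulSoup reads the whole file while it is still open.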

score 1 · Answer 1 · answered Mar 04 '20 at 20:43

Since you're using the html5lib parser, you have access to the line number if you're using BeautifulSoup version 4.8.1 or higher, as described in the docs:

The html.parser and html5lib parsers can keep track of where in the original document each Tag was found. You can access this information as Tag.sourceline (line number) and Tag.sourcepos (position of the start tag within a line) […]

In your example you can easily access this information:

from bs4 import BeautifulSoup

html = """<!DOCTYPE html>
<html>
  <body>
      <p align="left">
        <b><font face="Times New Roman" size="5" color="red">Some text</font></b> 
      </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html5lib")

for elem in soup.find_all('font'):
    print(elem.sourceline, elem.sourcepos, elem.string)

This will output 5 60 Some text, where the first number is the line number.

If a call can return None, you should handle that case before calling a method on the result. So instead of doing this:

target = elem.findParent().findParent()

you can first check whether the first findParent() call returned a result, and only then make the second call, e.g.:

target = elem.findParent()
err_line, err_source, err_str = elem.sourceline, elem.sourcepos, elem.string  # read from elem: target may be None here
if target:
    target = target.findParent()
else:
    print(f"Error near line {err_line} ({err_source}). Last good text: {err_str}")
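
Applied to the whole loop, the same idea might look like the sketch below. It rests on some assumptions that are not confirmed in the question: that the None appears because an earlier decompose() already destroyed a later duplicate inside the same grandparent, that comparing tags by their markup string is an acceptable substitute for Tag equality, and that html.parser can stand in for html5lib (used here only so the snippet runs without the extra dependency). The sample HTML and the names fonts and unreachable are hypothetical. Positions are recorded before the tree is modified, because a destroyed tag no longer carries its sourceline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first <p> holds three identical <font> tags.
html = """<html>
<body>
<p><b><font color="red">Some text</font></b><b><font color="red">Some text</font></b><b><font color="red">Some text</font></b></p>
<p><b><font color="red">Other text</font></b></p>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Record markup and position up front, while every tag is still intact.
fonts = [(elem, str(elem), elem.sourceline, elem.sourcepos)
         for elem in soup.find_all("font")]

seen = set()
unreachable = []  # (line, position) of duplicates whose grandparent is missing
for elem, markup, line, pos in fonts:
    if markup not in seen:
        seen.add(markup)
        continue
    # Guard every step up the tree instead of chaining the calls.
    parent = getattr(elem, "parent", None)
    target = parent.parent if parent is not None else None
    if target is None:
        unreachable.append((line, pos))
        print(f"No grandparent for <font> at line {line}, position {pos}")
        continue
    target.decompose()

print(soup)
```

If a line is reported this way, the tag on that line was most likely removed together with an earlier duplicate, so there is nothing left to decompose for it.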

Thanks for your answer. I've tried your code. The first part prints the sourceline, sourcepos and string correctly. The second snippet, meant to show which line of the HTML file triggers the error, doesn't print anything. Playing around with your code and my code (the code in the original post), I see that the error appears only when the line `target.decompose()` is present. If this line is removed, no error is shown. What else can be done to detect where the error happens? Thanks — Ger Cas, Mar 05 '20 at 01:09
I guess that your `target` object -- right before `target.decompose()` -- is already a `NoneType` object? The presented code was only an example to show that you have to do this **before every soup-method**, even before my presented `target.findParent()` in the *if-clause*. I haven't done it in line 4 of my last code snippet, because I had to assume where exactly the error occurs in the code. — colidyre, Mar 05 '20 at 04:16
