Scrape a form on a malformed web page

Question

I'm trying to scrape an HTML form using RoboBrowser with Python 3.4. I use the default HTML parser:

self._browser = RoboBrowser(history=True, parser="html.parser")

It works fine for well-formed pages, but now I have to parse a page with invalid markup. Here is the HTML fragment:

<form method="post"  action="decide.php?act=submit_advance">
    <table  class="td_advanced">
    <tr class="td_advance">
    <td colspan="4" class="td_advance"></strong><br></td>
    <td colspan="3" class="td_left">Case sensitive:<br><br></td>
    <td><input type="checkbox" name="case_sensitive" /><br><br></td>
[...]
</form>

The closing strong tag is invalid (it has no matching opening tag). This error prevents the parser from reading any of the inputs that follow it:

form = self._browser.get_form()
print(form)
>>> <RoboForm>

Any suggestions?

If it's a bug in robobrowser, you can submit an issue on github. https://github.com/jmcarp/robobrowser — Håken Lid, May 14 '16 at 11:58
I think beautifulsoup is supposed to handle tag soup, so that would be an option to consider. — Ecko, May 14 '16 at 12:03

score 0 · Answer 1 · answered May 15 '16 at 05:59

I found the solution myself. The comment about BeautifulSoup was helpful and pointed my search in the right direction.

The solution is to use another HTML parser. I tried lxml and it works for me.

self._browser = RoboBrowser(history=True, parser="lxml")

PyPI did not have a working lxml installer for my Python version at the time, so I downloaded a prebuilt package from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
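
To see which parser copes with a given page before wiring it into RoboBrowser, a sketch like the following may help. It assumes beautifulsoup4 is installed (RoboBrowser builds on it). The helper names `available_parsers` and `form_input_names` and the shortened fragment are made up for illustration. The results can differ between parser versions, so no expected output is shown.

```python
from importlib.util import find_spec

# Parser names accepted by BeautifulSoup, mapped to the module each one needs.
# "html.parser" ships with Python, so it needs no extra module.
PARSER_MODULES = {"lxml": "lxml", "html5lib": "html5lib", "html.parser": None}


def available_parsers():
    """Return the parser names whose backing module is installed."""
    return [
        name
        for name, module in PARSER_MODULES.items()
        if module is None or find_spec(module) is not None
    ]


def form_input_names(html, parser):
    """Parse html with the given parser and list input names in the first form."""
    from bs4 import BeautifulSoup  # imported lazily; requires beautifulsoup4

    form = BeautifulSoup(html, parser).find("form")
    if form is None:
        return []
    return [tag.get("name") for tag in form.find_all("input")]


# Shortened, hypothetical version of the fragment from the question.
FRAGMENT = """
<form method="post" action="decide.php?act=submit_advance">
  <table class="td_advanced">
    <tr class="td_advance">
      <td colspan="4" class="td_advance"></strong><br></td>
      <td colspan="3" class="td_left">Case sensitive:<br><br></td>
      <td><input type="checkbox" name="case_sensitive" /><br><br></td>
    </tr>
  </table>
</form>
"""

if __name__ == "__main__" and find_spec("bs4") is not None:
    # Print what each installed parser finds inside the form.
    for parser in available_parsers():
        print(parser, form_input_names(FRAGMENT, parser))
```

Whichever parser lists `case_sensitive` for the real page is a reasonable candidate to pass as `parser=...` to RoboBrowser.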
