Extracting text from damaged HTML?

Question

DRM is a plague even in the book industry. Last week I discovered that many of my Kindle annotations were missing because a publisher saw fit to limit annotations to 10% of the book.

I've found tools for converting the Mobi book file to HTML. I've also used the location data (thankfully this wasn't missing) to extract the appropriate chunks of raw HTML. My problem now is that I have a lot of incomplete markup to deal with.

Example:

></h1><div height="3em"></div> <p height="0em" width="1em" align="justify"><em>A Pocket Mirror for Heroes</em> is a book of stratagems for reaching excellence in a competitive world ruled by appearances and, often, deceit.</p><div height="0em"></div> <p height="0em" width="1em" align="justify">It is a <em>mirror</em> because it reflects &#x201C;the person you are or the one you ought to be.&#x201D; A <em>pocket</em> mirror because its author took the time to be brief. A mirror for <em>heroes</em> because it provides a vivid image of ethical and moral perfection. For the author, a hero is &#x201C;the consummate person, ripe and perfect: accurate in judgment, mature in taste, attentive in listening, wise in sayings, shrewd in deeds, the cente

This is because the location data in Kindle only corresponds to 150-byte chunks of HTML data, so the chunk boundaries are imprecise and can fall in the middle of a tag or a word.

I'd like to clean this up. Does anyone have any suggestions? I'd prefer to use Python if possible.

Edit: It might also make sense to use a tool that takes character offsets and works out how to extract something legible from that range. Does something like that exist?

This was a poorly chosen example. But there are more extensive passages where many tags are unclosed or cut off partway through an attribute. I'll update the post. – veta, Jun 25 '15 at 06:23

score 2 · Accepted Answer · answered Jun 25 '15 at 06:19


BeautifulSoup can parse malformed HTML and it's pretty robust.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Para 1<p>Para 2<blockquote>Quote 1<blockquote>Quote 2"
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(soup.prettify())
<p>
 Para 1
 <p>
  Para 2
  <blockquote>
   Quote 1
   <blockquote>
    Quote 2
   </blockquote>
  </blockquote>
 </p>
</p>
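
To get plain text out of a chunk rather than a repaired tree, `get_text()` is probably enough. A minimal sketch, assuming the `html.parser` backend and a shortened, made-up `chunk` modeled on the example in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical chunk modeled on the question's example: it begins with the
# tail of a tag and ends in the middle of a sentence.
chunk = (
    '></h1><div height="3em"></div> <p height="0em" align="justify">'
    '<em>A Pocket Mirror for Heroes</em> is a book of stratagems.</p>'
    '<p>It is a <em>mirror</em> because it reflects &#x201C;the person you are'
)

soup = BeautifulSoup(chunk, "html.parser")

# Join the text nodes with single spaces and drop surrounding whitespace.
# Character references such as &#x201C; are decoded along the way.
text = soup.get_text(" ", strip=True)
print(text)
```

Unclosed tags and unmatched closing tags are tolerated, but whatever is left of a tag cut off at the chunk boundary (here the leading `>`) comes through as literal text and needs separate handling.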


fferri


Yes but isn't that technically properly formed HTML as per the HTML documentation? That stuff slides in HTML but not XHTML. In any case it's a good idea and I'll try it out. I will likely need something else to do additional munging. – veta Jun 25 '15 at 06:26
Maybe it's robust, but it doesn't seem to care about HTML rules. P inside another? – Sami Kuhmonen Jun 25 '15 at 06:28
Not the implicit ones (like "any open `<p>` will implicitly close any previously open `<p>`"). But it will be able to parse broken HTML (even chunks of it), that's what's important. – fferri Jun 25 '15 at 06:30
This has worked surprisingly well! I still need to resolve cases where attributes are being interpreted as text, though. My thinking is that I can check whether I encounter a ">" before a "<" and, if so, backtrack until I find the opening "<". BeautifulSoup might be able to take it from there. – veta Jun 25 '15 at 06:46
I've done as well as I can; it looks great now. Thank you mescalinum and BS4 :) – veta Jun 25 '15 at 07:22
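
The backtracking idea from the comments can be sketched roughly as follows. This is only an illustration, not the code veta ended up with: `widen_to_tag_boundaries` is a hypothetical helper, and it assumes the full converted HTML is available as one string, that the offsets are character offsets into it, and that `<` and `>` appear only as tag delimiters (literal ones being escaped as entities).

```python
def widen_to_tag_boundaries(html: str, start: int, end: int) -> str:
    """Return html[start:end], widened so it does not begin or end inside a tag.

    Hypothetical helper; assumes "<" and ">" occur only as tag delimiters.
    """
    chunk = html[start:end]

    # A ">" before any "<" means the slice begins inside a tag:
    # backtrack to the "<" that opened it.
    first_lt = chunk.find("<")
    first_gt = chunk.find(">")
    if first_gt != -1 and (first_lt == -1 or first_gt < first_lt):
        opening = html.rfind("<", 0, start)
        if opening != -1:
            start = opening

    # A "<" with no ">" after it means the slice ends inside a tag:
    # extend forward to the ">" that closes it.
    last_lt = chunk.rfind("<")
    last_gt = chunk.rfind(">")
    if last_lt > last_gt:
        closing = html.find(">", end)
        if closing != -1:
            end = closing + 1

    return html[start:end]


html = '<h1 class="title">Intro</h1><p align="justify">Some <em>text</em> here.</p>'

# Slice that starts inside the <h1 ...> tag and stops inside the <p ...> tag.
ragged = widen_to_tag_boundaries(html, html.index("title"), html.index("align") + 2)

# Slice that already falls on clean boundaries.
clean = widen_to_tag_boundaries(html, html.index("Some"), html.index(" here"))
```

The widened string can then be handed to BeautifulSoup as usual. A slice that lies entirely inside one tag contains neither delimiter and is returned unchanged, so that case would need an extra check.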
