
I want to use Python to parse HTML markup, and given one of the resultant DOM tree elements, get the start and end offsets of that element within the original, unmodified markup.

For example, given the HTML markup (with \n EOL chars)

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.expected.resource"/>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>

(example with BeautifulSoup, but I'm not attached to any parser in particular)

>>> soup = bs4.BeautifulSoup(html_markup)
>>> title_tag = soup.find('title')
>>> get_offsets_in_markup(title_tag)  # <-------- how do I go about doing this?
(109, 139)  # <----- source mapping info I want to get
>>> html_markup[109:139]
'<title>No Longer Human</title>'

I don't see this functionality in the APIs of any of the Python HTML parsers available. Can I hack it into one of the existing parsers? How would I go about doing that? Or is there another, better approach?

I realize that str(soup_element) serializes the element back into markup, so I could in principle recurse down the tree and record the start and end indices as I go. But the serialized markup, although semantically equivalent to the original, doesn't match it char-for-char, and none of the available Python parsers reproduces the original markup exactly.

midrare
  • Might this be a case of the [XY Problem](https://xyproblem.info) ? – AMC Sep 26 '21 at 04:34
  • @AMC The context is that I have a bunch of ebook annotations, for each of which I need to find the corresponding XPath. Each annotation consists of the start and end byte offsets within the ebook's HTML markup (each ebook is represented as one long bytestring). I have no other data about the annotations, except for their start and end offsets. – midrare Sep 26 '21 at 04:51
  • Ouch, that sounds tough, good luck! – AMC Oct 08 '21 at 16:20
  • BeautifulSoup can give you the line offsets through [line numbers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#line-numbers), available only with `html.parser` and `html5lib` – diggusbickus Oct 08 '21 at 16:33

1 Answer


You can use a regular expression to find the element's start and end indexes, then use those indexes to slice the original string. Note that this only works when the serialized tag appears verbatim in the original markup:

import re
from bs4 import BeautifulSoup
from pathlib import Path

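def get_offsets_in_markup(tag, html_markup):
    # Escape the serialized tag so regex metacharacters match literally.
    elem = re.search(re.escape(str(tag)), html_markup)
    if elem is None:
        return None  # serialized tag does not appear verbatim in the markup
    return elem.start(), elem.end()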

html_markup = Path('test.html').read_text()
soup = BeautifulSoup(html_markup, 'lxml')

title_tag = soup.find('title')

indexes = get_offsets_in_markup(title_tag, html_markup)
# -> (109, 139)
given_text = html_markup[indexes[0]:indexes[1]]
# -> <title>No Longer Human</title>

This is what test.html looks like:

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-gb">
  <head>
    <title>No Longer Human</title>
    <meta content="urn:uuid:6757faf0-eef1-45d9-b2b3-7462350db7ba" name="Adept.e$
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <link href="kindle:flow:0002?mime=text/css" rel="stylesheet" type="text/css"/>
  <link href="kindle:flow:0001?mime=text/css" rel="stylesheet" type="text/css"/>
  </head>
  <body class="calibre" aid="0">
  </body>
</html>
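
As an alternative that does not depend on the serialized tag matching the original, here is a minimal sketch built on the standard library's html.parser.HTMLParser, whose getpos() method reports the line and column of the tag currently being handled. The class name OffsetParser and its spans attribute are hypothetical names chosen for illustration. The sketch assumes reasonably well-formed markup: void elements written without a trailing slash (such as <br>) get no span, and the offsets are character offsets into the str, which differ from byte offsets when the markup contains non-ASCII characters.

```python
from html.parser import HTMLParser


class OffsetParser(HTMLParser):
    """Hypothetical helper: records (tag, start, end) offsets per element."""

    def __init__(self, markup):
        super().__init__()
        self.markup = markup
        # Absolute offset of the first character of each line. HTMLParser
        # counts only "\n" as a line break, so split on exactly that.
        self.line_starts = [0]
        for line in markup.split("\n"):
            self.line_starts.append(self.line_starts[-1] + len(line) + 1)
        self.open_tags = []  # stack of (tag, start_offset)
        self.spans = []      # completed (tag, start, end) triples
        self.feed(markup)
        self.close()

    def _offset(self):
        # getpos() returns a 1-based line number and a 0-based column.
        line, col = self.getpos()
        return self.line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self._offset()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <meta ... />: the element is the tag itself.
        start = self._offset()
        self.spans.append((tag, start, start + len(self.get_starttag_text())))

    def handle_endtag(self, tag):
        # Ignore stray end tags that have no matching start tag.
        if not any(open_tag == tag for open_tag, _ in self.open_tags):
            return
        # The element ends just past the ">" that closes the end tag.
        end = self.markup.index(">", self._offset()) + 1
        # Pop back to the matching start tag; unclosed tags in between are dropped.
        while self.open_tags:
            open_tag, start = self.open_tags.pop()
            if open_tag == tag:
                self.spans.append((tag, start, end))
                break


markup = "<html>\n  <head>\n    <title>No Longer Human</title>\n  </head>\n</html>\n"
for tag, start, end in OffsetParser(markup).spans:
    # Slicing the original string returns the element exactly as written.
    print(tag, start, end, repr(markup[start:end]))
```

Because the spans index into the original string, slicing with them returns the source text unchanged, including its original whitespace and attribute order. Mapping a span back to a BeautifulSoup element would still need a separate step, for example comparing it with the element's source line and position when the html.parser backend is used.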
Rustam Garayev